library(tidyverse)
library(here)
library(visdat)
The data set world_map2
, a derivative and update of ggplot2::map_data("world")
, was created for easier mapping of Global Studies data in the Tidyverse. To not solely rely on matching country or region names, which can vary considerably per the data source, world_map2
contains following the ISO country codes: Alpha-2 code, Alpha-3 code, and Numeric code. (To download and/or improve world_map2
: world_map2_project.rda).
The process behind making world_map2
is described in a separate markdown. This markdown showcases how to use world_map2
with Global Studies data. This markdown uses data from the Gapminder.org foundation as examples of typical Global Studies data. We want to show the annual results in Choropleth maps.
# our custom map data
load(here::here("data",
"tidy_data",
"maps",
"world_map2_project.rda"))
# raw Global Studies dat from Gapminder.org
gni_percapita <- read_csv(here::here("data",
"raw_data",
"gnipercapita_ppp_current_international.csv") )
# raw Global Studies dat from Gapminder.org
water_access <- read_csv(here::here("data",
"raw_data",
"at_least_basic_water_source_overall_access_percent.csv") )
# raw Global Studies dat from Gapminder.org
energy_capita <- read_csv(here::here("data",
"raw_data",
"energy_use_per_person.csv"))
The first data set, “Energy use, per person”, is from Gapminder.org but sourced from there to the World Bank. We will proceed by exploring it, creating a Tidy version, and then mapping the Tidy version. Our try at mapping will introduce and troubleshoot a common problem.
We see that the data is in a wide format, and not Tidy. It does not use ISO country codes. In fact, none of data sets available from Gapminder.org/data do so. So we will need to match country
names.
energy_capita %>% vis_dat()
energy_capita %>% glimpse()
## Rows: 169
## Columns: 57
## $ country <chr> "Albania", "Algeria", "Angola", "Antigua and Barbuda", "Argent~
## $ `1960` <dbl> NA, NA, NA, NA, NA, NA, 3060, 1550, NA, NA, NA, NA, NA, NA, 25~
## $ `1961` <dbl> NA, NA, NA, NA, NA, NA, 3120, 1550, NA, NA, NA, NA, NA, NA, 25~
## $ `1962` <dbl> NA, NA, NA, NA, NA, NA, 3170, 1680, NA, NA, NA, NA, NA, NA, 28~
## $ `1963` <dbl> NA, NA, NA, NA, NA, NA, 3280, 1820, NA, NA, NA, NA, NA, NA, 30~
## $ `1964` <dbl> NA, NA, NA, NA, NA, NA, 3350, 1860, NA, NA, NA, NA, NA, NA, 30~
## $ `1965` <dbl> NA, NA, NA, NA, NA, NA, 3460, 1850, NA, NA, NA, NA, NA, NA, 31~
## $ `1966` <dbl> NA, NA, NA, NA, NA, NA, 3550, 1900, NA, NA, NA, NA, NA, NA, 30~
## $ `1967` <dbl> NA, NA, NA, NA, NA, NA, 3690, 1920, NA, NA, NA, NA, NA, NA, 31~
## $ `1968` <dbl> NA, NA, NA, NA, NA, NA, 3760, 2050, NA, NA, NA, NA, NA, NA, 35~
## $ `1969` <dbl> NA, NA, NA, NA, NA, NA, 3790, 2180, NA, NA, NA, NA, NA, NA, 38~
## $ `1970` <dbl> NA, NA, NA, NA, NA, NA, 4060, 2420, NA, NA, NA, NA, NA, NA, 41~
## $ `1971` <dbl> 785.0, 233.0, 637.0, NA, 1390.0, NA, 3990.0, 2510.0, NA, NA, 6~
## $ `1972` <dbl> 866.0, 263.0, 663.0, NA, 1390.0, NA, 4040.0, 2630.0, NA, NA, 5~
## $ `1973` <dbl> 763.0, 307.0, 636.0, NA, 1420.0, NA, 4260.0, 2830.0, NA, NA, 8~
## $ `1974` <dbl> 777.0, 321.0, 624.0, NA, 1430.0, NA, 4290.0, 2730.0, NA, NA, 9~
## $ `1975` <dbl> 827, 332, 587, NA, 1390, NA, 4350, 2650, NA, NA, 8000, 96, NA,~
## $ `1976` <dbl> 891.0, 369.0, 559.0, NA, 1420.0, NA, 4410.0, 2870.0, NA, NA, 9~
## $ `1977` <dbl> 924.0, 401.0, 533.0, NA, 1430.0, NA, 4670.0, 2800.0, NA, NA, 8~
## $ `1978` <dbl> 1010, 480, 567, NA, 1440, NA, 4630, 2890, NA, NA, 7840, 100, N~
## $ `1979` <dbl> 864, 590, 556, NA, 1500, NA, 4680, 3140, NA, NA, 8810, 102, NA~
## $ `1980` <dbl> 1150, 583, 547, NA, 1500, NA, 4740, 3070, NA, NA, 7790, 106, N~
## $ `1981` <dbl> 989, 615, 532, NA, 1440, NA, 4690, 2900, NA, NA, 8300, 105, NA~
## $ `1982` <dbl> 967, 776, 506, NA, 1430, NA, 4820, 2830, NA, NA, 9070, 108, NA~
## $ `1983` <dbl> 1000, 813, 502, NA, 1430, NA, 4560, 2840, NA, NA, 8500, 108, N~
## $ `1984` <dbl> 1020, 781, 489, NA, 1460, NA, 4650, 2950, NA, NA, 8830, 106, N~
## $ `1985` <dbl> 917, 791, 501, NA, 1370, NA, 4600, 3050, NA, NA, 9920, 110, NA~
## $ `1986` <dbl> 964, 868, 489, NA, 1430, NA, 4620, 3060, NA, NA, 10300, 114, N~
## $ `1987` <dbl> 922, 833, 484, NA, 1480, NA, 4770, 3170, NA, NA, 9520, 110, NA~
## $ `1988` <dbl> 928, 855, 486, NA, 1510, NA, 4700, 3200, NA, NA, 10500, 117, N~
## $ `1989` <dbl> 896, 825, 480, NA, 1450, NA, 5000, 3140, NA, NA, 10200, 120, N~
## $ `1990` <dbl> 813, 861, 497, 1580, 1410, 2180, 5060, 3240, 3170, 2520, 10600~
## $ `1991` <dbl> 573, 889, 492, NA, 1440, 2320, 4930, 3420, 3090, NA, 10100, 11~
## $ `1992` <dbl> 418, 889, 479, NA, 1490, 1200, 4960, 3250, 2460, NA, 10800, 12~
## $ `1993` <dbl> 412, 872, 480, NA, 1470, 652, 5150, 3260, 2180, NA, 11100, 127~
## $ `1994` <dbl> 441, 824, 471, NA, 1550, 420, 5090, 3230, 1950, NA, 11600, 130~
## $ `1995` <dbl> 417, 843, 456, NA, 1550, 511, 5130, 3370, 1810, NA, 11400, 138~
## $ `1996` <dbl> 448, 802, 454, NA, 1590, 562, 5390, 3580, 1510, NA, 11100, 136~
## $ `1997` <dbl> 385, 809, 449, NA, 1620, 594, 5470, 3550, 1440, NA, 12200, 139~
## $ `1998` <dbl> 427, 825, 434, NA, 1660, 610, 5550, 3610, 1490, NA, 12400, 142~
## $ `1999` <dbl> 576, 868, 441, NA, 1670, 594, 5610, 3590, 1370, NA, 11900, 141~
## $ `2000` <dbl> 580, 870, 439, NA, 1670, 656, 5640, 3570, 1400, NA, 12000, 143~
## $ `2001` <dbl> 597, 860, 443, NA, 1570, 657, 5450, 3760, 1410, NA, 11700, 154~
## $ `2002` <dbl> 660, 908, 448, NA, 1510, 618, 5570, 3770, 1410, NA, 11500, 154~
## $ `2003` <dbl> 648, 953, 468, NA, 1600, 656, 5570, 3970, 1480, NA, 11600, 160~
## $ `2004` <dbl> 715, 952, 465, 1680, 1730, 698, 5600, 4010, 1540, 2080, 10900,~
## $ `2005` <dbl> 720, 978, 434, 1680, 1720, 843, 5560, 4090, 1600, 2140, 11700,~
## $ `2006` <dbl> 707, 1030, 459, 1730, 1850, 865, 5710, 4080, 1560, 2130, 11600~
## $ `2007` <dbl> 680, 1080, 472, 1740, 1860, 973, 5870, 4020, 1410, 2100, 11200~
## $ `2008` <dbl> 711, 1070, 492, NA, 1940, 1030, 5960, 4030, 1520, NA, 11300, 1~
## $ `2009` <dbl> 732, 1150, 515, NA, 1870, 904, 5860, 3800, 1330, NA, 10300, 19~
## $ `2010` <dbl> 729, 1110, 521, NA, 1930, 863, 5790, 4050, 1280, NA, 10200, 20~
## $ `2011` <dbl> 765, 1140, 522, NA, 1950, 944, 5750, 3920, 1370, NA, 9910, 212~
## $ `2012` <dbl> 688, 1230, 552, NA, 1940, 1030, 5580, 3890, 1470, NA, 9660, 22~
## $ `2013` <dbl> 801, 1250, 534, NA, 1970, 1000, 5470, 3920, 1470, NA, 10400, 2~
## $ `2014` <dbl> 808, 1330, 545, NA, 2030, 1020, 5330, 3760, 1500, NA, 10600, 2~
## $ `2015` <dbl> NA, NA, NA, NA, NA, NA, 5480, 3800, NA, NA, NA, NA, NA, NA, 46~
When converting to a Tidy data format, we want to make sure the variable year
is not a character. But first, let’s check to see if the country names in energy_capita
have a match in our mapping data.
setdiff(energy_capita$country,world_map2$country)
## character(0)
setdiff(world_map2$country, energy_capita$country) %>%
enframe(name = NULL, value ="diff")
## # A tibble: 84 x 1
## diff
## <chr>
## 1 Afghanistan
## 2 American Samoa
## 3 Andorra
## 4 Anguilla
## 5 Antarctica
## 6 Aruba
## 7 Ascension Island
## 8 Azores
## 9 Bermuda
## 10 Bonaire
## # ... with 74 more rows
Yes! All the values for the country
variable in our Global Studies data have a match in world_map2
. We have only 169 countries in energy_capita
, so this comprises only a subset of the countries in world_map2
.
energy_tidy <- energy_capita %>%
pivot_longer(cols = !country,
names_to = "year",
names_transform = list(year = as.integer),
values_to = "energy")
energy_tidy %>% vis_dat()
energy_tidy %>% glimpse()
## Rows: 9,464
## Columns: 3
## $ country <chr> "Albania", "Albania", "Albania", "Albania", "Albania", "Albani~
## $ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 19~
## $ energy <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 785, 866, 763, 777~
We have energy_tidy
. It nows seems a simple step to left_join
it to the map data, and then plot the energy values for a given year. But this will result in void spaces on our map.
energy_tidy %>%
filter(year == 2004) %>%
left_join(world_map2, by = "country") %>%
ggplot(aes(x = long,
y = lat,
group = group,
label = country)) +
geom_polygon(aes(fill = energy) )+
scale_fill_viridis_c(option = "C") +
labs(fill = "Energy Use\nPer Capita",
title = "Gapminder Data: 2004") +
theme_void()
The white spaces for Afghanistan and various nations in Africa and Southeast Asia are unnecessarily confusing. Although these nations were not reported on in energy_tidy
, that does not mean they were replaced by ocean or otherwise ceased to exist.
We want the map to fail gracefully. Missing countries would be better treated as NA cases. And NA cases should be displayed. knowing that we have no data for certain nations is both important and useful.
So we want to expand the data set before we join it with the mapping data. The Tidyverse verb complete() helps solve this problem. For the missing energy
values, we will assign NA
using replace_na()
. This ensures that our plot will use all listed countries, showing the NA
when applicable. Again, we want both graceful failure and to know when we are missing information.
energy_tidy %>%
filter(year == 2004) %>%
complete(country = world_map2$country,
fill = (list(energy = NA )) ) %>%
left_join(world_map2, by = "country") %>%
replace_na(list(year = 2004)) %>%
ggplot(aes(x = long,
y = lat,
group = group,
label = country)) +
geom_polygon(aes(fill = energy) )+
scale_fill_viridis_c(option = "C") +
labs(fill = "Energy Use\nPer Capita",
title = "Gapminder Data: 2000") +
theme_void()
This version fails much more gracefully. It also makes clearer for what regions of the world we lack data.
Let’s say that someone thinks Antarctica justs wastes space in this or like plots as it seldom generates economic, political science, or public health data. We can remove it with a simple filter call. Likewise, we can make the map interactive with one additional library and function call: plotly and ggplotly()
.
For plotly::ggplotly()
, the default legend guide will not carry over. We can solve this by changing the title.
In the code below, this is broken down into three steps. First, create the data set for the plot. Second, create the ggplot object. Third, display it as plotly graphic.
# Map data
enegry_dat_2004 <- energy_tidy %>%
filter(year == 2004) %>%
complete(country = world_map2$country,
fill = (list(energy = NA )) ) %>%
left_join(world_map2, by = "country") %>%
replace_na(list(year = 2004))
# ggplot object
enegry_map_2004 <- enegry_dat_2004 %>%
filter(code_3 != "ATA") %>%
ggplot(aes(x = long,
y = lat,
group = group,
label = country)) +
geom_polygon(aes(fill = energy) )+
scale_fill_viridis_c(option = "C") +
labs(fill = "",
title = "Energy Use Per Capita (2004)") +
theme_void()
# interactive version
plotly::ggplotly(enegry_map_2004)
So we now have a procedure. This project originated with a course entitled “Telling Stories with Data” (TSD), which introduced non-STEM majors to the Tidyverse and basic data visualization and analysis. Beginning students find it useful to have a proven model to follow. We have one now for our Choropleth maps, Let’s apply it to two other Gapminder.org data sets.
The second data set, “At least basic water source, overall access”, is likewise from Gapminder.org but sourced from there to the UN’s Millennium Development Goals. We will apply the procedure – the “recipe” of steps – developed when mapping Energy as above.
The data set “At least basic water source, overall access” has values for 194 countries, some NA
, for the years 2000 to 2017.
water_access %>% vis_dat()
water_access %>% glimpse()
## Rows: 194
## Columns: 19
## $ country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angola", "Ant~
## $ `2000` <dbl> 0.278, 0.879, 0.898, 1.000, 0.411, 0.983, 0.962, 0.951, 0.997,~
## $ `2001` <dbl> 0.278, 0.879, 0.901, 1.000, 0.423, 0.981, 0.965, 0.954, 0.997,~
## $ `2002` <dbl> 0.299, 0.879, 0.904, 1.000, 0.434, 0.980, 0.967, 0.957, 0.997,~
## $ `2003` <dbl> 0.320, 0.879, 0.907, 1.000, 0.444, 0.979, 0.969, 0.960, 0.997,~
## $ `2004` <dbl> 0.341, 0.879, 0.909, 1.000, 0.454, 0.978, 0.972, 0.963, 0.998,~
## $ `2005` <dbl> 0.363, 0.879, 0.912, 1.000, 0.463, 0.977, 0.974, 0.967, 0.998,~
## $ `2006` <dbl> 0.384, 0.879, 0.914, 1.000, 0.472, 0.976, 0.976, 0.970, 0.998,~
## $ `2007` <dbl> 0.408, 0.879, 0.917, 1.000, 0.480, 0.974, 0.978, 0.973, 0.998,~
## $ `2008` <dbl> 0.433, 0.879, 0.919, 1.000, 0.488, 0.973, 0.981, 0.976, 0.998,~
## $ `2009` <dbl> 0.458, 0.879, 0.922, 1.000, 0.496, 0.972, 0.983, 0.979, 0.999,~
## $ `2010` <dbl> 0.484, 0.878, 0.924, 1.000, 0.504, 0.971, 0.985, 0.983, 0.999,~
## $ `2011` <dbl> 0.509, 0.878, 0.926, 1.000, 0.512, 0.970, 0.987, 0.986, 0.999,~
## $ `2012` <dbl> 0.535, 0.879, 0.928, 1.000, 0.520, 0.969, 0.989, 0.989, 0.999,~
## $ `2013` <dbl> 0.562, 0.895, 0.931, 1.000, 0.528, 0.967, 0.989, 0.993, 0.999,~
## $ `2014` <dbl> 0.588, 0.910, 0.933, 1.000, 0.535, 0.967, 0.990, 0.996, 1.000,~
## $ `2015` <dbl> 0.615, 0.910, 0.935, 1.000, 0.543, 0.967, 0.991, 0.997, 1.000,~
## $ `2016` <dbl> 0.643, 0.910, 0.935, 1.000, 0.551, 0.967, 0.991, 0.999, 1.000,~
## $ `2017` <dbl> 0.671, 0.910, 0.936, 1.000, 0.558, 0.967, NA, 0.999, 1.000, 1.~
This data set again is in wide format, and not Tidy. But we know how to fix, and can largely recycle our solution for energy_capita
to energy_tidy
. We now need to check the how well the country names match – do we have coverage?
setdiff(water_access$country,world_map2$country)
## character(0)
setdiff(world_map2$country, water_access$country) %>%
enframe(name = NULL, value ="diff")
## # A tibble: 59 x 1
## diff
## <chr>
## 1 American Samoa
## 2 Anguilla
## 3 Antarctica
## 4 Aruba
## 5 Ascension Island
## 6 Azores
## 7 Bermuda
## 8 Bonaire
## 9 Canary Islands
## 10 Cayman Islands
## # ... with 49 more rows
So our map data provides complete coverage for our statistical data. Now, let’s Tidy the data, and then use complete()
with left_join()
to create our plot data sets.
water_tidy <- water_access %>%
pivot_longer(cols = !country,
names_to = "year",
names_transform = list(year = as.integer),
values_to = "water")
water_tidy %>%
vis_dat()
water_tidy %>%
glimpse()
## Rows: 3,492
## Columns: 3
## $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "A~
## $ year <int> 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 20~
## $ water <dbl> 0.278, 0.278, 0.299, 0.320, 0.341, 0.363, 0.384, 0.408, 0.433,~
All good.
water_dat_2010 <- water_tidy %>%
filter(year == 2010) %>%
complete(country = world_map2$country,
fill = (list(water = NA )) ) %>%
left_join(world_map2, by = "country") %>%
replace_na(list(year = 2010))
water_mpa_2010 <- water_dat_2010 %>%
filter(code_3 != "ATA") %>%
ggplot(aes(x = long,
y = lat,
group = group,
label = country)) +
geom_polygon(aes(fill = water) )+
scale_fill_viridis_c(option = "C") +
labs(fill = "",
title = "Basic Water Access (2010)") +
theme_void()
water_mpa_2010
plotly::ggplotly(water_mpa_2010)
Since we have a proven procedure, we can now tackle a slightly more difficult case.
Our third data set, “GNI per capita (PPP, current international $)”, is likewise from Gapminder.org, but sourced from there to the World Bank. It also regretfully does not use any ISO codes for identifying the countries reported on. We are back to matching names.
This data is wide, not Tidy, and has one additional issue. It reports on 196 countries over the time span 1990-2019. The values for GNI Per Capita should be numeric. Instead, they are character data. Values under 10000 appear as numbers; values over 10000 are expressed as k-units: for example, 13.2k for 13200.
gni_percapita %>%
vis_dat()
gni_percapita %>%
glimpse()
## Rows: 196
## Columns: 31
## $ country <chr> "Aruba", "Afghanistan", "Angola", "Albania", "United Arab Emir~
## $ `1990` <chr> "23.1k", NA, "2980", "2550", NA, "6880", "2670", "10.5k", "16.~
## $ `1991` <chr> "24.8k", NA, "3300", "1880", NA, "7760", "2550", "11.3k", "17.~
## $ `1992` <chr> "25.4k", NA, "1620", "1740", NA, "8560", "1290", "11.6k", "17.~
## $ `1993` <chr> "26.1k", NA, "1490", "2110", NA, "9410", "1480", "12.4k", "18.~
## $ `1994` <chr> "27.3k", NA, "1260", "2300", NA, "10k", "1620", "13.2k", "19.6~
## $ `1995` <chr> "27.4k", NA, "2340", "2710", NA, "9790", "1870", "12.6k", "20.~
## $ `1996` <chr> "27.2k", NA, "2410", "3050", NA, "10.4k", "2050", "13.5k", "21~
## $ `1997` <chr> "28.8k", NA, "2820", "2780", NA, "11.3k", "2240", "14.3k", "22~
## $ `1998` <chr> "29.4k", NA, "2810", "3110", NA, "11.7k", "2390", "14.9k", "23~
## $ `1999` <chr> "29k", NA, "2630", "3550", NA, "11.3k", "2510", "15.3k", "24.5~
## $ `2000` <chr> "31.6k", NA, "2780", "3980", "103k", "11.3k", "2740", "16k", "~
## $ `2001` <chr> "30.1k", NA, "2890", "4460", "101k", "10.9k", "3090", "15.7k",~
## $ `2002` <chr> "27.8k", NA, "3370", "4800", "96.3k", "9540", "3580", "15.5k",~
## $ `2003` <chr> "29.6k", NA, "3450", "5150", "99.2k", "10.6k", "4180", "16.7k"~
## $ `2004` <chr> "31.9k", NA, "3760", "5560", "102k", "10.9k", "4810", "17.8k",~
## $ `2005` <chr> "27.1k", NA, "4290", "5990", "99.2k", "12k", "5680", "19.4k", ~
## $ `2006` <chr> "33.6k", NA, "4710", "6750", "97.7k", "14.5k", "6720", "22.2k"~
## $ `2007` <chr> "31.2k", NA, "5320", "7480", "89.9k", "16.1k", "7960", "24.6k"~
## $ `2008` <chr> "35.7k", NA, "5560", "8260", "80.7k", "17k", "8790", "24.5k", ~
## $ `2009` <chr> "32.4k", "1520", "5820", "8680", "68.9k", "15.8k", "7530", "21~
## $ `2010` <chr> "30.7k", "1710", "5950", "9540", "64.9k", "17.4k", "7880", "20~
## $ `2011` <chr> "31k", "1700", "6130", "10.2k", "67.7k", "18.8k", "8310", "19.~
## $ `2012` <chr> "32k", "1920", "6810", "10.4k", "69.2k", "19.1k", "9740", "18.~
## $ `2013` <chr> "34.3k", "2020", "7130", "10.8k", "70.5k", "19.6k", "10.4k", "~
## $ `2014` <chr> "35.1k", "2070", "7680", "11.4k", "73.7k", "19.3k", "10.5k", "~
## $ `2015` <chr> "35.2k", "2110", "6960", "11.8k", "65.2k", "19.7k", "10.4k", "~
## $ `2016` <chr> "35.7k", "2150", "6730", "12.2k", "64.3k", "19.9k", "10.9k", "~
## $ `2017` <chr> "36.3k", "2230", "6860", "13.1k", "67.7k", "23k", "12.5k", "18~
## $ `2018` <chr> NA, "2260", "6550", "13.8k", "68.8k", "22.4k", "13.2k", "20.5k~
## $ `2019` <chr> NA, "2330", "6390", "14.3k", "70.2k", "22.1k", "14.5k", "21.5k~
To properly parse this data, we need to explore a bit more and think our way through the problem. We also need to check the country coverage.
setdiff(gni_percapita$country, world_map2$country) %>%
enframe(name = NULL, value ="diff")
## # A tibble: 2 x 1
## diff
## <chr>
## 1 Curaçao
## 2 Sint Maarten (Dutch part)
setdiff(world_map2$country, gni_percapita$country) %>%
enframe(name = NULL, value ="diff")
## # A tibble: 59 x 1
## diff
## <chr>
## 1 American Samoa
## 2 Andorra
## 3 Anguilla
## 4 Antarctica
## 5 Ascension Island
## 6 Azores
## 7 Bonaire
## 8 Canary Islands
## 9 Chagos Archipelago
## 10 Christmas Island
## # ... with 49 more rows
We have two mismatches: “Curaçao”, which uses the special character “ç”; and “Sint Maarten (Dutch part)”, which contains the addition of “(Dutch part)”. Although both these names are now becoming more regularly used as the short designations for these countries, we will still normalize them to simpler versions: “Curaçao” will be simplified to “Curacao”, and “Sint Maarten (Dutch part)” to “Sint Maarten”. Why? Special characters do not always play well in data sets or in coding, and the parenthetical information can also cause problems. It would be better if the data sets used the ISO country codes in addition to the short version of the country names. But the Gapminder.org data sets so far do not.
Now, back to the bigger issue. We need to convert the GNI Per Capita values to numbers. But the values previously expressed as something k need to be multiplied by 1000.
We have several ways of doing this. Below is one solution that relies on a cut-off value of 175. This assumes – based an exploration of the data – that we have no nations in this set with a reported GNI Per Capita greater than 199000, and that we have no nations with a reported GNI Per Capita less than 200. The top and bottom values will shortly be displayed.
gni_tidy <- gni_percapita %>%
pivot_longer(cols = !country,
names_to = "year",
names_transform = list(year = as.integer),
values_to = "gni_ppp_cap") %>%
mutate(gni_ppp_cap = readr::parse_number(gni_ppp_cap) )%>%
mutate(gni_ppp_cap = case_when(gni_ppp_cap < 200 ~ gni_ppp_cap * 1000,
TRUE ~ gni_ppp_cap) )
gni_tidy <- gni_tidy %>%
mutate( country = case_when(country == "Curaçao" ~ "Curacao",
country == "Sint Maarten (Dutch part)" ~ "Sint Maarten" ,
TRUE ~ country) )
# Brief check
gni_tidy %>% slice_min(gni_ppp_cap, n = 4)
## # A tibble: 4 x 3
## country year gni_ppp_cap
## <chr> <int> <dbl>
## 1 Mozambique 1992 270
## 2 Mozambique 1991 300
## 3 Mozambique 1993 300
## 4 Mozambique 1994 310
gni_tidy %>% slice_max(gni_ppp_cap, n = 4)
## # A tibble: 4 x 3
## country year gni_ppp_cap
## <chr> <int> <dbl>
## 1 Macao, China 2014 132000
## 2 Qatar 2012 132000
## 3 Qatar 2013 131000
## 4 Macao, China 2013 129000
So as it turns out, our cut-off was effective. The lowest reported GNI Per Capita: 270, for Mozambique in 1992. The highest, 132000 for Qatar in 2012 and Macao in 2014.
We should now recheck country coverage.
setdiff(gni_tidy$country, world_map2$country)
## character(0)
gni_tidy %>%
vis_dat()
Complete coverage. And all variables the right types. We have mapping data for all the countries in gni_tidy
. We are now ready to make our maps. From here, we follow the earlier recipe.
gni_dat_2017 <- gni_tidy %>%
filter(year == 2017) %>%
complete(country = world_map2$country,
fill = (list(gni_ppp_cap = NA)) ) %>%
left_join(world_map2, by = "country") %>%
replace_na(list(year = 2017))
gni_map_2017 <- gni_dat_2017%>%
filter(code_3 != "ATA") %>%
ggplot(aes(x = long,
y = lat,
group = group,
label = country)) +
geom_polygon(aes(fill = gni_ppp_cap) )+
scale_fill_viridis_c(option = "C") +
labs(fill = "",
title = "GNI Per Capita for 2017 (in PPP dollars)") +
theme_void()
gni_map_2017
plotly::ggplotly(gni_map_2017 )
The three above case studies show how to use Global studies data andworld_map2
with complete()
to produce better Choropleth maps. As opposed to the defaultggplot2::map_data("world")
, the updated world_map2 both fails gracefully when we have missing countries or data, and contains the following standard ISO country codes: Alpha-2 code, Alpha-3 code, and Numeric code. As such, we can match by country
name to the majority of Gapminder.org data sets, and we can match by to any Global Studies data set which likewise uses one or more of the above ISO country codes. The bottom line: better graphs and improved interoperability while keeping it (relatively) simple in the Tidyverse.
Or, please improve and share: github.com/Thom-J-H/map_Gap_2_Tidy
Thomas J. Haslam
2021-08-14