Monitoring global indicators with R and the tidyverse (ii)

Global Development Indicators, Global Studies

José Antonio Ortega

01 mar., 2022

Introduction

The other day we studied basic R and some of the basic functions of the tidyverse to work with data objects: filter and slice to get rows, select to choose variables, arrange to order rows, and, most important: group_by to define groups in the data so that operations take place at the group level. We also saw left_join to put together two data sets using a common variable (by argument).

Today we will get data for global monitoring looks from the World Development Indicators database using wbstats, and learn the few basic tools we did not see yet:

You can again check the references provided in Studium, Timbers, Campbell, and Lee (2022Timbers, Tiffany-Anne, Trevor Campbell, and Melissa Lee. 2022. Data Science : An Introduction. S.l: CRC Press. https://datasciencebook.ca/.), Larsen and Fazekas (2021Larsen, Erik Gahner, and Zoltán Fazekas. 2021. Quantitative Politics with R. http://qpolr.com/.), Wickham and Grolemund (2022Wickham, Hadley, and Garrett Grolemund. 2022. R for Data Science. O’Reilly Media, Inc. https://r4ds.had.co.nz/.). We will also use figures from Ismay and Kim (2022Ismay, Chester, and Albert Y. Kim. 2022. ModernDive: An Introduction to Statistical and Data Sciences via r. moderndive.com. https://moderndive.com/.). Note that references are listed in the margin.

New packages to install

We will use today tmap to draw maps. If you have not installed it yet, please do so now:

install.packages("tmap")

Getting data from the World Bank

It is very simple to use wbstats if you know the names of the indicators from the World Development Indicators database. Today we download the 3 indicators used in gapminder: life expectancy, GDP per capita, and the total population. We also get the database of countries that we can look at.

my_indicators <- c(
  life_expectancy = "SP.DYN.LE00.IN", 
  gdp_capita ="NY.GDP.PCAP.CD", 
  pop = "SP.POP.TOTL"
  )



wbind <- wb_data(my_indicators, start_date = 2019) 

countries <- wb_countries()

This is the structure of the two data sets.

str(wbind)
## tibble [217 x 7] (S3: tbl_df/tbl/data.frame)
##  $ iso2c          : chr [1:217] "AW" "AF" "AO" "AL" ...
##  $ iso3c          : chr [1:217] "ABW" "AFG" "AGO" "ALB" ...
##  $ country        : chr [1:217] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ date           : num [1:217] 2019 2019 2019 2019 2019 ...
##  $ gdp_capita     : num [1:217] NA 494 2810 5396 40897 ...
##   ..- attr(*, "label")= chr "GDP per capita (current US$)"
##  $ life_expectancy: num [1:217] 76.3 64.8 61.1 78.6 NA ...
##   ..- attr(*, "label")= chr "Life expectancy at birth, total (years)"
##  $ pop            : num [1:217] 106310 38041757 31825299 2854191 77146 ...
##   ..- attr(*, "label")= chr "Population, total"
str(countries)
## tibble [299 x 18] (S3: tbl_df/tbl/data.frame)
##  $ iso3c             : chr [1:299] "ABW" "AFE" "AFG" "AFR" ...
##  $ iso2c             : chr [1:299] "AW" "ZH" "AF" "A9" ...
##  $ country           : chr [1:299] "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa" ...
##  $ capital_city      : chr [1:299] "Oranjestad" NA "Kabul" NA ...
##  $ longitude         : num [1:299] -70 NA 69.2 NA NA ...
##  $ latitude          : num [1:299] 12.5 NA 34.5 NA NA ...
##  $ region_iso3c      : chr [1:299] "LCN" NA "SAS" NA ...
##  $ region_iso2c      : chr [1:299] "ZJ" NA "8S" NA ...
##  $ region            : chr [1:299] "Latin America & Caribbean" "Aggregates" "South Asia" "Aggregates" ...
##  $ admin_region_iso3c: chr [1:299] NA NA "SAS" NA ...
##  $ admin_region_iso2c: chr [1:299] NA NA "8S" NA ...
##  $ admin_region      : chr [1:299] NA NA "South Asia" NA ...
##  $ income_level_iso3c: chr [1:299] "HIC" NA "LIC" NA ...
##  $ income_level_iso2c: chr [1:299] "XD" NA "XM" NA ...
##  $ income_level      : chr [1:299] "High income" "Aggregates" "Low income" "Aggregates" ...
##  $ lending_type_iso3c: chr [1:299] "LNX" NA "IDX" NA ...
##  $ lending_type_iso2c: chr [1:299] "XX" NA "XI" NA ...
##  $ lending_type      : chr [1:299] "Not classified" "Aggregates" "IDA" "Aggregates" ...

Joining data

As we saw the other day, there are three conditions to join data sets:

In our case there are three common variables: iso2c, iso3c and country. We can join based on any of them, although it is better to use a code variable.

When there is more than one common variable we can do several things:

wbind <- wbind %>% %>% 
  left_join(countries, by=c("iso2c","iso3c","country")) 
wbind <- wbind %>% %>% 
  left_join(countries) 
wbind <- wbind %>% 
  left_join(countries %>% select(-iso2c,-country), by="iso3c") 

Since we know the two data sets are compatible coming from the same source, We opt for the second (shortest) choice.

wbind <- wbind %>% 
  left_join(countries) 
## Joining, by = c("iso2c", "iso3c", "country", "capital_city", "longitude", "latitude", "region_iso3c", "region_iso2c", "region", "admin_region_iso3c", "admin_region_iso2c", "admin_region", "income_level_iso3c", "income_level_iso2c", "income_level", "lending_type_iso3c", "lending_type_iso2c", "lending_type")

Note the message telling us the variables used for joining.

Now wbind has all the information in country:

str(wbind)
## tibble [217 x 22] (S3: tbl_df/tbl/data.frame)
##  $ iso2c             : chr [1:217] "AW" "AF" "AO" "AL" ...
##  $ iso3c             : chr [1:217] "ABW" "AFG" "AGO" "ALB" ...
##  $ country           : chr [1:217] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ date              : num [1:217] 2019 2019 2019 2019 2019 ...
##  $ gdp_capita        : num [1:217] NA 494 2810 5396 40897 ...
##   ..- attr(*, "label")= chr "GDP per capita (current US$)"
##  $ life_expectancy   : num [1:217] 76.3 64.8 61.1 78.6 NA ...
##   ..- attr(*, "label")= chr "Life expectancy at birth, total (years)"
##  $ pop               : num [1:217] 106310 38041757 31825299 2854191 77146 ...
##   ..- attr(*, "label")= chr "Population, total"
##  $ capital_city      : chr [1:217] "Oranjestad" "Kabul" "Luanda" "Tirane" ...
##  $ longitude         : num [1:217] -70.02 69.18 13.24 19.82 1.52 ...
##  $ latitude          : num [1:217] 12.52 34.52 -8.81 41.33 42.51 ...
##  $ region_iso3c      : chr [1:217] "LCN" "SAS" "SSF" "ECS" ...
##  $ region_iso2c      : chr [1:217] "ZJ" "8S" "ZG" "Z7" ...
##  $ region            : chr [1:217] "Latin America & Caribbean" "South Asia" "Sub-Saharan Africa" "Europe & Central Asia" ...
##  $ admin_region_iso3c: chr [1:217] NA "SAS" "SSA" "ECA" ...
##  $ admin_region_iso2c: chr [1:217] NA "8S" "ZF" "7E" ...
##  $ admin_region      : chr [1:217] NA "South Asia" "Sub-Saharan Africa (excluding high income)" "Europe & Central Asia (excluding high income)" ...
##  $ income_level_iso3c: chr [1:217] "HIC" "LIC" "LMC" "UMC" ...
##  $ income_level_iso2c: chr [1:217] "XD" "XM" "XN" "XT" ...
##  $ income_level      : chr [1:217] "High income" "Low income" "Lower middle income" "Upper middle income" ...
##  $ lending_type_iso3c: chr [1:217] "LNX" "IDX" "IBD" "IBD" ...
##  $ lending_type_iso2c: chr [1:217] "XX" "XI" "XF" "XF" ...
##  $ lending_type      : chr [1:217] "Not classified" "IDA" "IBRD" "IBRD" ...

Data manipulation: What we knew

See what these applications from the other day do:

wbind %>% 
  filter(admin_region=="South Asia") %>% 
  select(country,gdp_capita:pop) %>%
  flextable() %>%  # Create nice table
  colformat_double(digits = 1) # Limit to one the decimal digits.

The effects of group_by

wbind %>%
  group_by(region) %>% 
  arrange(life_expectancy) %>% 
  slice(1) %>% 
  select(region,country,life_expectancy) %>%
  flextable() %>%  # Create nice table
  colformat_double(digits = 0) # No decimal digits.

The effects of group_by

Creating new variables: mutate and renaming variables with rename

The GDP per capita is a measure of standard of living, how much money the average person makes. Sometimes we want to compare the size of the economy. We can create new variables specifying the formula that creates them. In this case, GDP=pop*gdp_capita. We should also pay attentions to units. Since gdp_capita was measured in dollars, we get too many figures. We can divide by \(1e9\) to express the size of the economy in billion dollars.

wbind %>% 
  mutate(GDP=pop*gdp_capita/1e9) %>% 
  arrange(-GDP) %>% 
  select(country,GDP,gdp_capita,pop) %>% 
  slice(1:10) %>% 
  rename(`Per capita GDP`=gdp_capita,
         `Population`=pop) %>% 
  flextable() %>%  # Create nice table
  colformat_double(digits = 0) # No decimal digits.

A very interesting possibility is to combine mutate with a group_by. This results in new columns that are calculated for each subtable separately. For instance, we can define at the region level a variable defined as the maximum life expectancy in the region minus the present one, and then check the countries whose life expectancy is most far from their region’s leader:

wbind %>% 
  group_by(region) %>% 
  mutate(dif_LE=max(life_expectancy)-life_expectancy) %>% 
  ungroup() %>% 
  arrange(-dif_LE) %>% 
  select(region, country,
         LE=life_expectancy,
         dif_LE) %>% 
  slice(1:10) %>% 
  flextable() %>%  # Create nice table
  colformat_double(digits = 2) # 2 decimal digits.

It is all countries in Africa! They are lagging behind Mauritius where life expectancy is 74.2358537.

For a simple table: For grouped data:

Creating summaries of data by group with summarize

summarize takes a data table (or subtable, if groups are defined), and defines a summary indicator for the complete table. It is generally used in combination with functions such as n(), for counts, max() or min() (maximum and minimum of a variable in a group), mean() or median(), providing average levels. In the case of average life expectancy at the region level, we should not use the simple average because some countries have more population than others. We should use the weighted average, giving more weight according to population.

## Average (weighted) levels by continent
wbind %>% 
  group_by(region) %>%  # Analyze each region separately
  drop_na(pop) %>%      # Drop countries with no pop data (Eritrea)
  summarize(`Weighted average`=weighted.mean(life_expectancy,pop,na.rm=TRUE),
 `Simple average`=mean(life_expectancy,na.rm=TRUE),
 Countries=n(),
 Missing=sum(is.na(life_expectancy)),
 `Pop (millions)`=sum(pop)/1e6
            ) %>%      # Calculate summary statistics by group
  arrange(`Weighted average`) %>% # Order regions according to increasing average level
  flextable() %>%  # Create nice table
  colformat_double(digits = 1) %>% 
  set_caption("Average life expectancy by region")

We see the weighted average is generally smaller. This is because small countries (eg: Monaco, Andorra, Mauritius, Hong-Kong), tend to have higher life expectancy.

Basic, necessary, elements: Optional layers: # Doing graphs

The next day we will explore in detail how graphs are done using {ggplot2}. The idea is to:

Adding things to the graphic is done through + (instead of %>%)!

Finally there is the possibility of adding more layers. This is a summary from this source

This is the code producing our graph. Try to make sense of each layer.

wbind %>%
  ggplot(aes(
      x = gdp_capita, 
      y = life_expectancy
      )) +
  geom_point(aes(size=pop, color=region)) +
  geom_smooth(se = FALSE) +
  scale_x_log10(
    labels = scales::dollar_format(),
    breaks = scales::log_breaks(n = 4)
  ) +
  scale_size_continuous(
    labels = scales::number_format(scale = 1/1e6, suffix = "m"),
    breaks = seq(1e8,1e9, 2e8),
    range = c(1,20)
    ) +
  theme_minimal() +
  labs(
    title = "Life expectancy according to GDP per capita",
    x = "GDP per Capita (log scale)",
    y = "Life expectancy (years)",
    size = "Population",
    color = NULL,
    caption = "Source: World Bank, through `wbstats`"
  ) 

Maps with ggplot2

Mapping the data is also relatively simple. We only need to specify a base map to which the color will depend on per-capita income. We can do it with ggplot2 replicating the vignette of {wbstats}.

One possibility is the world map in package tmap. We get it in our environment by:

library(tmap)
data(World)

The object is an sf (simple features) object. Basically a data object with one specific variable, geometry, that contains the polygons defining the country map.

str(World)
## Classes 'sf' and 'data.frame':   177 obs. of  16 variables:
##  $ iso_a3      : Factor w/ 177 levels "AFG","AGO","ALB",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ name        : Factor w/ 177 levels "Afghanistan",..: 1 4 2 166 6 7 5 56 8 9 ...
##  $ sovereignt  : Factor w/ 171 levels "Afghanistan",..: 1 4 2 159 6 7 5 52 8 9 ...
##  $ continent   : Factor w/ 8 levels "Africa","Antarctica",..: 3 1 4 3 8 3 2 7 6 4 ...
##  $ area        : Units: [km^2] num  652860 1246700 27400 71252 2736690 ...
##  $ pop_est     : num  28400000 12799293 3639453 4798491 40913584 ...
##  $ pop_est_dens: num  43.5 10.3 132.8 67.3 15 ...
##  $ economy     : Factor w/ 7 levels "1. Developed region: G7",..: 7 7 6 6 5 6 6 6 2 2 ...
##  $ income_grp  : Factor w/ 5 levels "1. High income: OECD",..: 5 3 4 2 3 4 2 2 1 1 ...
##  $ gdp_cap_est : num  784 8618 5993 38408 14027 ...
##  $ life_exp    : num  59.7 NA 77.3 NA 75.9 ...
##  $ well_being  : num  3.8 NA 5.5 NA 6.5 4.3 NA NA 7.2 7.4 ...
##  $ footprint   : num  0.79 NA 2.21 NA 3.14 2.23 NA NA 9.31 6.06 ...
##  $ inequality  : num  0.427 NA 0.165 NA 0.164 ...
##  $ HPI         : num  20.2 NA 36.8 NA 35.2 ...
##  $ geometry    :sfc_MULTIPOLYGON of length 177; first list element: List of 1
##   ..$ :List of 1
##   .. ..$ : num [1:69, 1:2] 61.2 62.2 63 63.2 64 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:15] "iso_a3" "name" "sovereignt" "continent" ...
class(World)
## [1] "sf"         "data.frame"

It already has some variables that could be used for maps.

We can also see the map by:

ggplot(data = World) +
    geom_sf()

In order to map our data we have to bring it to the World object. Let us see how:

mi_mapa=World %>% 
  filter(iso_a3 != "ATA") %>% # remove Antarctica
  left_join(wbind %>% rename(iso_a3=iso3c) %>%      select(iso_a3,gdp_capita,                            pop,life_expectancy),
     by="iso_a3")

Now we do the new map

mi_mapa %>% ggplot(aes(fill = life_expectancy)) +
  geom_sf() +
  scale_fill_viridis_c() +
  theme(legend.position="bottom") +
  labs(
    title = "Map of life expectancy in 2019",
    fill = NULL,
    caption = paste("Source: World Bank") 
  ) +
   coord_sf()