Monitoring global indicators with R and the tidyverse (ii)

Global Development Indicators, Global Studies

José Antonio Ortega

12 mar., 2024

Introduction

The other day we studied basic R and some of the basic functions of the tidyverse to work with data objects: filter and slice to get rows, select to choose variables, arrange to order rows, and, most important: group_by to define groups in the data so that operations take place at the group level. We also saw left_join to put together two data sets using a common variable (by argument).

Today we will get data for global monitoring looks from the World Development Indicators database using wbstats, and learn the few basic tools we did not see yet:

You can again check the references provided in Studium, Timbers, Campbell, and Lee (2023Timbers, Tiffany-Anne, Trevor Campbell, and Melissa Lee. 2023. Data Science : An Introduction. S.l: CRC Press. https://datasciencebook.ca/.), Larsen and Fazekas (2021Larsen, Erik Gahner, and Zoltán Fazekas. 2021. Quantitative Politics with R. https://github.com/erikgahner/qpolr/raw/master/book/_book/qpolr.pdf.), Wickham and Grolemund (2022Wickham, Hadley, and Garrett Grolemund. 2022. R for Data Science. O’Reilly Media, Inc. https://r4ds.had.co.nz/.). We will also use figures from Ismay and Kim (2024Ismay, Chester, and Albert Y. Kim. 2024. ModernDive: An Introduction to Statistical and Data Sciences via r. moderndive.com. https://moderndive.com/.). Note that references are listed in the margin.

New packages to install

We will use today tmap to draw maps. If you have not installed it yet, please do so now:

install.packages("tmap")

Getting data from the World Bank

It is very simple to use wbstats if you know the names of the indicators from the World Development Indicators database. Today we download the 3 indicators used in gapminder: life expectancy, GDP per capita, and the total population. We also get the database of countries that we can look at.

my_indicators <- c(
  life_expectancy = "SP.DYN.LE00.IN", 
  gdp_capita ="NY.GDP.PCAP.CD", 
  pop = "SP.POP.TOTL"
  )



wbind <- wb_data(my_indicators, start_date = 2021) 

countries <- wb_countries()

This is the structure of the two data sets.

str(wbind)
## tibble [217 × 7] (S3: tbl_df/tbl/data.frame)
##  $ iso2c          : chr [1:217] "AW" "AF" "AO" "AL" ...
##  $ iso3c          : chr [1:217] "ABW" "AFG" "AGO" "ALB" ...
##  $ country        : chr [1:217] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ date           : num [1:217] 2021 2021 2021 2021 2021 ...
##  $ gdp_capita     : num [1:217] 29128 356 1927 6377 42072 ...
##   ..- attr(*, "label")= chr "GDP per capita (current US$)"
##  $ life_expectancy: num [1:217] 74.6 62 61.6 76.5 NA ...
##   ..- attr(*, "label")= chr "Life expectancy at birth, total (years)"
##  $ pop            : num [1:217] 106537 40099462 34503774 2811666 79034 ...
##   ..- attr(*, "label")= chr "Population, total"
str(countries)
## tibble [296 × 18] (S3: tbl_df/tbl/data.frame)
##  $ iso3c             : chr [1:296] "ABW" "AFE" "AFG" "AFR" ...
##  $ iso2c             : chr [1:296] "AW" "ZH" "AF" "A9" ...
##  $ country           : chr [1:296] "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa" ...
##  $ capital_city      : chr [1:296] "Oranjestad" NA "Kabul" NA ...
##  $ longitude         : num [1:296] -70 NA 69.2 NA NA ...
##  $ latitude          : num [1:296] 12.5 NA 34.5 NA NA ...
##  $ region_iso3c      : chr [1:296] "LCN" NA "SAS" NA ...
##  $ region_iso2c      : chr [1:296] "ZJ" NA "8S" NA ...
##  $ region            : chr [1:296] "Latin America & Caribbean" "Aggregates" "South Asia" "Aggregates" ...
##  $ admin_region_iso3c: chr [1:296] NA NA "SAS" NA ...
##  $ admin_region_iso2c: chr [1:296] NA NA "8S" NA ...
##  $ admin_region      : chr [1:296] NA NA "South Asia" NA ...
##  $ income_level_iso3c: chr [1:296] "HIC" NA "LIC" NA ...
##  $ income_level_iso2c: chr [1:296] "XD" NA "XM" NA ...
##  $ income_level      : chr [1:296] "High income" "Aggregates" "Low income" "Aggregates" ...
##  $ lending_type_iso3c: chr [1:296] "LNX" NA "IDX" NA ...
##  $ lending_type_iso2c: chr [1:296] "XX" NA "XI" NA ...
##  $ lending_type      : chr [1:296] "Not classified" "Aggregates" "IDA" "Aggregates" ...

Joining data

As we saw the other day, there are three conditions to join data sets:

In our case there are three common variables: iso2c, iso3c and country. We can join based on any of them, although it is better to use a code variable.

When there are several common variable we can do any of these things:

wbind <- wbind |> 
  left_join(countries, by=c("iso2c","iso3c","country")) 
wbind <- wbind |>  
  left_join(countries) 
wbind <- wbind |> 
  left_join(countries |> select(-iso2c,-country), by="iso3c") 

Since we know the two data sets are compatible coming from the same source, We opt for the second (shortest) choice.

wbind <- wbind |> 
  left_join(countries) 
## Joining with `by = join_by(iso2c, iso3c, country, capital_city, longitude,
## latitude, region_iso3c, region_iso2c, region, admin_region_iso3c,
## admin_region_iso2c, admin_region, income_level_iso3c, income_level_iso2c,
## income_level, lending_type_iso3c, lending_type_iso2c, lending_type)`

Note the message telling us the variables used for joining.

Now wbind has all the information in country:

str(wbind)
## tibble [217 × 22] (S3: tbl_df/tbl/data.frame)
##  $ iso2c             : chr [1:217] "AW" "AF" "AO" "AL" ...
##  $ iso3c             : chr [1:217] "ABW" "AFG" "AGO" "ALB" ...
##  $ country           : chr [1:217] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ date              : num [1:217] 2021 2021 2021 2021 2021 ...
##  $ gdp_capita        : num [1:217] 29128 356 1927 6377 42072 ...
##   ..- attr(*, "label")= chr "GDP per capita (current US$)"
##  $ life_expectancy   : num [1:217] 74.6 62 61.6 76.5 NA ...
##   ..- attr(*, "label")= chr "Life expectancy at birth, total (years)"
##  $ pop               : num [1:217] 106537 40099462 34503774 2811666 79034 ...
##   ..- attr(*, "label")= chr "Population, total"
##  $ capital_city      : chr [1:217] "Oranjestad" "Kabul" "Luanda" "Tirane" ...
##  $ longitude         : num [1:217] -70.02 69.18 13.24 19.82 1.52 ...
##  $ latitude          : num [1:217] 12.52 34.52 -8.81 41.33 42.51 ...
##  $ region_iso3c      : chr [1:217] "LCN" "SAS" "SSF" "ECS" ...
##  $ region_iso2c      : chr [1:217] "ZJ" "8S" "ZG" "Z7" ...
##  $ region            : chr [1:217] "Latin America & Caribbean" "South Asia" "Sub-Saharan Africa" "Europe & Central Asia" ...
##  $ admin_region_iso3c: chr [1:217] NA "SAS" "SSA" "ECA" ...
##  $ admin_region_iso2c: chr [1:217] NA "8S" "ZF" "7E" ...
##  $ admin_region      : chr [1:217] NA "South Asia" "Sub-Saharan Africa (excluding high income)" "Europe & Central Asia (excluding high income)" ...
##  $ income_level_iso3c: chr [1:217] "HIC" "LIC" "LMC" "UMC" ...
##  $ income_level_iso2c: chr [1:217] "XD" "XM" "XN" "XT" ...
##  $ income_level      : chr [1:217] "High income" "Low income" "Lower middle income" "Upper middle income" ...
##  $ lending_type_iso3c: chr [1:217] "LNX" "IDX" "IBD" "IBD" ...
##  $ lending_type_iso2c: chr [1:217] "XX" "XI" "XF" "XF" ...
##  $ lending_type      : chr [1:217] "Not classified" "IDA" "IBRD" "IBRD" ...

Data manipulation: What we knew

See what these applications from the other day do:

wbind |> 
  filter(admin_region=="South Asia") |> 
  select(country,gdp_capita:pop) |>
  flextable() |>  # Create nice table
  colformat_double(digits = 1) # Limit to one the decimal digits.

country

gdp_capita

life_expectancy

pop

Afghanistan

355.8

62.0

40,099,462.0

Bangladesh

2,457.9

72.4

169,356,251.0

Bhutan

3,560.2

71.8

777,486.0

India

2,238.1

67.2

1,407,563,842.0

Sri Lanka

3,996.6

76.4

22,156,000.0

Maldives

10,076.3

79.9

521,457.0

Nepal

1,229.4

68.5

30,034,989.0

Pakistan

1,506.1

66.1

231,402,117.0

The effects of group_by

wbind |>
  group_by(region) |> 
  arrange(life_expectancy) |> 
  slice(1) |> 
  select(region,country,life_expectancy) |>
  flextable() |>  # Create nice table
  colformat_double(digits = 0) # No decimal digits.

region

country

life_expectancy

East Asia & Pacific

Nauru

64

Europe & Central Asia

Moldova

69

Latin America & Caribbean

Haiti

63

Middle East & North Africa

Djibouti

62

North America

United States

76

South Asia

Afghanistan

62

Sub-Saharan Africa

Chad

53

The effects of group_by

Creating new variables: mutate and renaming variables with rename

The GDP per capita is a measure of standard of living, how much money the average person makes. Sometimes we want to compare the size of the economy. We can create new variables specifying the formula that creates them. In this case, GDP=pop*gdp_capita. We should also pay attention to units. Since gdp_capita was measured in dollars, we get too many figures. We can divide by \(1e9\) to express the size of the economy in billions of dollars.

wbind |> 
  mutate(GDP=pop*gdp_capita/1e9) |> 
  arrange(-GDP) |> 
  select(country,GDP,gdp_capita,pop) |> 
  slice(1:10) |> 
  rename(`Per capita GDP`=gdp_capita,
         `Population`=pop) |> 
  flextable() |>  # Create nice table
  colformat_double(digits = 0) # No decimal digits.

country

GDP

Per capita GDP

Population

United States

23,315

70,219

332,031,554

China

17,820

12,618

1,412,360,000

Japan

5,035

40,059

125,681,593

Germany

4,279

51,427

83,196,078

India

3,150

2,238

1,407,563,842

United Kingdom

3,142

46,870

67,026,292

France

2,959

43,671

67,764,304

Italy

2,155

36,449

59,133,173

Canada

2,007

52,515

38,226,498

Korea, Rep.

1,818

35,142

51,744,876

A very interesting possibility is to combine mutate with a group_by. This results in new columns that are calculated for each subtable separately. For instance, we can define at the region level the maximum life expectancy in the region minus the present one, and then check the countries whose life expectancy is farthest from their region’s leader:

wbind |> 
  group_by(region) |> 
  mutate(dif_LE=max(life_expectancy)-life_expectancy) |> 
  ungroup() |> 
  arrange(-dif_LE) |> 
  select(region, country,
         LE=life_expectancy,
         dif_LE) |> 
  slice(1:10) |> 
  flextable() |>  # Create nice table
  colformat_double(digits = 2) # 2 decimal digits.

region

country

LE

dif_LE

Sub-Saharan Africa

Chad

52.52

21.53

Sub-Saharan Africa

Nigeria

52.68

21.38

Sub-Saharan Africa

Lesotho

53.06

20.99

Middle East & North Africa

Djibouti

62.30

20.56

Sub-Saharan Africa

Central African Republic

53.90

20.16

Middle East & North Africa

Yemen, Rep.

63.75

19.11

Sub-Saharan Africa

South Sudan

54.98

19.08

Sub-Saharan Africa

Somalia

55.28

18.77

South Asia

Afghanistan

61.98

17.94

Sub-Saharan Africa

Eswatini

57.07

16.99

It is almost all countries in Africa! They are lagging behind Cabo Verde where life expectancy is 74.052.

For a simple table: For grouped data:

Creating summaries of data by group with summarize

summarize takes a data table (or subtable, if groups are defined), and defines a summary indicator for the complete table. It is generally used in combination with functions such as n(), for counts, max() or min() (maximum and minimum of a variable in a group), mean() or median(), providing average levels. In the case of average life expectancy at the region level, we should not use the simple average because some countries have more population than others. We should use the weighted average, giving more weight according to population.

## Average (weighted) levels by continent
wbind |> 
  group_by(region) |>  # Analyze each region separately
  drop_na(pop) |>      # Drop countries with no pop data (Eritrea)
  summarize(`Weighted average`=weighted.mean(life_expectancy,pop,na.rm=TRUE),
 `Simple average`=mean(life_expectancy,na.rm=TRUE),
 Countries=n(),
 Missing=sum(is.na(life_expectancy)),
 `Pop (millions)`=sum(pop)/1e6
            ) |>      # Calculate summary statistics by group
  arrange(`Weighted average`) |> # Order regions according to increasing average level
  flextable() |>  # Create nice table
  colformat_double(digits = 1) |> 
  set_caption("Average life expectancy by region")
Average life expectancy by region

region

Weighted average

Simple average

Countries

Missing

Pop (millions)

Sub-Saharan Africa

60.2

61.8

48

0

1,181.2

South Asia

67.6

70.5

8

0

1,901.9

Latin America & Caribbean

72.1

73.0

42

2

655.0

Middle East & North Africa

72.7

74.4

21

0

486.2

East Asia & Pacific

76.3

73.6

37

3

2,346.7

Europe & Central Asia

76.6

77.1

58

3

923.6

North America

77.0

79.4

3

0

370.3

We see the weighted average is generally smaller. This is because small countries (eg: Monaco, Andorra, Mauritius, Hong-Kong), tend to have higher life expectancy.

Basic, necessary, elements: Optional layers: # Doing graphs

The next day we will explore in detail how graphs are done using {ggplot2}. The idea is to:

Adding things to the graphic is done through + (instead of |>)!

Finally there is the possibility of adding more layers. This is a summary from this source

This is the code producing our graph. Try to make sense of each layer.

wbind |>
  ggplot(aes(
      x = gdp_capita, 
      y = life_expectancy
      )) +
  geom_point(aes(size=pop, color=region)) +
  geom_smooth(se = FALSE) +
  scale_x_log10(
    labels = scales::dollar_format(),
    breaks = scales::log_breaks(n = 4)
  ) + 
  scale_size_continuous(
    labels = scales::number_format(scale = 1/1e6, suffix = "m"),
    breaks = seq(1e8,1e9, 2e8),
    range = c(1,20)
    ) +
  theme_minimal() +
  labs(
    title = "Life expectancy according to GDP per capita",
    x = "GDP per Capita",
    y = "Life expectancy (years)",
    size = "Population",
    color = NULL,
    caption = "Source: World Bank, through `wbstats`"
  ) 

Maps with ggplot2

Mapping the data is also relatively simple. We only need to specify a base map to which the color will depend on per-capita income. We can do it with ggplot2 replicating the vignette of {wbstats}.

One possibility is the world map in package tmap. We get it in our environment by:

library(tmap)
data(World)

The object is an sf (simple features) object. Basically a data object with one specific variable, geometry, that contains the polygons defining the country map.

str(World)
## Classes 'sf' and 'data.frame':   177 obs. of  16 variables:
##  $ iso_a3      : Factor w/ 177 levels "AFG","AGO","ALB",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ name        : Factor w/ 177 levels "Afghanistan",..: 1 4 2 166 6 7 5 56 8 9 ...
##  $ sovereignt  : Factor w/ 171 levels "Afghanistan",..: 1 4 2 159 6 7 5 52 8 9 ...
##  $ continent   : Factor w/ 8 levels "Africa","Antarctica",..: 3 1 4 3 8 3 2 7 6 4 ...
##  $ area        : Units: [km^2] num  652860 1246700 27400 71252 2736690 ...
##  $ pop_est     : num  28400000 12799293 3639453 4798491 40913584 ...
##  $ pop_est_dens: num  43.5 10.3 132.8 67.3 15 ...
##  $ economy     : Factor w/ 7 levels "1. Developed region: G7",..: 7 7 6 6 5 6 6 6 2 2 ...
##  $ income_grp  : Factor w/ 5 levels "1. High income: OECD",..: 5 3 4 2 3 4 2 2 1 1 ...
##  $ gdp_cap_est : num  784 8618 5993 38408 14027 ...
##  $ life_exp    : num  59.7 NA 77.3 NA 75.9 ...
##  $ well_being  : num  3.8 NA 5.5 NA 6.5 4.3 NA NA 7.2 7.4 ...
##  $ footprint   : num  0.79 NA 2.21 NA 3.14 2.23 NA NA 9.31 6.06 ...
##  $ inequality  : num  0.427 NA 0.165 NA 0.164 ...
##  $ HPI         : num  20.2 NA 36.8 NA 35.2 ...
##  $ geometry    :sfc_MULTIPOLYGON of length 177; first list element: List of 1
##   ..$ :List of 1
##   .. ..$ : num [1:69, 1:2] 61.2 62.2 63 63.2 64 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:15] "iso_a3" "name" "sovereignt" "continent" ...
class(World)
## [1] "sf"         "data.frame"

It already has some variables that could be used for maps.

We can also see the map by:

ggplot(data = World) +
    geom_sf()

In order to map our data we have to bring it to the World object. Let us see how:

mi_mapa=World |> 
  filter(iso_a3 != "ATA") |> # remove Antarctica
  left_join(wbind |> rename(iso_a3=iso3c) |>      select(iso_a3,gdp_capita,                            pop,life_expectancy),
     by="iso_a3")

Now we do the new map

mi_mapa |> ggplot(aes(fill = life_expectancy)) +
  geom_sf() +
  scale_fill_viridis_c() +
  theme(legend.position="bottom") +
  labs(
    title = "Map of life expectancy in 2021",
    fill = NULL,
    caption = paste("Source: World Bank") 
  ) +
   coord_sf()