Introduction

The other day we studied basic R and some of the basic functions of the tidyverse to work with data objects: filter and slice to get rows, select to choose variables, arrange to order rows, and, most important: group_by to define groups in the data so that operations take place at the group level. We also saw left_join to put together two data sets using a common variable (by argument).

Today we will get data for global monitoring looks from the World Development Indicators database using wbstats, and learn the few basic tools we did not see yet:

rename to change variable names, mutate to create new columns through calculation.
summarize to create a new data set of summary indicators where the new observations (rows) are the groups

You can again check the references provided in Studium, Timbers, Campbell, and Lee (⊕2023Timbers, Tiffany-Anne, Trevor Campbell, and Melissa Lee. 2023. Data Science : An Introduction. S.l: CRC Press. https://datasciencebook.ca/.), Larsen and Fazekas (⊕2021Larsen, Erik Gahner, and Zoltán Fazekas. 2021. Quantitative Politics with R. https://github.com/erikgahner/qpolr/raw/master/book/_book/qpolr.pdf.), Wickham and Grolemund (⊕2022Wickham, Hadley, and Garrett Grolemund. 2022. R for Data Science. O’Reilly Media, Inc. https://r4ds.had.co.nz/.). We will also use figures from Ismay and Kim (⊕2024Ismay, Chester, and Albert Y. Kim. 2024. ModernDive: An Introduction to Statistical and Data Sciences via r. moderndive.com. https://moderndive.com/.). Note that references are listed in the margin.

New packages to install

We will use today tmap to draw maps. If you have not installed it yet, please do so now:

install.packages("tmap")

Getting data from the World Bank

It is very simple to use wbstats if you know the names of the indicators from the World Development Indicators database. Today we download the 3 indicators used in gapminder: life expectancy, GDP per capita, and the total population. We also get the database of countries that we can look at.

my_indicators <- c(
  life_expectancy = "SP.DYN.LE00.IN", 
  gdp_capita ="NY.GDP.PCAP.CD", 
  pop = "SP.POP.TOTL"
  )



wbind <- wb_data(my_indicators, start_date = 2021) 

countries <- wb_countries()

This is the structure of the two data sets.

str(wbind)

## tibble [217 × 7] (S3: tbl_df/tbl/data.frame)
##  $ iso2c          : chr [1:217] "AW" "AF" "AO" "AL" ...
##  $ iso3c          : chr [1:217] "ABW" "AFG" "AGO" "ALB" ...
##  $ country        : chr [1:217] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ date           : num [1:217] 2021 2021 2021 2021 2021 ...
##  $ gdp_capita     : num [1:217] 29128 356 1927 6377 42072 ...
##   ..- attr(*, "label")= chr "GDP per capita (current US$)"
##  $ life_expectancy: num [1:217] 74.6 62 61.6 76.5 NA ...
##   ..- attr(*, "label")= chr "Life expectancy at birth, total (years)"
##  $ pop            : num [1:217] 106537 40099462 34503774 2811666 79034 ...
##   ..- attr(*, "label")= chr "Population, total"

str(countries)

## tibble [296 × 18] (S3: tbl_df/tbl/data.frame)
##  $ iso3c             : chr [1:296] "ABW" "AFE" "AFG" "AFR" ...
##  $ iso2c             : chr [1:296] "AW" "ZH" "AF" "A9" ...
##  $ country           : chr [1:296] "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa" ...
##  $ capital_city      : chr [1:296] "Oranjestad" NA "Kabul" NA ...
##  $ longitude         : num [1:296] -70 NA 69.2 NA NA ...
##  $ latitude          : num [1:296] 12.5 NA 34.5 NA NA ...
##  $ region_iso3c      : chr [1:296] "LCN" NA "SAS" NA ...
##  $ region_iso2c      : chr [1:296] "ZJ" NA "8S" NA ...
##  $ region            : chr [1:296] "Latin America & Caribbean" "Aggregates" "South Asia" "Aggregates" ...
##  $ admin_region_iso3c: chr [1:296] NA NA "SAS" NA ...
##  $ admin_region_iso2c: chr [1:296] NA NA "8S" NA ...
##  $ admin_region      : chr [1:296] NA NA "South Asia" NA ...
##  $ income_level_iso3c: chr [1:296] "HIC" NA "LIC" NA ...
##  $ income_level_iso2c: chr [1:296] "XD" NA "XM" NA ...
##  $ income_level      : chr [1:296] "High income" "Aggregates" "Low income" "Aggregates" ...
##  $ lending_type_iso3c: chr [1:296] "LNX" NA "IDX" NA ...
##  $ lending_type_iso2c: chr [1:296] "XX" NA "XI" NA ...
##  $ lending_type      : chr [1:296] "Not classified" "Aggregates" "IDA" "Aggregates" ...

Joining data

As we saw the other day, there are three conditions to join data sets:

They must have at least one variable in common.
The variables in common must have the same name and the same type of data.
If they do not, we should change the names or the type of data before joining.

In our case there are three common variables: iso2c, iso3c and country. We can join based on any of them, although it is better to use a code variable.

When there are several common variable we can do any of these things:

Specify all of them in the by argument.

wbind <- wbind |> 
  left_join(countries, by=c("iso2c","iso3c","country"))

If they share information, let R guess what are the by variables leaving it blank. This can be problematic if not all variables share the content, for instance, if the country names differ among the data sets.

wbind <- wbind |>  
  left_join(countries)

Remove from the second set the common variables we do not want to use in the join and use only the selected join variable/s. Otherwise R changes the names of the variables with common name in both objects adding a suffix since otherwise there would be two variables with the same name, something forbidden in a tibble. This strategy takes a bit longer but it is safer.

wbind <- wbind |> 
  left_join(countries |> select(-iso2c,-country), by="iso3c")

Since we know the two data sets are compatible coming from the same source, We opt for the second (shortest) choice.

wbind <- wbind |> 
  left_join(countries)

## Joining with `by = join_by(iso2c, iso3c, country, capital_city, longitude,
## latitude, region_iso3c, region_iso2c, region, admin_region_iso3c,
## admin_region_iso2c, admin_region, income_level_iso3c, income_level_iso2c,
## income_level, lending_type_iso3c, lending_type_iso2c, lending_type)`

Note the message telling us the variables used for joining.

Now wbind has all the information in country:

str(wbind)

## tibble [217 × 22] (S3: tbl_df/tbl/data.frame)
##  $ iso2c             : chr [1:217] "AW" "AF" "AO" "AL" ...
##  $ iso3c             : chr [1:217] "ABW" "AFG" "AGO" "ALB" ...
##  $ country           : chr [1:217] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ date              : num [1:217] 2021 2021 2021 2021 2021 ...
##  $ gdp_capita        : num [1:217] 29128 356 1927 6377 42072 ...
##   ..- attr(*, "label")= chr "GDP per capita (current US$)"
##  $ life_expectancy   : num [1:217] 74.6 62 61.6 76.5 NA ...
##   ..- attr(*, "label")= chr "Life expectancy at birth, total (years)"
##  $ pop               : num [1:217] 106537 40099462 34503774 2811666 79034 ...
##   ..- attr(*, "label")= chr "Population, total"
##  $ capital_city      : chr [1:217] "Oranjestad" "Kabul" "Luanda" "Tirane" ...
##  $ longitude         : num [1:217] -70.02 69.18 13.24 19.82 1.52 ...
##  $ latitude          : num [1:217] 12.52 34.52 -8.81 41.33 42.51 ...
##  $ region_iso3c      : chr [1:217] "LCN" "SAS" "SSF" "ECS" ...
##  $ region_iso2c      : chr [1:217] "ZJ" "8S" "ZG" "Z7" ...
##  $ region            : chr [1:217] "Latin America & Caribbean" "South Asia" "Sub-Saharan Africa" "Europe & Central Asia" ...
##  $ admin_region_iso3c: chr [1:217] NA "SAS" "SSA" "ECA" ...
##  $ admin_region_iso2c: chr [1:217] NA "8S" "ZF" "7E" ...
##  $ admin_region      : chr [1:217] NA "South Asia" "Sub-Saharan Africa (excluding high income)" "Europe & Central Asia (excluding high income)" ...
##  $ income_level_iso3c: chr [1:217] "HIC" "LIC" "LMC" "UMC" ...
##  $ income_level_iso2c: chr [1:217] "XD" "XM" "XN" "XT" ...
##  $ income_level      : chr [1:217] "High income" "Low income" "Lower middle income" "Upper middle income" ...
##  $ lending_type_iso3c: chr [1:217] "LNX" "IDX" "IBD" "IBD" ...
##  $ lending_type_iso2c: chr [1:217] "XX" "XI" "XF" "XF" ...
##  $ lending_type      : chr [1:217] "Not classified" "IDA" "IBRD" "IBRD" ...

Data manipulation: What we knew

See what these applications from the other day do:

wbind |> 
  filter(admin_region=="South Asia") |> 
  select(country,gdp_capita:pop) |>
  flextable() |>  # Create nice table
  colformat_double(digits = 1) # Limit to one the decimal digits.

country	gdp_capita	life_expectancy	pop
Afghanistan	355.8	62.0	40,099,462.0
Bangladesh	2,457.9	72.4	169,356,251.0
Bhutan	3,560.2	71.8	777,486.0
India	2,238.1	67.2	1,407,563,842.0
Sri Lanka	3,996.6	76.4	22,156,000.0
Maldives	10,076.3	79.9	521,457.0
Nepal	1,229.4	68.5	30,034,989.0
Pakistan	1,506.1	66.1	231,402,117.0

⊕ The effects of group_by

wbind |>
  group_by(region) |> 
  arrange(life_expectancy) |> 
  slice(1) |> 
  select(region,country,life_expectancy) |>
  flextable() |>  # Create nice table
  colformat_double(digits = 0) # No decimal digits.

region	country	life_expectancy
East Asia & Pacific	Nauru	64
Europe & Central Asia	Moldova	69
Latin America & Caribbean	Haiti	63
Middle East & North Africa	Djibouti	62
North America	United States	76
South Asia	Afghanistan	62
Sub-Saharan Africa	Chad	53

⊕ The effects of group_by

Creating new variables: `mutate` and renaming variables with `rename`

The GDP per capita is a measure of standard of living, how much money the average person makes. Sometimes we want to compare the size of the economy. We can create new variables specifying the formula that creates them. In this case, GDP=pop*gdp_capita. We should also pay attention to units. Since gdp_capita was measured in dollars, we get too many figures. We can divide by \(1e9\) to express the size of the economy in billions of dollars.

wbind |> 
  mutate(GDP=pop*gdp_capita/1e9) |> 
  arrange(-GDP) |> 
  select(country,GDP,gdp_capita,pop) |> 
  slice(1:10) |> 
  rename(`Per capita GDP`=gdp_capita,
         `Population`=pop) |> 
  flextable() |>  # Create nice table
  colformat_double(digits = 0) # No decimal digits.

country	GDP	Per capita GDP	Population
United States	23,315	70,219	332,031,554
China	17,820	12,618	1,412,360,000
Japan	5,035	40,059	125,681,593
Germany	4,279	51,427	83,196,078
India	3,150	2,238	1,407,563,842
United Kingdom	3,142	46,870	67,026,292
France	2,959	43,671	67,764,304
Italy	2,155	36,449	59,133,173
Canada	2,007	52,515	38,226,498
Korea, Rep.	1,818	35,142	51,744,876

A very interesting possibility is to combine mutate with a group_by. This results in new columns that are calculated for each subtable separately. For instance, we can define at the region level the maximum life expectancy in the region minus the present one, and then check the countries whose life expectancy is farthest from their region’s leader:

wbind |> 
  group_by(region) |> 
  mutate(dif_LE=max(life_expectancy)-life_expectancy) |> 
  ungroup() |> 
  arrange(-dif_LE) |> 
  select(region, country,
         LE=life_expectancy,
         dif_LE) |> 
  slice(1:10) |> 
  flextable() |>  # Create nice table
  colformat_double(digits = 2) # 2 decimal digits.

region	country	LE	dif_LE
Sub-Saharan Africa	Chad	52.52	21.53
Sub-Saharan Africa	Nigeria	52.68	21.38
Sub-Saharan Africa	Lesotho	53.06	20.99
Middle East & North Africa	Djibouti	62.30	20.56
Sub-Saharan Africa	Central African Republic	53.90	20.16
Middle East & North Africa	Yemen, Rep.	63.75	19.11
Sub-Saharan Africa	South Sudan	54.98	19.08
Sub-Saharan Africa	Somalia	55.28	18.77
South Asia	Afghanistan	61.98	17.94
Sub-Saharan Africa	Eswatini	57.07	16.99

It is almost all countries in Africa! They are lagging behind Cabo Verde where life expectancy is 74.052.

⊕ For a simple table: For grouped data:

Creating summaries of data by group with `summarize`

summarize takes a data table (or subtable, if groups are defined), and defines a summary indicator for the complete table. It is generally used in combination with functions such as n(), for counts, max() or min() (maximum and minimum of a variable in a group), mean() or median(), providing average levels. In the case of average life expectancy at the region level, we should not use the simple average because some countries have more population than others. We should use the weighted average, giving more weight according to population.

## Average (weighted) levels by continent
wbind |> 
  group_by(region) |>  # Analyze each region separately
  drop_na(pop) |>      # Drop countries with no pop data (Eritrea)
  summarize(`Weighted average`=weighted.mean(life_expectancy,pop,na.rm=TRUE),
 `Simple average`=mean(life_expectancy,na.rm=TRUE),
 Countries=n(),
 Missing=sum(is.na(life_expectancy)),
 `Pop (millions)`=sum(pop)/1e6
            ) |>      # Calculate summary statistics by group
  arrange(`Weighted average`) |> # Order regions according to increasing average level
  flextable() |>  # Create nice table
  colformat_double(digits = 1) |> 
  set_caption("Average life expectancy by region")

Average life expectancy by region
region	Weighted average	Simple average	Countries	Missing	Pop (millions)
Sub-Saharan Africa	60.2	61.8	48	0	1,181.2
South Asia	67.6	70.5	8	0	1,901.9
Latin America & Caribbean	72.1	73.0	42	2	655.0
Middle East & North Africa	72.7	74.4	21	0	486.2
East Asia & Pacific	76.3	73.6	37	3	2,346.7
Europe & Central Asia	76.6	77.1	58	3	923.6
North America	77.0	79.4	3	0	370.3

We see the weighted average is generally smaller. This is because small countries (eg: Monaco, Andorra, Mauritius, Hong-Kong), tend to have higher life expectancy.

⊕Basic, necessary, elements: Optional layers: # Doing graphs

The next day we will explore in detail how graphs are done using {ggplot2}. The idea is to:

Get the data object, generally a tibble.
Define what is in the paper, the aesthetics: Which are the \(x\) and \(y\) variables, do we want to use color linked to another variable, or size, …
Add the geometrical objects geom_xxx that we want.

Adding things to the graphic is done through + (instead of |>)!

Finally there is the possibility of adding more layers. This is a summary from this source

This is the code producing our graph. Try to make sense of each layer.

wbind |>
  ggplot(aes(
      x = gdp_capita, 
      y = life_expectancy
      )) +
  geom_point(aes(size=pop, color=region)) +
  geom_smooth(se = FALSE) +
  scale_x_log10(
    labels = scales::dollar_format(),
    breaks = scales::log_breaks(n = 4)
  ) + 
  scale_size_continuous(
    labels = scales::number_format(scale = 1/1e6, suffix = "m"),
    breaks = seq(1e8,1e9, 2e8),
    range = c(1,20)
    ) +
  theme_minimal() +
  labs(
    title = "Life expectancy according to GDP per capita",
    x = "GDP per Capita",
    y = "Life expectancy (years)",
    size = "Population",
    color = NULL,
    caption = "Source: World Bank, through `wbstats`"
  )

Maps with `ggplot2`

Mapping the data is also relatively simple. We only need to specify a base map to which the color will depend on per-capita income. We can do it with ggplot2 replicating the vignette of {wbstats}.

One possibility is the world map in package tmap. We get it in our environment by:

library(tmap)
data(World)

The object is an sf (simple features) object. Basically a data object with one specific variable, geometry, that contains the polygons defining the country map.

str(World)

## Classes 'sf' and 'data.frame':   177 obs. of  16 variables:
##  $ iso_a3      : Factor w/ 177 levels "AFG","AGO","ALB",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ name        : Factor w/ 177 levels "Afghanistan",..: 1 4 2 166 6 7 5 56 8 9 ...
##  $ sovereignt  : Factor w/ 171 levels "Afghanistan",..: 1 4 2 159 6 7 5 52 8 9 ...
##  $ continent   : Factor w/ 8 levels "Africa","Antarctica",..: 3 1 4 3 8 3 2 7 6 4 ...
##  $ area        : Units: [km^2] num  652860 1246700 27400 71252 2736690 ...
##  $ pop_est     : num  28400000 12799293 3639453 4798491 40913584 ...
##  $ pop_est_dens: num  43.5 10.3 132.8 67.3 15 ...
##  $ economy     : Factor w/ 7 levels "1. Developed region: G7",..: 7 7 6 6 5 6 6 6 2 2 ...
##  $ income_grp  : Factor w/ 5 levels "1. High income: OECD",..: 5 3 4 2 3 4 2 2 1 1 ...
##  $ gdp_cap_est : num  784 8618 5993 38408 14027 ...
##  $ life_exp    : num  59.7 NA 77.3 NA 75.9 ...
##  $ well_being  : num  3.8 NA 5.5 NA 6.5 4.3 NA NA 7.2 7.4 ...
##  $ footprint   : num  0.79 NA 2.21 NA 3.14 2.23 NA NA 9.31 6.06 ...
##  $ inequality  : num  0.427 NA 0.165 NA 0.164 ...
##  $ HPI         : num  20.2 NA 36.8 NA 35.2 ...
##  $ geometry    :sfc_MULTIPOLYGON of length 177; first list element: List of 1
##   ..$ :List of 1
##   .. ..$ : num [1:69, 1:2] 61.2 62.2 63 63.2 64 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:15] "iso_a3" "name" "sovereignt" "continent" ...

class(World)

## [1] "sf"         "data.frame"

It already has some variables that could be used for maps.

We can also see the map by:

ggplot(data = World) +
    geom_sf()

In order to map our data we have to bring it to the World object. Let us see how:

mi_mapa=World |> 
  filter(iso_a3 != "ATA") |> # remove Antarctica
  left_join(wbind |> rename(iso_a3=iso3c) |>      select(iso_a3,gdp_capita,                            pop,life_expectancy),
     by="iso_a3")

Now we do the new map

mi_mapa |> ggplot(aes(fill = life_expectancy)) +
  geom_sf() +
  scale_fill_viridis_c() +
  theme(legend.position="bottom") +
  labs(
    title = "Map of life expectancy in 2021",
    fill = NULL,
    caption = paste("Source: World Bank") 
  ) +
   coord_sf()

Monitoring global indicators with R and the tidyverse (ii)

Global Development Indicators, Global Studies

José Antonio Ortega

12 mar., 2024

Introduction

New packages to install

Getting data from the World Bank

Joining data

Data manipulation: What we knew

Creating new variables: `mutate` and renaming variables with `rename`

Creating summaries of data by group with `summarize`

Maps with `ggplot2`

Monitoring global indicators with R and the tidyverse (ii)

Global Development Indicators, Global Studies

José Antonio Ortega

12 mar., 2024

Introduction

New packages to install

Getting data from the World Bank

Joining data

Data manipulation: What we knew

Creating new variables: mutate and renaming variables with rename

Creating summaries of data by group with summarize

Maps with ggplot2

Creating new variables: `mutate` and renaming variables with `rename`

Creating summaries of data by group with `summarize`

Maps with `ggplot2`