Introduction

The other day we studied basic R and some of the basic functions of the tidyverse to work with data objects: filter and slice to get rows, select to choose variables, arrange to order rows, and, most important: group_by to define groups in the data so that operations take place at the group level. We also saw left_join to put together two data sets using a common variable (by argument).

Today we will get data for global monitoring looks from the World Development Indicators database using wbstats, and learn the few basic tools we did not see yet:

rename to change variable names, mutate to create new columns through calculation.
summarize to create a new data set of summary indicators where the new observations (rows) are the groups

You can again check the references provided in Studium, Timbers, Campbell, and Lee (⊕2022Timbers, Tiffany-Anne, Trevor Campbell, and Melissa Lee. 2022. Data Science : An Introduction. S.l: CRC Press. https://datasciencebook.ca/.), Larsen and Fazekas (⊕2021Larsen, Erik Gahner, and Zoltán Fazekas. 2021. Quantitative Politics with R. http://qpolr.com/.), Wickham and Grolemund (⊕2022Wickham, Hadley, and Garrett Grolemund. 2022. R for Data Science. O’Reilly Media, Inc. https://r4ds.had.co.nz/.). We will also use figures from Ismay and Kim (⊕2022Ismay, Chester, and Albert Y. Kim. 2022. ModernDive: An Introduction to Statistical and Data Sciences via r. moderndive.com. https://moderndive.com/.). Note that references are listed in the margin.

New packages to install

We will use today tmap to draw maps. If you have not installed it yet, please do so now:

install.packages("tmap")

Getting data from the World Bank

It is very simple to use wbstats if you know the names of the indicators from the World Development Indicators database. Today we download the 3 indicators used in gapminder: life expectancy, GDP per capita, and the total population. We also get the database of countries that we can look at.

my_indicators <- c(
  life_expectancy = "SP.DYN.LE00.IN", 
  gdp_capita ="NY.GDP.PCAP.CD", 
  pop = "SP.POP.TOTL"
  )



wbind <- wb_data(my_indicators, start_date = 2019) 

countries <- wb_countries()

This is the structure of the two data sets.

str(wbind)

## tibble [217 x 7] (S3: tbl_df/tbl/data.frame)
##  $ iso2c          : chr [1:217] "AW" "AF" "AO" "AL" ...
##  $ iso3c          : chr [1:217] "ABW" "AFG" "AGO" "ALB" ...
##  $ country        : chr [1:217] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ date           : num [1:217] 2019 2019 2019 2019 2019 ...
##  $ gdp_capita     : num [1:217] NA 494 2810 5396 40897 ...
##   ..- attr(*, "label")= chr "GDP per capita (current US$)"
##  $ life_expectancy: num [1:217] 76.3 64.8 61.1 78.6 NA ...
##   ..- attr(*, "label")= chr "Life expectancy at birth, total (years)"
##  $ pop            : num [1:217] 106310 38041757 31825299 2854191 77146 ...
##   ..- attr(*, "label")= chr "Population, total"

str(countries)

## tibble [299 x 18] (S3: tbl_df/tbl/data.frame)
##  $ iso3c             : chr [1:299] "ABW" "AFE" "AFG" "AFR" ...
##  $ iso2c             : chr [1:299] "AW" "ZH" "AF" "A9" ...
##  $ country           : chr [1:299] "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa" ...
##  $ capital_city      : chr [1:299] "Oranjestad" NA "Kabul" NA ...
##  $ longitude         : num [1:299] -70 NA 69.2 NA NA ...
##  $ latitude          : num [1:299] 12.5 NA 34.5 NA NA ...
##  $ region_iso3c      : chr [1:299] "LCN" NA "SAS" NA ...
##  $ region_iso2c      : chr [1:299] "ZJ" NA "8S" NA ...
##  $ region            : chr [1:299] "Latin America & Caribbean" "Aggregates" "South Asia" "Aggregates" ...
##  $ admin_region_iso3c: chr [1:299] NA NA "SAS" NA ...
##  $ admin_region_iso2c: chr [1:299] NA NA "8S" NA ...
##  $ admin_region      : chr [1:299] NA NA "South Asia" NA ...
##  $ income_level_iso3c: chr [1:299] "HIC" NA "LIC" NA ...
##  $ income_level_iso2c: chr [1:299] "XD" NA "XM" NA ...
##  $ income_level      : chr [1:299] "High income" "Aggregates" "Low income" "Aggregates" ...
##  $ lending_type_iso3c: chr [1:299] "LNX" NA "IDX" NA ...
##  $ lending_type_iso2c: chr [1:299] "XX" NA "XI" NA ...
##  $ lending_type      : chr [1:299] "Not classified" "Aggregates" "IDA" "Aggregates" ...

Joining data

As we saw the other day, there are three conditions to join data sets:

They must have at least one variable in common.
The variables in common must have the same name and the same type of data.
If they do not, we should change the names or the type of data before joining.

In our case there are three common variables: iso2c, iso3c and country. We can join based on any of them, although it is better to use a code variable.

When there is more than one common variable we can do several things:

Specify all of them in the by argument.

wbind <- wbind %>% %>% 
  left_join(countries, by=c("iso2c","iso3c","country"))

If they share information, let R guess what are the by variables leaving it blank. This can be problematic if not all variables share the content, for instance, if the country names differ among the data sets.

wbind <- wbind %>% %>% 
  left_join(countries)

Remove from the second set the rest of common variables and use only one of them. Otherwise R changes the names of the variables in both objects since there are two variables with the same name. This strategy takes a bit longer but it is safer.

wbind <- wbind %>% 
  left_join(countries %>% select(-iso2c,-country), by="iso3c")

Since we know the two data sets are compatible coming from the same source, We opt for the second (shortest) choice.

wbind <- wbind %>% 
  left_join(countries)

## Joining, by = c("iso2c", "iso3c", "country", "capital_city", "longitude", "latitude", "region_iso3c", "region_iso2c", "region", "admin_region_iso3c", "admin_region_iso2c", "admin_region", "income_level_iso3c", "income_level_iso2c", "income_level", "lending_type_iso3c", "lending_type_iso2c", "lending_type")

Note the message telling us the variables used for joining.

Now wbind has all the information in country:

str(wbind)

## tibble [217 x 22] (S3: tbl_df/tbl/data.frame)
##  $ iso2c             : chr [1:217] "AW" "AF" "AO" "AL" ...
##  $ iso3c             : chr [1:217] "ABW" "AFG" "AGO" "ALB" ...
##  $ country           : chr [1:217] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ date              : num [1:217] 2019 2019 2019 2019 2019 ...
##  $ gdp_capita        : num [1:217] NA 494 2810 5396 40897 ...
##   ..- attr(*, "label")= chr "GDP per capita (current US$)"
##  $ life_expectancy   : num [1:217] 76.3 64.8 61.1 78.6 NA ...
##   ..- attr(*, "label")= chr "Life expectancy at birth, total (years)"
##  $ pop               : num [1:217] 106310 38041757 31825299 2854191 77146 ...
##   ..- attr(*, "label")= chr "Population, total"
##  $ capital_city      : chr [1:217] "Oranjestad" "Kabul" "Luanda" "Tirane" ...
##  $ longitude         : num [1:217] -70.02 69.18 13.24 19.82 1.52 ...
##  $ latitude          : num [1:217] 12.52 34.52 -8.81 41.33 42.51 ...
##  $ region_iso3c      : chr [1:217] "LCN" "SAS" "SSF" "ECS" ...
##  $ region_iso2c      : chr [1:217] "ZJ" "8S" "ZG" "Z7" ...
##  $ region            : chr [1:217] "Latin America & Caribbean" "South Asia" "Sub-Saharan Africa" "Europe & Central Asia" ...
##  $ admin_region_iso3c: chr [1:217] NA "SAS" "SSA" "ECA" ...
##  $ admin_region_iso2c: chr [1:217] NA "8S" "ZF" "7E" ...
##  $ admin_region      : chr [1:217] NA "South Asia" "Sub-Saharan Africa (excluding high income)" "Europe & Central Asia (excluding high income)" ...
##  $ income_level_iso3c: chr [1:217] "HIC" "LIC" "LMC" "UMC" ...
##  $ income_level_iso2c: chr [1:217] "XD" "XM" "XN" "XT" ...
##  $ income_level      : chr [1:217] "High income" "Low income" "Lower middle income" "Upper middle income" ...
##  $ lending_type_iso3c: chr [1:217] "LNX" "IDX" "IBD" "IBD" ...
##  $ lending_type_iso2c: chr [1:217] "XX" "XI" "XF" "XF" ...
##  $ lending_type      : chr [1:217] "Not classified" "IDA" "IBRD" "IBRD" ...

Data manipulation: What we knew

See what these applications from the other day do:

wbind %>% 
  filter(admin_region=="South Asia") %>% 
  select(country,gdp_capita:pop) %>%
  flextable() %>%  # Create nice table
  colformat_double(digits = 1) # Limit to one the decimal digits.

country	gdp_capita	life_expectancy	pop
Afghanistan	494.2	64.8	38,041,757.0
Bangladesh	1,855.7	72.6	163,046,173.0
Bhutan	3,322.9	71.8	763,094.0
India	2,100.8	69.7	1,366,417,756.0
Sri Lanka	3,852.3	77.0	21,803,000.0
Maldives	10,561.6	78.9	530,957.0
Nepal	1,195.0	70.8	28,608,715.0
Pakistan	1,288.6	67.3	216,565,317.0

⊕ The effects of group_by

wbind %>%
  group_by(region) %>% 
  arrange(life_expectancy) %>% 
  slice(1) %>% 
  select(region,country,life_expectancy) %>%
  flextable() %>%  # Create nice table
  colformat_double(digits = 0) # No decimal digits.

region	country	life_expectancy
East Asia & Pacific	Papua New Guinea	65
Europe & Central Asia	Turkmenistan	68
Latin America & Caribbean	Haiti	64
Middle East & North Africa	Yemen, Rep.	66
North America	United States	79
South Asia	Afghanistan	65
Sub-Saharan Africa	Central African Republic	53

⊕ The effects of group_by

Creating new variables: `mutate` and renaming variables with `rename`

The GDP per capita is a measure of standard of living, how much money the average person makes. Sometimes we want to compare the size of the economy. We can create new variables specifying the formula that creates them. In this case, GDP=pop*gdp_capita. We should also pay attentions to units. Since gdp_capita was measured in dollars, we get too many figures. We can divide by \(1e9\) to express the size of the economy in billion dollars.

wbind %>% 
  mutate(GDP=pop*gdp_capita/1e9) %>% 
  arrange(-GDP) %>% 
  select(country,GDP,gdp_capita,pop) %>% 
  slice(1:10) %>% 
  rename(`Per capita GDP`=gdp_capita,
         `Population`=pop) %>% 
  flextable() %>%  # Create nice table
  colformat_double(digits = 0) # No decimal digits.

country	GDP	Per capita GDP	Population
United States	21,433	65,280	328,329,953
China	14,280	10,144	1,407,745,000
Japan	5,149	40,778	126,264,931
Germany	3,888	46,795	83,092,962
United Kingdom	2,879	43,070	66,836,327
India	2,871	2,101	1,366,417,756
France	2,729	40,579	67,248,926
Italy	2,009	33,642	59,729,081
Brazil	1,878	8,898	211,049,519
Canada	1,742	46,338	37,593,384

A very interesting possibility is to combine mutate with a group_by. This results in new columns that are calculated for each subtable separately. For instance, we can define at the region level a variable defined as the maximum life expectancy in the region minus the present one, and then check the countries whose life expectancy is most far from their region’s leader:

wbind %>% 
  group_by(region) %>% 
  mutate(dif_LE=max(life_expectancy)-life_expectancy) %>% 
  ungroup() %>% 
  arrange(-dif_LE) %>% 
  select(region, country,
         LE=life_expectancy,
         dif_LE) %>% 
  slice(1:10) %>% 
  flextable() %>%  # Create nice table
  colformat_double(digits = 2) # 2 decimal digits.

region	country	LE	dif_LE
Sub-Saharan Africa	Central African Republic	53.28	20.95
Sub-Saharan Africa	Chad	54.24	20.00
Sub-Saharan Africa	Lesotho	54.33	19.90
Sub-Saharan Africa	Nigeria	54.69	19.55
Sub-Saharan Africa	Sierra Leone	54.70	19.54
Sub-Saharan Africa	Somalia	57.40	16.84
Middle East & North Africa	Yemen, Rep.	66.12	16.68
Sub-Saharan Africa	Cote d'Ivoire	57.78	16.45
Sub-Saharan Africa	South Sudan	57.85	16.39
Sub-Saharan Africa	Guinea-Bissau	58.32	15.91

It is all countries in Africa! They are lagging behind Mauritius where life expectancy is 74.2358537.

⊕ For a simple table: For grouped data:

Creating summaries of data by group with `summarize`

summarize takes a data table (or subtable, if groups are defined), and defines a summary indicator for the complete table. It is generally used in combination with functions such as n(), for counts, max() or min() (maximum and minimum of a variable in a group), mean() or median(), providing average levels. In the case of average life expectancy at the region level, we should not use the simple average because some countries have more population than others. We should use the weighted average, giving more weight according to population.

## Average (weighted) levels by continent
wbind %>% 
  group_by(region) %>%  # Analyze each region separately
  drop_na(pop) %>%      # Drop countries with no pop data (Eritrea)
  summarize(`Weighted average`=weighted.mean(life_expectancy,pop,na.rm=TRUE),
 `Simple average`=mean(life_expectancy,na.rm=TRUE),
 Countries=n(),
 Missing=sum(is.na(life_expectancy)),
 `Pop (millions)`=sum(pop)/1e6
            ) %>%      # Calculate summary statistics by group
  arrange(`Weighted average`) %>% # Order regions according to increasing average level
  flextable() %>%  # Create nice table
  colformat_double(digits = 1) %>% 
  set_caption("Average life expectancy by region")

Average life expectancy by region
region	Weighted average	Simple average	Countries	Missing	Pop (millions)
Sub-Saharan Africa	61.6	62.8	47	0	1,103.5
South Asia	69.6	71.6	8	0	1,835.8
Middle East & North Africa	74.3	75.5	21	0	456.7
Latin America & Caribbean	75.6	75.4	42	7	646.4
East Asia & Pacific	76.2	74.7	37	6	2,327.1
Europe & Central Asia	78.0	78.1	58	6	920.8
North America	79.1	80.9	3	0	366.0

We see the weighted average is generally smaller. This is because small countries (eg: Monaco, Andorra, Mauritius, Hong-Kong), tend to have higher life expectancy.

⊕Basic, necessary, elements: Optional layers: # Doing graphs

The next day we will explore in detail how graphs are done using {ggplot2}. The idea is to:

Get the data object, generally a tibble.
Define what is in the paper, the aesthetics: Which are the \(x\) and \(y\) variables, do we want to use color linked to another variable, or size, …
Add the geometrical objects geom_xxx that we want.

Adding things to the graphic is done through + (instead of %>%)!

Finally there is the possibility of adding more layers. This is a summary from this source

This is the code producing our graph. Try to make sense of each layer.

wbind %>%
  ggplot(aes(
      x = gdp_capita, 
      y = life_expectancy
      )) +
  geom_point(aes(size=pop, color=region)) +
  geom_smooth(se = FALSE) +
  scale_x_log10(
    labels = scales::dollar_format(),
    breaks = scales::log_breaks(n = 4)
  ) +
  scale_size_continuous(
    labels = scales::number_format(scale = 1/1e6, suffix = "m"),
    breaks = seq(1e8,1e9, 2e8),
    range = c(1,20)
    ) +
  theme_minimal() +
  labs(
    title = "Life expectancy according to GDP per capita",
    x = "GDP per Capita (log scale)",
    y = "Life expectancy (years)",
    size = "Population",
    color = NULL,
    caption = "Source: World Bank, through `wbstats`"
  )

Maps with `ggplot2`

Mapping the data is also relatively simple. We only need to specify a base map to which the color will depend on per-capita income. We can do it with ggplot2 replicating the vignette of {wbstats}.

One possibility is the world map in package tmap. We get it in our environment by:

library(tmap)
data(World)

The object is an sf (simple features) object. Basically a data object with one specific variable, geometry, that contains the polygons defining the country map.

str(World)

## Classes 'sf' and 'data.frame':   177 obs. of  16 variables:
##  $ iso_a3      : Factor w/ 177 levels "AFG","AGO","ALB",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ name        : Factor w/ 177 levels "Afghanistan",..: 1 4 2 166 6 7 5 56 8 9 ...
##  $ sovereignt  : Factor w/ 171 levels "Afghanistan",..: 1 4 2 159 6 7 5 52 8 9 ...
##  $ continent   : Factor w/ 8 levels "Africa","Antarctica",..: 3 1 4 3 8 3 2 7 6 4 ...
##  $ area        : Units: [km^2] num  652860 1246700 27400 71252 2736690 ...
##  $ pop_est     : num  28400000 12799293 3639453 4798491 40913584 ...
##  $ pop_est_dens: num  43.5 10.3 132.8 67.3 15 ...
##  $ economy     : Factor w/ 7 levels "1. Developed region: G7",..: 7 7 6 6 5 6 6 6 2 2 ...
##  $ income_grp  : Factor w/ 5 levels "1. High income: OECD",..: 5 3 4 2 3 4 2 2 1 1 ...
##  $ gdp_cap_est : num  784 8618 5993 38408 14027 ...
##  $ life_exp    : num  59.7 NA 77.3 NA 75.9 ...
##  $ well_being  : num  3.8 NA 5.5 NA 6.5 4.3 NA NA 7.2 7.4 ...
##  $ footprint   : num  0.79 NA 2.21 NA 3.14 2.23 NA NA 9.31 6.06 ...
##  $ inequality  : num  0.427 NA 0.165 NA 0.164 ...
##  $ HPI         : num  20.2 NA 36.8 NA 35.2 ...
##  $ geometry    :sfc_MULTIPOLYGON of length 177; first list element: List of 1
##   ..$ :List of 1
##   .. ..$ : num [1:69, 1:2] 61.2 62.2 63 63.2 64 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:15] "iso_a3" "name" "sovereignt" "continent" ...

class(World)

## [1] "sf"         "data.frame"

It already has some variables that could be used for maps.

We can also see the map by:

ggplot(data = World) +
    geom_sf()

In order to map our data we have to bring it to the World object. Let us see how:

mi_mapa=World %>% 
  filter(iso_a3 != "ATA") %>% # remove Antarctica
  left_join(wbind %>% rename(iso_a3=iso3c) %>%      select(iso_a3,gdp_capita,                            pop,life_expectancy),
     by="iso_a3")

Now we do the new map

mi_mapa %>% ggplot(aes(fill = life_expectancy)) +
  geom_sf() +
  scale_fill_viridis_c() +
  theme(legend.position="bottom") +
  labs(
    title = "Map of life expectancy in 2019",
    fill = NULL,
    caption = paste("Source: World Bank") 
  ) +
   coord_sf()

Monitoring global indicators with R and the tidyverse (ii)

Global Development Indicators, Global Studies

José Antonio Ortega

01 mar., 2022

Introduction

New packages to install

Getting data from the World Bank

Joining data

Data manipulation: What we knew

Creating new variables: `mutate` and renaming variables with `rename`

Creating summaries of data by group with `summarize`

Maps with `ggplot2`

Monitoring global indicators with R and the tidyverse (ii)

Global Development Indicators, Global Studies

José Antonio Ortega

01 mar., 2022

Introduction

New packages to install

Getting data from the World Bank

Joining data

Data manipulation: What we knew

Creating new variables: mutate and renaming variables with rename

Creating summaries of data by group with summarize

Maps with ggplot2

Creating new variables: `mutate` and renaming variables with `rename`

Creating summaries of data by group with `summarize`

Maps with `ggplot2`