Monitoring global indicators with R and the tidyverse

Global Development Indicators, Global Studies

José Antonio Ortega

13 feb., 2024

Introduction to R

We are going to use R for monitoring global development indicators throughout the course. We do not need to become experts in R, but a few key ideas will make it easier for us to proceed. You can check the references provided in Studium, Timbers, Campbell, and Lee (2023Timbers, Tiffany-Anne, Trevor Campbell, and Melissa Lee. 2023. Data Science : An Introduction. S.l: CRC Press. https://datasciencebook.ca/.), Larsen and Fazekas (2021Larsen, Erik Gahner, and Zoltán Fazekas. 2021. Quantitative Politics with R. https://github.com/erikgahner/qpolr/raw/master/book/_book/qpolr.pdf.), Wickham, Mine Çetinkaya-Rundel, and Grolemund (2023Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. O’Reilly Media, Inc. https://r4ds.hadley.nz/.). We will also use figures from Ismay and Kim (2024Ismay, Chester, and Albert Y. Kim. 2024. ModernDive: An Introduction to Statistical and Data Sciences via r. moderndive.com. https://moderndive.com/.). Note that references are listed in the margin.

Installing

You can install the latest versions of R and RStudio for your platform from the RStudio website. After installing both, in that order, you can install the packages we will use for this course. You can run this code in the console when you open RStudio:

install.packages("tidyverse",dependencies=TRUE)
install.packages("gapminder")
install.packages("plotly")
install.packages("wbstats")
install.packages("remotes")
remotes::install_github("caldwellst/goalie")
install.packages("countrycode")
install.packages("flextable")
install.packages("tufte")
install.packages("tmap")
install.packages("patchwork")
install.packages("ggrepel")

Objects, functions

The difficult part of R is that it is a language, not menu-driven. It therefore expects you to make an order with the > sign. You can do, for instance, arithmetic operations or text operations

2 + 2
## [1] 4
paste("My name is","R.","Pleased to meet you!")
## [1] "My name is R. Pleased to meet you!"

R works with objects. The three kinds of objects we will use most are elementary objects, data objects and functions. Objects have a name that can be assigned either through the = or the <- operators. An elementary object is a collection of elements of the same type (numeric, character, …). Let us define to elementary objects we will use later:

my_countries=c("Angola","Mozambique")
my_years <- c(1980,2000,2011:2020)

Assignment operators: = , <-

When you assign a name to an object it becomes available in the environment. You can check it in the RStudio window with that name.

The second type of objects are functions. Functions are equivalent to verbs, they do something with the objects available or supplied. We have seen several functions already install.packages, which says what it does, paste, also intuitive, and c, from collection, that puts together differents elements in an object. You can get help regarding what R functions do in the help window or typing ? paste, for instance.

function(arg1=default1,arg2=default2, …)

Functions generally require arguments, that qualify the action required. The help function provides details regarding the arguments. Also, RStudio reminds you the available arguments when you start writing the name of a function. In order for R to know the values of arguments, you can either write the name of the argument followed by the value, or rely on the order of the argument. For instance, paste has an argument sep. You can set it like this

paste(1:length(my_countries),my_countries,sep=": ")
## [1] "1: Angola"     "2: Mozambique"

If we want to select parts of an elementary object we do it with brackets:

my_years[2:4]
## [1] 2000 2011 2012

Exercises:

  1. The object letters contains the letters in the alphabet. Create an object called my_letters that contains letters from the sixth to the tenth.

Data objects: tibbles and data.frames

For monitoring indicators the key type of object is a data object. A data object is similar to a spreadsheet having columns with different variables that take values of the same type, and rows having observations. Such data is called to be in tidy form.

Data can be obtained from an external file in any data format or using a package to download data from a website or a server. It is a very smooth process! We will use the {goalie} package (Caldwell 2022Caldwell, Seth. 2022. Goalie: R Client for the Sustainable Development Goals API. https://gpw13.github.io/goalie/.) to download official SDG indicators data from https://unstats.un.org/sdgs/indicators/database/. Before proceeding we need to make the relevant R packages available to our session with library.

library(goalie)
library(tidyverse)
pov=sdg_data("SI_POV_DAY1")

We can check the first lines of pov with pov.

pov
## # A tibble: 9,775 × 23
##     Goal Target Indicator SeriesCode  SeriesDescription  GeoAreaCode GeoAreaName
##    <dbl>  <dbl> <chr>     <chr>       <chr>                    <dbl> <chr>      
##  1     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…           1 World      
##  2     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…           1 World      
##  3     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…           1 World      
##  4     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…           1 World      
##  5     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…           1 World      
##  6     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…           1 World      
##  7     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…           1 World      
##  8     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…           1 World      
##  9     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…           1 World      
## 10     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…           1 World      
## # ℹ 9,765 more rows
## # ℹ 16 more variables: TimePeriod <dbl>, Value <dbl>, Time_Detail <dbl>,
## #   TimeCoverage <lgl>, UpperBound <lgl>, LowerBound <lgl>, BasePeriod <dbl>,
## #   Source <chr>, GeoInfoUrl <lgl>, FootNote <chr>, Age <chr>, Location <chr>,
## #   Nature <chr>, Reporting_Type <chr>, Sex <chr>, Units <chr>

You can see it is a very large table, made up of 9775 observations and 23 variables. In order to inspect it, you can click on the object name in the Global environment window. Please do so in your RStudio session. We can also do things, similar to a spreadsheet, in the viewer pane that opened. The difference is that we can not make changes by error as in a spreadsheet. We can see that the data corresponds to the SDG indicator Proportion of population below international poverty line (%) that corresponds to SDG goal 1 and target 1.1. The current poverty line is $2.15 a day at 2017 international prices. The main column containing the data is a variable called Value, and the data observations are identified by a combination of area (identified either by GeoAreaCode or GeoAreaName) and year (TimePeriod). The indicator code, SI_POV_DAY1, serves to identify the data. As you see, this was the only argument we had to supply the function sdg_data() so that it could query the database. Note that the data returned includes all countries and periods, and all available combinations of sex, age, and location (rural, urban, both). This is in line with the SDG mandate to monitor separately for sub-groups whenever possible.

It is very often easier to get all the available data, as in this case, and later use R to select the particular data points we are interested in analyzing. That is the kind of elementary operations that we need to learn.

Tidyverse: manipulation of data objects

The tidyverse and the pipes, %>% |>

magrittr pipe triangle pipe

The so-called tidyverse is a collection of packages (including tidyr, dplyr, ggplot2) that facilitate our work with data objects. We will be working with data objects all the time. Let us learn how to select our data of interest and make some tables and plots with it.

Before doing so, let me present you with our friends, the pipes, also known as then operators. The native triangle pipe is written as |> and you can picture it as a triangle. It basically replaces the previous pipe %>% that could only be used in the tidyverse. You can write the pipe in RStudio with CTRL+SHIFT+M, and you can choose your favorite one (I recommend |>), in the Tools/Global Options/Code page of RStudio. The pipe (tuberia) or then operator, makes it easy to instruct R to do things in sequence. When you write something with the pipe such as

wakeup() |>  wash_teeth(tootbrush,toothpaste) |> 
drink_coffee()  |>  go_to_work(by=bus)

you are using a code that is as close as possible to natural language. The idea is that the result of the previous statement is used as the argument for the next function in the pipe. All the functions in the tidyverse work that way.

filter and slice for selecting rows

In such a big object a first thing is to select parts of the data. If we are interested only in some of the observations, say, those from Spain from dates after 2018, we can do it as follows

pov |>  filter(GeoAreaName=="Spain",
               TimePeriod>2018)
## # A tibble: 16 × 23
##     Goal Target Indicator SeriesCode  SeriesDescription  GeoAreaCode GeoAreaName
##    <dbl>  <dbl> <chr>     <chr>       <chr>                    <dbl> <chr>      
##  1     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  2     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  3     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  4     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  5     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  6     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  7     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  8     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  9     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## 10     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## 11     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## 12     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## 13     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## 14     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## 15     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## 16     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## # ℹ 16 more variables: TimePeriod <dbl>, Value <dbl>, Time_Detail <dbl>,
## #   TimeCoverage <lgl>, UpperBound <lgl>, LowerBound <lgl>, BasePeriod <dbl>,
## #   Source <chr>, GeoInfoUrl <lgl>, FootNote <chr>, Age <chr>, Location <chr>,
## #   Nature <chr>, Reporting_Type <chr>, Sex <chr>, Units <chr>

It is important to use ==,two equals. This is the comparison operator as different from the assignment operator.

A problem is that R does not fit all the variables and we don’t see what we get. We will learn later how to select variables. We can also use a nice package for sophisticated tables, flextable. Making a nice table is as simple as adding flextable() to the pipe flow.

library(flextable)
pov  |>  filter(GeoAreaName=="Spain",
                TimePeriod>2018) |>  flextable()

Goal

Target

Indicator

SeriesCode

SeriesDescription

GeoAreaCode

GeoAreaName

TimePeriod

Value

Time_Detail

TimeCoverage

UpperBound

LowerBound

BasePeriod

Source

GeoInfoUrl

FootNote

Age

Location

Nature

Reporting_Type

Sex

Units

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,019

0.80000

2,019

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,019

1.31139

2,019

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.

65+

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,019

0.78794

2,019

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.

15-64

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,019

0.86093

2,019

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.

ALLAGE

ALLAREA

G

G

FEMALE

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,019

0.79484

2,019

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.

<15Y

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,019

0.75608

2,019

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.

ALLAGE

RURAL

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,019

0.93063

2,019

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.

ALLAGE

ALLAREA

G

G

MALE

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,019

0.94392

2,019

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.

ALLAGE

URBAN

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,020

1.02577

2,020

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.

15-64

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,020

1.32191

2,020

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.

<15Y

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,020

0.90000

2,020

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,020

1.06109

2,020

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.

ALLAGE

ALLAREA

G

G

MALE

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,020

1.19880

2,020

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.

ALLAGE

ALLAREA

G

G

FEMALE

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,020

1.34605

2,020

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.

65+

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,020

1.14139

2,020

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.

ALLAGE

URBAN

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,020

1.06405

2,020

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.

ALLAGE

RURAL

G

G

BOTHSEX

PERCENT

We can see that there are different indicators according to age and sex. If we want to concentrate on time trends for a specific population, we should filter accordingly:

pov  |>  filter(GeoAreaName=="Spain",
                Age=="ALLAGE",
                Sex=="BOTHSEX",
                Location=="ALLAREA") |>
  flextable()

Goal

Target

Indicator

SeriesCode

SeriesDescription

GeoAreaCode

GeoAreaName

TimePeriod

Value

Time_Detail

TimeCoverage

UpperBound

LowerBound

BasePeriod

Source

GeoInfoUrl

FootNote

Age

Location

Nature

Reporting_Type

Sex

Units

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

1,980

0.5

1,980

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from HBS-LIS. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

1,985

0.2

1,985

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from HBS-LIS. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

1,990

0.5

1,990

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from HBS-LIS. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

1,995

1.2

1,995

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from ECHP-LIS. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,000

0.5

2,000

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from ECHP-LIS. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,003

0.8

2,003

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,004

0.8

2,004

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,005

0.7

2,005

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,006

0.8

2,006

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,007

0.4

2,007

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,008

0.5

2,008

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,009

0.8

2,009

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,010

0.7

2,010

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,011

1.1

2,011

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,012

0.9

2,012

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,013

1.1

2,013

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,014

0.8

2,014

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,015

0.9

2,015

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,016

1.1

2,016

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,017

0.7

2,017

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,018

0.8

2,018

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,019

0.8

2,019

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

2,020

0.9

2,020

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

There is one more useful function for selecting observations: slice. slice selects by sequence. For instance, if we want the first three data points for Spain we would do:

pov  |>  
  filter(GeoAreaName=="Spain") |> 
  slice(1:3) |> 
  flextable()

Goal

Target

Indicator

SeriesCode

SeriesDescription

GeoAreaCode

GeoAreaName

TimePeriod

Value

Time_Detail

TimeCoverage

UpperBound

LowerBound

BasePeriod

Source

GeoInfoUrl

FootNote

Age

Location

Nature

Reporting_Type

Sex

Units

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

1,980

0.5

1,980

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from HBS-LIS. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

1,985

0.2

1,985

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from HBS-LIS. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

724

Spain

1,990

0.5

1,990

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from HBS-LIS. Estimated from unit-record income data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

We see that the proportions living in extreme poverty in Spain, while always low, are larger in the latest data points than in the 1980s.

Sometimes we are interested not in matching one value but a range of values. We already had created the objects my_countries and my_values. We can use them in the filtering in connection with the operator %in%:

pov |>  
  filter(GeoAreaName %in% my_countries,
         Location=="ALLAREA",Age=="ALLAGE",
         Sex=="BOTHSEX")  |>  
   flextable()

Goal

Target

Indicator

SeriesCode

SeriesDescription

GeoAreaCode

GeoAreaName

TimePeriod

Value

Time_Detail

TimeCoverage

UpperBound

LowerBound

BasePeriod

Source

GeoInfoUrl

FootNote

Age

Location

Nature

Reporting_Type

Sex

Units

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

24

Angola

2,000

21.4

2,000

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from HBS. Estimated from unit-record consumption data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

24

Angola

2,008

14.6

2,008

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from IBEP-MICS. Estimated from unit-record consumption data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

24

Angola

2,018

31.1

2,018

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from IDREA. Estimated from unit-record consumption data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

508

Mozambique

1,996

82.7

1,996

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from NHS. Estimated from unit-record consumption data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

508

Mozambique

2,002

80.6

2,002

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from IAF. Estimated from unit-record consumption data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

508

Mozambique

2,008

70.8

2,008

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from IOF. Estimated from unit-record consumption data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

1

1.1

1.1.1

SI_POV_DAY1

Proportion of population below international poverty line (%)

508

Mozambique

2,014

64.6

2,014

2,017

Poverty and Inequality Portal, World Bank

Accessed March 21, 2023. Based on data from IOF. Estimated from unit-record consumption data.

ALLAGE

ALLAREA

G

G

BOTHSEX

PERCENT

select to select variables

The tables above are cumbersome because they display too much information that is not needed. We would like to focus on the variables of our interest. This can be done with **select**, providing the list of variables we want to select. We can provide them by name:

pov |>  
  filter(GeoAreaName %in% my_countries[1],
         Location=="ALLAREA",Age=="ALLAGE",
         Sex=="BOTHSEX")  |>  
  select(GeoAreaName,TimePeriod,Value) |> 
   flextable()

GeoAreaName

TimePeriod

Value

Angola

2,000

21.4

Angola

2,008

14.6

Angola

2,018

31.1

where we see that the proportions living in extreme poverty in Angola increased in 2018.

We can also use helper-functions so that we do not need to type all the variable names:

pov |> 
  filter(GeoAreaName%in% my_countries,
         Location=="ALLAREA",Age=="ALLAGE",
         Sex=="BOTHSEX"
         ) |>
  select(starts_with("GeoArea"),TimePeriod,Value) |> 
   flextable()

GeoAreaCode

GeoAreaName

TimePeriod

Value

24

Angola

2,000

21.4

24

Angola

2,008

14.6

24

Angola

2,018

31.1

508

Mozambique

1,996

82.7

508

Mozambique

2,002

80.6

508

Mozambique

2,008

70.8

508

Mozambique

2,014

64.6

We see the poverty situation is even worse in Mozambique. We also see the GeoAreaCode. This is a unique number identifier of the country or region. They are called ISO-CODES. ISO-CODES are useful for identification without having to write the complete name or caring about ortography. The problem with numeric ones is that the are difficult to retain. We will also use 3-letter codes that are easier to retain. You can find them in the link provided, there is also a package in R, countrycode, that contains all of them. To see the problems I comment, let us try to get the isocode data for some countries using the object codelist provided by countrycode.

library(countrycode)
codelist |> filter(country.name.en %in% 
              c(my_countries,"Sao Tome and Principe"))
## # A tibble: 2 × 621
##   ar5   cctld continent country.name.de country.name.de.regex country.name.en
##   <chr> <chr> <chr>     <chr>           <chr>                 <chr>          
## 1 MAF   .ao   Africa    Angola          angola                Angola         
## 2 MAF   .mz   Africa    Mosambik        mosambik              Mozambique     
## # ℹ 615 more variables: country.name.en.regex <chr>, country.name.fr <chr>,
## #   country.name.fr.regex <chr>, country.name.it <chr>,
## #   country.name.it.regex <chr>, cow.name <chr>, cowc <chr>, cown <dbl>,
## #   currency <chr>, dhs <chr>, ecb <chr>, eu28 <chr>, eurocontrol_pru <chr>,
## #   eurocontrol_statfor <chr>, eurostat <chr>, fao <dbl>, fips <chr>,
## #   gaul <dbl>, genc2c <chr>, genc3c <chr>, genc3n <chr>, gwc <chr>, gwn <dbl>,
## #   icao.region <chr>, imf <dbl>, ioc <chr>, iso.name.en <chr>, …

It finds our 2 countries, but it does not find a third former Colony of Portugal in Africa, Sao Tome and Principe, because the official name is “São Tomé & Príncipe”, despite those signs not being “English” letters. We could find it with:

codelist |> filter(country.name.en == "São Tomé & Príncipe")
## # A tibble: 1 × 621
##   ar5   cctld continent country.name.de    country.name.de.regex country.name.en
##   <chr> <chr> <chr>     <chr>              <chr>                 <chr>          
## 1 MAF   .st   Africa    Sao Tome und Prin… "\\bs(a|ã)o.?tom(e|é… São Tomé & Prí…
## # ℹ 615 more variables: country.name.en.regex <chr>, country.name.fr <chr>,
## #   country.name.fr.regex <chr>, country.name.it <chr>,
## #   country.name.it.regex <chr>, cow.name <chr>, cowc <chr>, cown <dbl>,
## #   currency <chr>, dhs <chr>, ecb <chr>, eu28 <chr>, eurocontrol_pru <chr>,
## #   eurocontrol_statfor <chr>, eurostat <chr>, fao <dbl>, fips <chr>,
## #   gaul <dbl>, genc2c <chr>, genc3c <chr>, genc3n <chr>, gwc <chr>, gwn <dbl>,
## #   icao.region <chr>, imf <dbl>, ioc <chr>, iso.name.en <chr>, …

but, how do we know? It is better to use codes.

Ordering data, arrange

Our data is currently organized according to country. If we want it to be organized according to another combination of variables we use arrange. For instance, if we want to order the data for our countries first in descending order of Value we use

pov |> 
  filter(GeoAreaName%in%my_countries,
         Location=="ALLAREA",Age=="ALLAGE",
         Sex=="BOTHSEX")  |>  
  select(starts_with("GeoArea"),TimePeriod,Value) |>
  arrange(-Value) |>  flextable()

GeoAreaCode

GeoAreaName

TimePeriod

Value

508

Mozambique

1,996

82.7

508

Mozambique

2,002

80.6

508

Mozambique

2,008

70.8

508

Mozambique

2,014

64.6

24

Angola

2,018

31.1

24

Angola

2,000

21.4

24

Angola

2,008

14.6

We see the worse problems happen in Mozambique. Using -Value means in descending order for numeric variables. For reverse alphabetical order we would need desc(GeoAreaName). We can combine it with other criteria that are applied in case of ties:

pov |> 
  filter(GeoAreaName%in%my_countries,
         Location=="ALLAREA",Age=="ALLAGE",
         Sex=="BOTHSEX",TimePeriod>=2000)  |>  
  select(starts_with("GeoArea"),TimePeriod,Value) |>
  arrange(desc(GeoAreaName),TimePeriod)
## # A tibble: 6 × 4
##   GeoAreaCode GeoAreaName TimePeriod Value
##         <dbl> <chr>            <dbl> <dbl>
## 1         508 Mozambique        2002  80.6
## 2         508 Mozambique        2008  70.8
## 3         508 Mozambique        2014  64.6
## 4          24 Angola            2000  21.4
## 5          24 Angola            2008  14.6
## 6          24 Angola            2018  31.1

If we only provide the name of a variable, it is ordered in natural order. We can add several conditions separated by commas. For instance, if we want to see the 5 countries with the highest poverty rates for all the population in the latest available year(s) we would do:

pov |>
  filter(Location=="ALLAREA",Age=="ALLAGE",
         Sex=="BOTHSEX") |>
  select(starts_with("GeoArea"),TimePeriod,Value) |>
  arrange(-TimePeriod,-Value) |> 
  slice(1:5) |> flextable()

GeoAreaCode

GeoAreaName

TimePeriod

Value

360

Indonesia

2,022

2.5

170

Colombia

2,021

6.6

76

Brazil

2,021

5.8

268

Georgia

2,021

5.5

19

Americas

2,021

4.7

We see that the latest data corresponds to 2022, and among the data for that year the highest corresponds to Indonesia. However, we found that our two countries, for which there is no data available after 2018, had higher poverty rates. Note that we have to be careful about how to analyze the data. We will deal with this by choosing for each country its latest data.

Grouping data for analysis

Now we are getting to one of the key ideas in data management: In order to be sure not to miss countries, we would like to keep for each country the latest observation. This is not the same as filtering the data for 2020 or 2021 that does not contain all the countries. How to do it? In the tidyverse we define groups with group_by. After defining groups, the operations we do are defined at the group level, as if we were working with separate small data tables. For instance, to get the latest poverty data for the complete population we would do

latest = pov |> 
  filter(Location=="ALLAREA",Age=="ALLAGE",
         Sex=="BOTHSEX") |>
  group_by(GeoAreaName) |> 
  arrange(-TimePeriod) |> 
  slice(1) |> 
  ungroup()

This creates a new data object called latest containing only our data of interest. We had to use ungroup() to tell the tidyverse that, from now onwards, we don’t want to analyze by country, but data for all countries. To keep the countries with the worse poverty we would do

latest |> select(starts_with("GeoArea"),TimePeriod,Value) |>
  arrange(-Value) |> 
  slice(1:10) |> flextable()

GeoAreaCode

GeoAreaName

TimePeriod

Value

860

Uzbekistan

2,003

81.5

450

Madagascar

2,012

80.7

454

Malawi

2,019

70.1

180

Democratic Republic of the Congo

2,012

69.7

728

South Sudan

2,016

67.3

108

Burundi

2,013

65.1

508

Mozambique

2,014

64.6

140

Central African Republic

2,008

61.9

894

Zambia

2,015

61.4

646

Rwanda

2,016

52.0

Note that we could have done all the analysis in one-step, but sometimes it is useful to retain objects, such as latest, for future analysis. Now the picture is quite different! Note that many of the countries with the highest values only have relatively old data. This might not be by chance: poverty surveys are expensive and technically complex, and the poorest countries do not carry them out often. Note also that the vast majority of countries belong to sub-Saharan Africa, the region of the world most problematic in terms of SDG achievement.

Joining different datasets: left_join

Note that the geographic information in this dataset is rather sparse: we only know the numeric code and the country name but we do not know in what country or region the country lies, or its 3-letter isocodes. We have that information in a separate object, however: codelist. How can we obtain that information from codelist and bring it to pov: by joining datasets. This is another key concept in data handling.

The key ideas for joining two datasets are the following:

Joining can be done according to the combination of variables specified in the by argument. If left unspecified, all the variables with the same names are tried (although there might be errors if the conditions set above are not met).

In our case, we want to use the numeric ISO code to do the join. In pov it is called GeoAreaCode and it is a numeric variable. the case of codelist, there is a variable called iso3n that has the country code and it is also numeric (You can inspect the object if you need so with codelist |> View()). In order to do the join, we need to rename the variable iso3n as GeoAreaCode prior to joining. We can do it with a special function, rename, or else directly within the select argument providing a name. If we do not want to bring all the variables in codelist to pov we could also select them prior to joining. For this time, we use all variables.

pov2 = pov |> 
  left_join(codelist |> rename(GeoAreaCode=iso3n))

We can inspect the, widened, data object with pov2 |> View. It has probably TOO MANY variables. If we want only to add the variables we are interested in we can select them prior to joining. For instance,

latest2 = latest |> 
  left_join(codelist |>
              select(GeoAreaCode=iso3n, un.name.es,
                     continent, region, iso3c),
            by="GeoAreaCode")

Now we have possibilities for further analysis. For instance, let us choose the country in each region with the highest poverty rate. That would require grouping by region.

latest2 |> group_by(region) |> 
  slice_max(Value) |> 
  select(region,iso3c,GeoAreaName,Value) |> 
  arrange(-Value) |> flextable()

region

iso3c

GeoAreaName

Value

Europe & Central Asia

UZB

Uzbekistan

81.5

Sub-Saharan Africa

MDG

Madagascar

80.7

Middle Africa

45.8

East Asia & Pacific

PNG

Papua New Guinea

39.7

Latin America & Caribbean

HTI

Haiti

29.2

Middle East & North Africa

YEM

Yemen

19.8

South Asia

BGD

Bangladesh

13.5

North America

CAN

Canada

0.2

North America

USA

United States of America

0.2

We see here how Europe & Cental Asia and sub-Saharan Africa are on top, and how in many regions of the world, including the most populated, there is no single country with poverty rates exceeding 30%. Note also that the 3-letter isocodes are much more intuitive.

Missing data: NA

One more thing to note is that there is a “strange” data with no region name for Middle Africa. Why is this? The pov object does not only contain rates for countries. It also includes country aggregates such as middle Africa. These aggregates do not have a regional code or an iso3c code. If we look at the latest2 data object we will see that such data is shown in red and displayed as “NA”, which means not available.

Very often we do not want to keep the data that has NA in some dimensions. This can be filtered out using the is.na() function, or its negation !is.na() which keeps values different from NA. is.na() is an example of a logical function. It can have two values: TRUE or FALSE (also NA is a possibility). Let us redefine the previous table removing the NAs and adding a table caption:

latest2 |> 
  filter(!is.na(iso3c)) |> 
  group_by(region) |> 
  slice_max(Value) |> 
  select(region,iso3c,GeoAreaName,TimePeriod,Value) |> 
  arrange(-Value) |> 
  flextable() |> 
  set_caption ("Country by region with the highest latest poverty rate (%)")
Country by region with the highest latest poverty rate (%)

region

iso3c

GeoAreaName

TimePeriod

Value

Europe & Central Asia

UZB

Uzbekistan

2,003

81.5

Sub-Saharan Africa

MDG

Madagascar

2,012

80.7

East Asia & Pacific

PNG

Papua New Guinea

2,009

39.7

Latin America & Caribbean

HTI

Haiti

2,012

29.2

Middle East & North Africa

YEM

Yemen

2,014

19.8

South Asia

BGD

Bangladesh

2,016

13.5

North America

CAN

Canada

2,018

0.2

North America

USA

United States of America

2,020

0.2

What is left to do?

We have taken a first look at how to work with table data in R. We have covered a lot of the basics we need.

Next week we will learn some other things such as: