Tidyverse: manipulation of data objects

The tidyverse and the pipes, %>% |>

⊕ magrittr pipe triangle pipe

The so-called tidyverse is a collection of packages (including tidyr, dplyr, ggplot2) that facilitate our work with data objects. We will be working with data objects all the time. Let us learn how to select our data of interest and make some tables and plots with it.

Before doing so, let me present you with our friends, the pipes, also known as then operators. The native triangle pipe is written as |> and you can picture it as a triangle. It basically replaces the previous pipe %>% that could only be used in the tidyverse. You can write the pipe in RStudio with CTRL+SHIFT+M, and you can choose your favorite one (I recommend |>), in the Tools/Global Options/Code page of RStudio. The pipe (tuberia) or then operator, makes it easy to instruct R to do things in sequence. When you write something with the pipe such as

wakeup() |>  wash_teeth(tootbrush,toothpaste) |> 
drink_coffee()  |>  go_to_work(by=bus)

you are using a code that is as close as possible to natural language. The idea is that the result of the previous statement is used as the argument for the next function in the pipe. All the functions in the tidyverse work that way.

⊕

filter and slice for selecting rows

In such a big object a first thing is to select parts of the data. If we are interested only in some of the observations, say, those from Spain from dates after 2018, we can do it as follows

pov |>  filter(GeoAreaName=="Spain",
               TimePeriod>2018)

## # A tibble: 16 × 23
##     Goal Target Indicator SeriesCode  SeriesDescription  GeoAreaCode GeoAreaName
##    <dbl>  <dbl> <chr>     <chr>       <chr>                    <dbl> <chr>      
##  1     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  2     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  3     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  4     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  5     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  6     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  7     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  8     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
##  9     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## 10     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## 11     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## 12     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## 13     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## 14     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## 15     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## 16     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop…         724 Spain      
## # ℹ 16 more variables: TimePeriod <dbl>, Value <dbl>, Time_Detail <dbl>,
## #   TimeCoverage <lgl>, UpperBound <lgl>, LowerBound <lgl>, BasePeriod <dbl>,
## #   Source <chr>, GeoInfoUrl <lgl>, FootNote <chr>, Age <chr>, Location <chr>,
## #   Nature <chr>, Reporting_Type <chr>, Sex <chr>, Units <chr>

It is important to use ==,two equals. This is the comparison operator as different from the assignment operator.

A problem is that R does not fit all the variables and we don’t see what we get. We will learn later how to select variables. We can also use a nice package for sophisticated tables, flextable. Making a nice table is as simple as adding flextable() to the pipe flow.

library(flextable)
pov  |>  filter(GeoAreaName=="Spain",
                TimePeriod>2018) |>  flextable()

Goal	Target	Indicator	SeriesCode	SeriesDescription	GeoAreaCode	GeoAreaName	TimePeriod	Value	Time_Detail	BasePeriod	Source	FootNote	Age	Location	Nature	Reporting_Type	Sex	Units
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,019	0.80000	2,019	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,019	1.31139	2,019	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.	65+	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,019	0.78794	2,019	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.	15-64	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,019	0.86093	2,019	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.	ALLAGE	ALLAREA	G	G	FEMALE	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,019	0.79484	2,019	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.	<15Y	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,019	0.75608	2,019	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.	ALLAGE	RURAL	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,019	0.93063	2,019	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.	ALLAGE	ALLAREA	G	G	MALE	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,019	0.94392	2,019	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.	ALLAGE	URBAN	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,020	1.02577	2,020	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.	15-64	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,020	1.32191	2,020	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.	<15Y	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,020	0.90000	2,020	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,020	1.06109	2,020	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.	ALLAGE	ALLAREA	G	G	MALE	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,020	1.19880	2,020	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.	ALLAGE	ALLAREA	G	G	FEMALE	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,020	1.34605	2,020	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.	65+	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,020	1.14139	2,020	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.	ALLAGE	URBAN	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,020	1.06405	2,020	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Survey name: EU-SILC. Uses an income aggregate, which is not spatially deflated.	ALLAGE	RURAL	G	G	BOTHSEX	PERCENT

We can see that there are different indicators according to age and sex. If we want to concentrate on time trends for a specific population, we should filter accordingly:

pov  |>  filter(GeoAreaName=="Spain",
                Age=="ALLAGE",
                Sex=="BOTHSEX",
                Location=="ALLAREA") |>
  flextable()

Goal	Target	Indicator	SeriesCode	SeriesDescription	GeoAreaCode	GeoAreaName	TimePeriod	Value	Time_Detail	BasePeriod	Source	FootNote	Age	Location	Nature	Reporting_Type	Sex	Units
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	1,980	0.5	1,980	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from HBS-LIS. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	1,985	0.2	1,985	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from HBS-LIS. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	1,990	0.5	1,990	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from HBS-LIS. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	1,995	1.2	1,995	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from ECHP-LIS. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,000	0.5	2,000	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from ECHP-LIS. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,003	0.8	2,003	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,004	0.8	2,004	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,005	0.7	2,005	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,006	0.8	2,006	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,007	0.4	2,007	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,008	0.5	2,008	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,009	0.8	2,009	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,010	0.7	2,010	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,011	1.1	2,011	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,012	0.9	2,012	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,013	1.1	2,013	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,014	0.8	2,014	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,015	0.9	2,015	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,016	1.1	2,016	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,017	0.7	2,017	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,018	0.8	2,018	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,019	0.8	2,019	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,020	0.9	2,020	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from EU-SILC. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT

There is one more useful function for selecting observations: slice. slice selects by sequence. For instance, if we want the first three data points for Spain we would do:

pov  |>  
  filter(GeoAreaName=="Spain") |> 
  slice(1:3) |> 
  flextable()

Goal	Target	Indicator	SeriesCode	SeriesDescription	GeoAreaCode	GeoAreaName	TimePeriod	Value	Time_Detail	BasePeriod	Source	FootNote	Age	Location	Nature	Reporting_Type	Sex	Units
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	1,980	0.5	1,980	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from HBS-LIS. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	1,985	0.2	1,985	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from HBS-LIS. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	1,990	0.5	1,990	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from HBS-LIS. Estimated from unit-record income data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT

We see that the proportions living in extreme poverty in Spain, while always low, are larger in the latest data points than in the 1980s.

Sometimes we are interested not in matching one value but a range of values. We already had created the objects my_countries and my_values. We can use them in the filtering in connection with the operator %in%:

pov |>  
  filter(GeoAreaName %in% my_countries,
         Location=="ALLAREA",Age=="ALLAGE",
         Sex=="BOTHSEX")  |>  
   flextable()

Goal	Target	Indicator	SeriesCode	SeriesDescription	GeoAreaCode	GeoAreaName	TimePeriod	Value	Time_Detail	BasePeriod	Source	FootNote	Age	Location	Nature	Reporting_Type	Sex	Units
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	24	Angola	2,000	21.4	2,000	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from HBS. Estimated from unit-record consumption data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	24	Angola	2,008	14.6	2,008	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from IBEP-MICS. Estimated from unit-record consumption data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	24	Angola	2,018	31.1	2,018	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from IDREA. Estimated from unit-record consumption data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	508	Mozambique	1,996	82.7	1,996	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from NHS. Estimated from unit-record consumption data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	508	Mozambique	2,002	80.6	2,002	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from IAF. Estimated from unit-record consumption data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	508	Mozambique	2,008	70.8	2,008	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from IOF. Estimated from unit-record consumption data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	508	Mozambique	2,014	64.6	2,014	2,017	Poverty and Inequality Portal, World Bank	Accessed March 21, 2023. Based on data from IOF. Estimated from unit-record consumption data.	ALLAGE	ALLAREA	G	G	BOTHSEX	PERCENT

⊕

select to select variables

The tables above are cumbersome because they display too much information that is not needed. We would like to focus on the variables of our interest. This can be done with **select**, providing the list of variables we want to select. We can provide them by name:

pov |>  
  filter(GeoAreaName %in% my_countries[1],
         Location=="ALLAREA",Age=="ALLAGE",
         Sex=="BOTHSEX")  |>  
  select(GeoAreaName,TimePeriod,Value) |> 
   flextable()

GeoAreaName	TimePeriod	Value
Angola	2,000	21.4
Angola	2,008	14.6
Angola	2,018	31.1

where we see that the proportions living in extreme poverty in Angola increased in 2018.

We can also use helper-functions so that we do not need to type all the variable names:

pov |> 
  filter(GeoAreaName%in% my_countries,
         Location=="ALLAREA",Age=="ALLAGE",
         Sex=="BOTHSEX"
         ) |>
  select(starts_with("GeoArea"),TimePeriod,Value) |> 
   flextable()

GeoAreaCode	GeoAreaName	TimePeriod	Value
24	Angola	2,000	21.4
24	Angola	2,008	14.6
24	Angola	2,018	31.1
508	Mozambique	1,996	82.7
508	Mozambique	2,002	80.6
508	Mozambique	2,008	70.8
508	Mozambique	2,014	64.6

We see the poverty situation is even worse in Mozambique. We also see the GeoAreaCode. This is a unique number identifier of the country or region. They are called ISO-CODES. ISO-CODES are useful for identification without having to write the complete name or caring about ortography. The problem with numeric ones is that the are difficult to retain. We will also use 3-letter codes that are easier to retain. You can find them in the link provided, there is also a package in R, countrycode, that contains all of them. To see the problems I comment, let us try to get the isocode data for some countries using the object codelist provided by countrycode.

library(countrycode)
codelist |> filter(country.name.en %in% 
              c(my_countries,"Sao Tome and Principe"))

## # A tibble: 2 × 621
##   ar5   cctld continent country.name.de country.name.de.regex country.name.en
##   <chr> <chr> <chr>     <chr>           <chr>                 <chr>          
## 1 MAF   .ao   Africa    Angola          angola                Angola         
## 2 MAF   .mz   Africa    Mosambik        mosambik              Mozambique     
## # ℹ 615 more variables: country.name.en.regex <chr>, country.name.fr <chr>,
## #   country.name.fr.regex <chr>, country.name.it <chr>,
## #   country.name.it.regex <chr>, cow.name <chr>, cowc <chr>, cown <dbl>,
## #   currency <chr>, dhs <chr>, ecb <chr>, eu28 <chr>, eurocontrol_pru <chr>,
## #   eurocontrol_statfor <chr>, eurostat <chr>, fao <dbl>, fips <chr>,
## #   gaul <dbl>, genc2c <chr>, genc3c <chr>, genc3n <chr>, gwc <chr>, gwn <dbl>,
## #   icao.region <chr>, imf <dbl>, ioc <chr>, iso.name.en <chr>, …

It finds our 2 countries, but it does not find a third former Colony of Portugal in Africa, Sao Tome and Principe, because the official name is “São Tomé & Príncipe”, despite those signs not being “English” letters. We could find it with:

codelist |> filter(country.name.en == "São Tomé & Príncipe")

## # A tibble: 1 × 621
##   ar5   cctld continent country.name.de    country.name.de.regex country.name.en
##   <chr> <chr> <chr>     <chr>              <chr>                 <chr>          
## 1 MAF   .st   Africa    Sao Tome und Prin… "\\bs(a|ã)o.?tom(e|é… São Tomé & Prí…
## # ℹ 615 more variables: country.name.en.regex <chr>, country.name.fr <chr>,
## #   country.name.fr.regex <chr>, country.name.it <chr>,
## #   country.name.it.regex <chr>, cow.name <chr>, cowc <chr>, cown <dbl>,
## #   currency <chr>, dhs <chr>, ecb <chr>, eu28 <chr>, eurocontrol_pru <chr>,
## #   eurocontrol_statfor <chr>, eurostat <chr>, fao <dbl>, fips <chr>,
## #   gaul <dbl>, genc2c <chr>, genc3c <chr>, genc3n <chr>, gwc <chr>, gwn <dbl>,
## #   icao.region <chr>, imf <dbl>, ioc <chr>, iso.name.en <chr>, …

but, how do we know? It is better to use codes.

Ordering data, arrange

Our data is currently organized according to country. If we want it to be organized according to another combination of variables we use arrange. For instance, if we want to order the data for our countries first in descending order of Value we use

pov |> 
  filter(GeoAreaName%in%my_countries,
         Location=="ALLAREA",Age=="ALLAGE",
         Sex=="BOTHSEX")  |>  
  select(starts_with("GeoArea"),TimePeriod,Value) |>
  arrange(-Value) |>  flextable()

GeoAreaCode	GeoAreaName	TimePeriod	Value
508	Mozambique	1,996	82.7
508	Mozambique	2,002	80.6
508	Mozambique	2,008	70.8
508	Mozambique	2,014	64.6
24	Angola	2,018	31.1
24	Angola	2,000	21.4
24	Angola	2,008	14.6

We see the worse problems happen in Mozambique. Using -Value means in descending order for numeric variables. For reverse alphabetical order we would need desc(GeoAreaName). We can combine it with other criteria that are applied in case of ties:

pov |> 
  filter(GeoAreaName%in%my_countries,
         Location=="ALLAREA",Age=="ALLAGE",
         Sex=="BOTHSEX",TimePeriod>=2000)  |>  
  select(starts_with("GeoArea"),TimePeriod,Value) |>
  arrange(desc(GeoAreaName),TimePeriod)

## # A tibble: 6 × 4
##   GeoAreaCode GeoAreaName TimePeriod Value
##         <dbl> <chr>            <dbl> <dbl>
## 1         508 Mozambique        2002  80.6
## 2         508 Mozambique        2008  70.8
## 3         508 Mozambique        2014  64.6
## 4          24 Angola            2000  21.4
## 5          24 Angola            2008  14.6
## 6          24 Angola            2018  31.1

If we only provide the name of a variable, it is ordered in natural order. We can add several conditions separated by commas. For instance, if we want to see the 5 countries with the highest poverty rates for all the population in the latest available year(s) we would do:

pov |>
  filter(Location=="ALLAREA",Age=="ALLAGE",
         Sex=="BOTHSEX") |>
  select(starts_with("GeoArea"),TimePeriod,Value) |>
  arrange(-TimePeriod,-Value) |> 
  slice(1:5) |> flextable()

GeoAreaCode	GeoAreaName	TimePeriod	Value
360	Indonesia	2,022	2.5
170	Colombia	2,021	6.6
76	Brazil	2,021	5.8
268	Georgia	2,021	5.5
19	Americas	2,021	4.7

We see that the latest data corresponds to 2022, and among the data for that year the highest corresponds to Indonesia. However, we found that our two countries, for which there is no data available after 2018, had higher poverty rates. Note that we have to be careful about how to analyze the data. We will deal with this by choosing for each country its latest data.

⊕

Grouping data for analysis

Now we are getting to one of the key ideas in data management: In order to be sure not to miss countries, we would like to keep for each country the latest observation. This is not the same as filtering the data for 2020 or 2021 that does not contain all the countries. How to do it? In the tidyverse we define groups with group_by. After defining groups, the operations we do are defined at the group level, as if we were working with separate small data tables. For instance, to get the latest poverty data for the complete population we would do

latest = pov |> 
  filter(Location=="ALLAREA",Age=="ALLAGE",
         Sex=="BOTHSEX") |>
  group_by(GeoAreaName) |> 
  arrange(-TimePeriod) |> 
  slice(1) |> 
  ungroup()

This creates a new data object called latest containing only our data of interest. We had to use ungroup() to tell the tidyverse that, from now onwards, we don’t want to analyze by country, but data for all countries. To keep the countries with the worse poverty we would do

latest |> select(starts_with("GeoArea"),TimePeriod,Value) |>
  arrange(-Value) |> 
  slice(1:10) |> flextable()

GeoAreaCode	GeoAreaName	TimePeriod	Value
860	Uzbekistan	2,003	81.5
450	Madagascar	2,012	80.7
454	Malawi	2,019	70.1
180	Democratic Republic of the Congo	2,012	69.7
728	South Sudan	2,016	67.3
108	Burundi	2,013	65.1
508	Mozambique	2,014	64.6
140	Central African Republic	2,008	61.9
894	Zambia	2,015	61.4
646	Rwanda	2,016	52.0

Note that we could have done all the analysis in one-step, but sometimes it is useful to retain objects, such as latest, for future analysis. Now the picture is quite different! Note that many of the countries with the highest values only have relatively old data. This might not be by chance: poverty surveys are expensive and technically complex, and the poorest countries do not carry them out often. Note also that the vast majority of countries belong to sub-Saharan Africa, the region of the world most problematic in terms of SDG achievement.

⊕

Joining different datasets: `left_join`

Note that the geographic information in this dataset is rather sparse: we only know the numeric code and the country name but we do not know in what country or region the country lies, or its 3-letter isocodes. We have that information in a separate object, however: codelist. How can we obtain that information from codelist and bring it to pov: by joining datasets. This is another key concept in data handling.

The key ideas for joining two datasets are the following:

They must have at least one variable in common.
The variables in common must have the same name and the same type of data.
If they do not, we should change the names or the type of data before joining.

Joining can be done according to the combination of variables specified in the by argument. If left unspecified, all the variables with the same names are tried (although there might be errors if the conditions set above are not met).

In our case, we want to use the numeric ISO code to do the join. In pov it is called GeoAreaCode and it is a numeric variable. the case of codelist, there is a variable called iso3n that has the country code and it is also numeric (You can inspect the object if you need so with codelist |> View()). In order to do the join, we need to rename the variable iso3n as GeoAreaCode prior to joining. We can do it with a special function, rename, or else directly within the select argument providing a name. If we do not want to bring all the variables in codelist to pov we could also select them prior to joining. For this time, we use all variables.

pov2 = pov |> 
  left_join(codelist |> rename(GeoAreaCode=iso3n))

We can inspect the, widened, data object with pov2 |> View. It has probably TOO MANY variables. If we want only to add the variables we are interested in we can select them prior to joining. For instance,

latest2 = latest |> 
  left_join(codelist |>
              select(GeoAreaCode=iso3n, un.name.es,
                     continent, region, iso3c),
            by="GeoAreaCode")

Now we have possibilities for further analysis. For instance, let us choose the country in each region with the highest poverty rate. That would require grouping by region.

latest2 |> group_by(region) |> 
  slice_max(Value) |> 
  select(region,iso3c,GeoAreaName,Value) |> 
  arrange(-Value) |> flextable()

region	iso3c	GeoAreaName	Value
Europe & Central Asia	UZB	Uzbekistan	81.5
Sub-Saharan Africa	MDG	Madagascar	80.7
		Middle Africa	45.8
East Asia & Pacific	PNG	Papua New Guinea	39.7
Latin America & Caribbean	HTI	Haiti	29.2
Middle East & North Africa	YEM	Yemen	19.8
South Asia	BGD	Bangladesh	13.5
North America	CAN	Canada	0.2
North America	USA	United States of America	0.2

We see here how Europe & Cental Asia and sub-Saharan Africa are on top, and how in many regions of the world, including the most populated, there is no single country with poverty rates exceeding 30%. Note also that the 3-letter isocodes are much more intuitive.

Missing data: NA

One more thing to note is that there is a “strange” data with no region name for Middle Africa. Why is this? The pov object does not only contain rates for countries. It also includes country aggregates such as middle Africa. These aggregates do not have a regional code or an iso3c code. If we look at the latest2 data object we will see that such data is shown in red and displayed as “NA”, which means not available.

Very often we do not want to keep the data that has NA in some dimensions. This can be filtered out using the is.na() function, or its negation !is.na() which keeps values different from NA. is.na() is an example of a logical function. It can have two values: TRUE or FALSE (also NA is a possibility). Let us redefine the previous table removing the NAs and adding a table caption:

latest2 |> 
  filter(!is.na(iso3c)) |> 
  group_by(region) |> 
  slice_max(Value) |> 
  select(region,iso3c,GeoAreaName,TimePeriod,Value) |> 
  arrange(-Value) |> 
  flextable() |> 
  set_caption ("Country by region with the highest latest poverty rate (%)")

Country by region with the highest latest poverty rate (%)
region	iso3c	GeoAreaName	TimePeriod	Value
Europe & Central Asia	UZB	Uzbekistan	2,003	81.5
Sub-Saharan Africa	MDG	Madagascar	2,012	80.7
East Asia & Pacific	PNG	Papua New Guinea	2,009	39.7
Latin America & Caribbean	HTI	Haiti	2,012	29.2
Middle East & North Africa	YEM	Yemen	2,014	19.8
South Asia	BGD	Bangladesh	2,016	13.5
North America	CAN	Canada	2,018	0.2
North America	USA	United States of America	2,020	0.2

What is left to do?

We have taken a first look at how to work with table data in R. We have covered a lot of the basics we need.

Next week we will learn some other things such as:

How can we define summaries, for instance, by region. We might be interested in the mean poverty rate by region, not in the maximum or minimum.
How to do graphs and maps with data such as this.

Monitoring global indicators with R and the tidyverse

Global Development Indicators, Global Studies

José Antonio Ortega

13 feb., 2024

Introduction to R

Installing

Objects, functions

Data objects: tibbles and data.frames

Tidyverse: manipulation of data objects

The tidyverse and the pipes, %>% |>

filter and slice for selecting rows

select to select variables

Ordering data, arrange

Grouping data for analysis

Joining different datasets: `left_join`

Missing data: NA

What is left to do?

Monitoring global indicators with R and the tidyverse

Global Development Indicators, Global Studies

José Antonio Ortega

13 feb., 2024

Introduction to R

Installing

Objects, functions

Data objects: tibbles and data.frames

Tidyverse: manipulation of data objects

The tidyverse and the pipes, %>% |>

filter and slice for selecting rows

select to select variables

Ordering data, arrange

Grouping data for analysis

Joining different datasets: left_join

Missing data: NA

What is left to do?

Joining different datasets: `left_join`