Introduction to R

We are going to use R for monitoring global development indicators throughout the course. We do not need to become experts in R, but a few key ideas will make it easier for us to proceed. You can check the references provided in Studium, Timbers, Campbell, and Lee (⊕2022Timbers, Tiffany-Anne, Trevor Campbell, and Melissa Lee. 2022. Data Science : An Introduction. S.l: CRC Press. https://datasciencebook.ca/.), Larsen and Fazekas (⊕2021Larsen, Erik Gahner, and Zoltán Fazekas. 2021. Quantitative Politics with R. http://qpolr.com/.), Wickham and Grolemund (⊕2022Wickham, Hadley, and Garrett Grolemund. 2022. R for Data Science. O’Reilly Media, Inc. https://r4ds.had.co.nz/.). We will also use figures from Ismay and Kim (⊕2022Ismay, Chester, and Albert Y. Kim. 2022. ModernDive: An Introduction to Statistical and Data Sciences via r. moderndive.com. https://moderndive.com/.). Note that references are listed in the margin.

Installing

To install the latest versions of R and RStudio you can check Larsen and Fazekas (2021). To install the packages we will use for this course, you can run this code in the console when you open RStudio:

install.packages("tidyverse",dependencies=TRUE)
install.packages("remotes")
remotes::install_github("caldwellst/goalie")
install.packages("wbstats")
install.packages("plotly")
install.packages("countrycode")
install.packages("flextable")
install.packages("tufte")

Objects, functions

The difficult part of R is that it is a language, not menu-driven. It therefore expects you to make an order with the > sign. You can do, for instance, arithmetic operations or text operations

2 + 2

## [1] 4

paste("My name is","R.","Pleased to meet you!")

## [1] "My name is R. Pleased to meet you!"

R works with objects. The three kinds of objects we will use most are elementary objects, data objects and functions. Objects have a name that can be assigned either through the = or the <- operators. An elementary object is a collection of elements of the same type (numeric, character, …). Let us define to elementary objects we will use later:

my_countries=c("Angola","Mozambique","Guinea-Bissau")
my_years <- c(1980,2000,2011:2020)

⊕Assignment operators: = , <-

When you assign a name to an object it becomes available in the environment. You can check it in the RStudio window with that name.

The second type of objects are functions. Functions are equivalent to verbs, they do something with the objects available or supplied. We have seen several functions already install.packages, which says what it does, paste, also intuitive, and c, from collection, that puts together differents elements in an object. You can get help regarding what R functions do in the help window or typing ? paste, for instance.

⊕function(arg1=default1,arg2=default2, …)

Functions generally require arguments, that qualify the action required. The help function provides details regarding the arguments. Also, RStudio reminds you the available arguments when you start writing the name of a function. In order for R to know the values of arguments, you can either write the name of the argument followed by the value, or rely on the order of the argument. For instance, paste has an argument sep. You can set it like this

paste(1:length(my_countries),my_countries,sep=": ")

## [1] "1: Angola"        "2: Mozambique"    "3: Guinea-Bissau"

If we want to select parts of an elementary object we do it with brackets:

my_countries[1:2]

## [1] "Angola"     "Mozambique"

Data objects: tibbles and data.frames

⊕

For monitoring indicators the key type of object is a data object. A data object is similar to a spreadsheet having columns with different variables that take values of the same type, and rows having observations. Such data is called to be in tidy form.

Data can be obtained from a file in the proper format or through a package downloading data from a website or a server. To see how easy it is to monitor data, we will use the {goalie} package (⊕Caldwell 2022Caldwell, Seth. 2022. Goalie: R Client for the Sustainable Development Goals API. https://gpw13.github.io/goalie/.) that downloads official SDG indicators data from https://unstats.un.org/sdgs/indicators/database/. Before proceeding we need to make the relevant R packages available to our session with library.

library(goalie)
library(tidyverse)
pov=sdg_data("SI_POV_DAY1")

We can check the first lines of pov with pov.

pov

## # A tibble: 3,046 x 23
##     Goal Target Indicator SeriesCode  SeriesDescription  GeoAreaCode GeoAreaName
##    <dbl>  <dbl> <chr>     <chr>       <chr>                    <dbl> <chr>      
##  1     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop~           1 World      
##  2     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop~           1 World      
##  3     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop~           1 World      
##  4     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop~           1 World      
##  5     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop~           1 World      
##  6     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop~           1 World      
##  7     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop~           1 World      
##  8     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop~           1 World      
##  9     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop~           1 World      
## 10     1    1.1 1.1.1     SI_POV_DAY1 Proportion of pop~           1 World      
## # ... with 3,036 more rows, and 16 more variables: TimePeriod <dbl>,
## #   Value <dbl>, Time_Detail <dbl>, TimeCoverage <lgl>, UpperBound <lgl>,
## #   LowerBound <lgl>, BasePeriod <lgl>, Source <chr>, GeoInfoUrl <lgl>,
## #   FootNote <chr>, Age <lgl>, Location <lgl>, Nature <chr>,
## #   Reporting_Type <chr>, Sex <lgl>, Units <chr>

As you see it is a rather large table, made up of 3046 observations and 23 variables. In order to inspect it, you can click on the object name in the Global environment window. We can also do things, similar to a spreadsheet, in the viewer pane that opened, with the difference that we will not make changes by error as in a spreadsheet. We can see that the data corresponds to the SDG indicator Proportion of population below international poverty line (%) that corresponds to SDG goal 1 and target 1.1. The current povery line is $1.90 a day at 2011 international prices. The main column containing the data is a variable called Value, and the data observations are identified by a combination of area (identified either by GeoAreaCode or GeoAreaName) and year (TimePeriod). The indicator code, SI_POV_DAY1, serves to identify the data. As you see, it is all we had to supply the function sdg_data() so that it could query the database.

Next we will learn elementary operations to gather the information we want from a data object.

Tidyverse: manipulation of data objects

The tidyverse and the pipe, %>% º

⊕

The so-called tidyverse is a collection of packages (including tidyr, dplyr, ggplot2) that facilitate our work with data objects. We will be working with data objects all the time. Let us learn how to select our data of interest and make some tables and plots with it.

Before doing so, let me present you with our friend, the pipe: %>%. You can write %>% in RStudio with CTRL+SHIFT+M. The pipe (tuberia), also known as the then operator, makes it easy to instruct R to do things in a sequence. When you write something with the pipe such as

wakeup() %>% wash_teeth(tootbrush,toothpaste) %>% 
drink_coffee() %>% go_to_work(by=bus)

you are using code that is as close as possible to natural language. The idea is that the result of the previous statement is used as the argument for the next function in the pipe. All the functions in the tidyverse work that way.

⊕

filter and slice for selecting rows

In such a big object a first thing is to select parts of the data. If we are interested only in some of the observations, say, those from Spain from dates after 2015, we can do it as follows

pov %>% filter(GeoAreaName=="Spain",TimePeriod>2015)

## # A tibble: 3 x 23
##    Goal Target Indicator SeriesCode  SeriesDescription   GeoAreaCode GeoAreaName
##   <dbl>  <dbl> <chr>     <chr>       <chr>                     <dbl> <chr>      
## 1     1    1.1 1.1.1     SI_POV_DAY1 Proportion of popu~         724 Spain      
## 2     1    1.1 1.1.1     SI_POV_DAY1 Proportion of popu~         724 Spain      
## 3     1    1.1 1.1.1     SI_POV_DAY1 Proportion of popu~         724 Spain      
## # ... with 16 more variables: TimePeriod <dbl>, Value <dbl>, Time_Detail <dbl>,
## #   TimeCoverage <lgl>, UpperBound <lgl>, LowerBound <lgl>, BasePeriod <lgl>,
## #   Source <chr>, GeoInfoUrl <lgl>, FootNote <chr>, Age <lgl>, Location <lgl>,
## #   Nature <chr>, Reporting_Type <chr>, Sex <lgl>, Units <chr>

It is important to use ==,two equals. This is the comparison operator as different from the assignment operator.

A problem is that R does not fit all the variables and we don’t see what we get. We will learn later how to select variables. We can also use a nice package for sophisticated tables, flextable. Making a nice table is as simple as adding flextable() to the pipe flow.

library(flextable)
pov %>% filter(GeoAreaName=="Spain",TimePeriod>2015) %>% flextable()

There is one more useful function for selecting observations: slice. slice selects by sequence. For instance, if we want the first three data points for Spain we would do:

pov %>% 
  filter(GeoAreaName=="Spain") %>% 
  slice(1:3) %>% 
  flextable()

We see that the proportions living in extreme poverty in Spain, while always low, are larger in the latest data points than in the 1980s.

Sometimes we are interested not in matching one value but a range of values. We already had created the objects my_countries and my_values. We can use them in the filtering in connection with the operator %in%:

pov %>% 
  filter(GeoAreaName %in% my_countries) %>% 
   flextable()

⊕

select to select variables

The tables above are cumbersome because they display too much information that is not needed. We would like to focus on the variables of our interest. This can be done with **select**, providing the list of variables we want to select. We can provide them by name:

pov %>% 
  filter(GeoAreaName== my_countries[1]) %>% 
  select(GeoAreaName,TimePeriod,Value) %>% 
   flextable()

GeoAreaName	TimePeriod	Value
Angola	2,000	36.4
Angola	2,008	34.4
Angola	2,018	49.9

where we see that the proportions living in extreme poverty in Angola increased in 2018.

We can also use helper-functions so that we do not need to type all the variable names:

pov %>% 
  filter(GeoAreaName%in% my_countries,
         TimePeriod>=2010) %>% 
  select(starts_with("GeoArea"),TimePeriod,Value) %>% 
   flextable()

GeoAreaCode	GeoAreaName	TimePeriod	Value
24	Angola	2,018	49.9
508	Mozambique	2,014	63.7
624	Guinea-Bissau	2,010	68.4

We see the poverty situation is even worse in Mozambique and Guinea-Bissau. We also see the GeoAreaCode. This is a unique number identifier of the country or region. They are called ISO-CODES. ISO-CODES are useful for identification without having to write the complete name or caring about ortography. The problem with numeric ones is that the are difficult to retain. We will also use 3-letter codes that are easier to retain. You can find them in the link provided, there is also a package in R, countrycode, that contains all of them. To see the problems I comment, let us try to get the isocode data for some countries using the object codelist provided by countrycode.

library(countrycode)
codelist %>% filter(country.name.en %in% 
              c(my_countries,"Sao Tome and Principe"))

## # A tibble: 3 x 743
##   ar5   cctld continent country.name.de country.name.de.regex    country.name.en
##   <chr> <chr> <chr>     <chr>           <chr>                    <chr>          
## 1 MAF   .ao   Africa    Angola          angola                   Angola         
## 2 MAF   .gw   Africa    Guinea-Bissau   bissau|^(?=.*portu).*gu~ Guinea-Bissau  
## 3 MAF   .mz   Africa    Mosambik        mosambik                 Mozambique     
## # ... with 737 more variables: country.name.en.regex <chr>, cow.name <chr>,
## #   cowc <chr>, cown <dbl>, currency <chr>, dhs <chr>, ecb <chr>, eu28 <chr>,
## #   eurocontrol_pru <chr>, eurocontrol_statfor <chr>, eurostat <chr>,
## #   fao <dbl>, fips <chr>, gaul <dbl>, genc2c <chr>, genc3c <chr>,
## #   genc3n <chr>, gwc <chr>, gwn <dbl>, icao.region <chr>, imf <dbl>,
## #   ioc <chr>, iso.name.en <chr>, iso.name.fr <chr>, iso2c <chr>, iso3c <chr>,
## #   iso3n <dbl>, iso4217c <chr>, iso4217n <dbl>, p4.name <chr>, p4c <chr>, ...

It finds our 3 countries, but it does not find a fourth former Colony of Portugal in Africa, Sao Tome and Principe, because the official name is “São Tomé & Príncipe”, despite those signs not being “English” letters. We could find it with:

codelist %>% filter(country.name.en == "São Tomé & Príncipe")

## # A tibble: 1 x 743
##   ar5   cctld continent country.name.de       country.name.de.r~ country.name.en
##   <chr> <chr> <chr>     <chr>                 <chr>              <chr>          
## 1 MAF   .st   Africa    Sao Tome und Principe "\\bs(a|ã)o.?tom(~ São Tomé & Prí~
## # ... with 737 more variables: country.name.en.regex <chr>, cow.name <chr>,
## #   cowc <chr>, cown <dbl>, currency <chr>, dhs <chr>, ecb <chr>, eu28 <chr>,
## #   eurocontrol_pru <chr>, eurocontrol_statfor <chr>, eurostat <chr>,
## #   fao <dbl>, fips <chr>, gaul <dbl>, genc2c <chr>, genc3c <chr>,
## #   genc3n <chr>, gwc <chr>, gwn <dbl>, icao.region <chr>, imf <dbl>,
## #   ioc <chr>, iso.name.en <chr>, iso.name.fr <chr>, iso2c <chr>, iso3c <chr>,
## #   iso3n <dbl>, iso4217c <chr>, iso4217n <dbl>, p4.name <chr>, p4c <chr>, ...

but, how do we know? It is better to use codes.

Ordering data, arrange

Our data is currently organized according to country. If we want it to be organized according to another combination of variables we use arrange. For instance, if we want to order the data for our countries first in descending order of Value we use

pov %>% 
  filter(GeoAreaName%in% my_countries) %>% 
  select(starts_with("GeoArea"),TimePeriod,Value) %>%
  arrange(-Value) %>%  flextable()

GeoAreaCode	GeoAreaName	TimePeriod	Value
508	Mozambique	1,996	82.0
508	Mozambique	2,002	79.8
508	Mozambique	2,008	69.7
624	Guinea-Bissau	2,010	68.4
508	Mozambique	2,014	63.7
624	Guinea-Bissau	2,002	56.6
24	Angola	2,018	49.9
624	Guinea-Bissau	1,991	44.3
624	Guinea-Bissau	1,993	44.1
24	Angola	2,000	36.4
24	Angola	2,008	34.4

We see the worse problems happen in Mozambique. Using -Value means in descending order for numeric variables. For reverse alphabetical order we would need desc(GeoAreaName):

pov %>% 
  filter(GeoAreaName%in% my_countries,
         TimePeriod>=2010) %>%
  select(starts_with("GeoArea"),TimePeriod,Value) %>%
  arrange(desc(GeoAreaName))

## # A tibble: 3 x 4
##   GeoAreaCode GeoAreaName   TimePeriod Value
##         <dbl> <chr>              <dbl> <dbl>
## 1         508 Mozambique          2014  63.7
## 2         624 Guinea-Bissau       2010  68.4
## 3          24 Angola              2018  49.9

If we only provide the name of a variable, it is ordered in natural order. We can add several conditions separated by commas. For instance, if we want to see the 5 countries with the highest povery rates in the latest year available we would do:

pov %>% 
  select(starts_with("GeoArea"),TimePeriod,Value) %>%
  arrange(-TimePeriod,-Value) %>% 
  slice(1:5) %>% flextable()

GeoAreaCode	GeoAreaName	TimePeriod	Value
716	Zimbabwe	2,019	39.5
11	Western Africa	2,019	34.5
340	Honduras	2,019	14.8
515	Developing regions	2,019	11.1
170	Colombia	2,019	4.9

We see that the latest data corresponds to 2019, and among the data for that year the highest corresponds to Zimbabwe. However, we found that our three countries, for which there is no data available in 2019, had higher poverty rates. Note that we have to be careful about how to analyze the data.

⊕

Grouping data for analysis

Now we are getting to one of the key ideas in data management: In order to be sure not to miss countries, we would like to keep for each country the latest observation. This is not the same as filtering the data for 2019 that does not contain all the countries. How to do it? In the tidyverse we define groups with group_by. After defining groups, the operations we do are defined at the group level, as if there were many small data tables. To get the latest data we would do

latest = pov %>% 
  group_by(GeoAreaName) %>% 
  arrange(-TimePeriod) %>% 
  slice(1) %>% 
  ungroup()

We had to use ungroup() to tell the tidyverse that, from now onwards, we don’t want to analyze by country, but data for all countries. To keep the countries with the worse data we would do

latest %>% select(starts_with("GeoArea"),TimePeriod,Value) %>%
  arrange(-Value) %>% 
  slice(1:10) %>% flextable()

GeoAreaCode	GeoAreaName	TimePeriod	Value
450	Madagascar	2,012	78.8
180	Democratic Republic of the Congo	2,012	77.2
728	South Sudan	2,016	76.4
108	Burundi	2,013	72.8
454	Malawi	2,016	69.2
706	Somalia	2,017	68.6
624	Guinea-Bissau	2,010	68.4
140	Central African Republic	2,008	65.9
508	Mozambique	2,014	63.7
860	Uzbekistan	2,003	61.6

Note we could have done all the analysis in one-step, but sometimes it is useful to retain objects, such as latest, for future analysis. Now the picture is quite different! Note that many of the countries with the highest values only have relatively old data. This might not be by chance: poverty surveys are expensive and technically complex and the poorest countries do not carry them out often. Note also that all the top countries belong to sub-Saharan Africa, the region of the world most problematic in terms of SDG achievement.

⊕

Joining different datasets: `left_join`

Note that the geographic information in this dataset is rather sparse: we only know the numeric code and the country name but we do not know in what country or region the country lies, or its 3-letter isocodes. We have that information in a separate object, however: codelist. How can we obtain that information from codelist and bring it to pov: by joining datasets. This is another key concept in data handling.

The key ideas for joining two datasets are the following:

They must have at least one variable in common.
The variables in common must have the same name and the same type of data.
If they do not, we should change the names or the type of data before joining.

Joining can be done according to the combination of variables specified in the by argument. If left unspecified, all the variables with the same names are tried (although there might be errors if the conditions set above are not met).

In our case, we want to use the numeric ISO code to do the join. In pov it is called GeoAreaCode and it is a numeric variable. the case of codelist, there is a variable called iso3n that has the country code and it is also numeric (You can inspect the object if you need so with codelist %>% View). In order to do the join, we need to rename the variable iso3n as GeoAreaCode prior to joining. We can do it with a special function, rename, or else directly within the select argument providing a name. If we do not want to bring all the variables in codelist to pov we could also select them prior to joining. For this time, we use all variables.

pov2 = pov %>% 
  left_join(codelist %>% rename(GeoAreaCode=iso3n))

We can inspect the, widened, data object with pov2 %>% View. It has probably TOO MANY variables. If we want only to add the variables we are interested in we can select them prior to joining. For instance,

latest2 = latest %>% 
  left_join(codelist %>%
              select(GeoAreaCode=iso3n, un.name.es,
                     continent, region, iso3c),
            by="GeoAreaCode")

Now we have possibilities for further analysis. For instance, let us choose the country in each region with the highest poverty rate. That would require grouping by region.

latest2 %>% group_by(region) %>% 
  slice_max(Value) %>% 
  select(region,iso3c,GeoAreaName,Value) %>% 
  arrange(-Value)

## # A tibble: 8 x 4
## # Groups:   region [8]
##   region                     iso3c GeoAreaName              Value
##   <chr>                      <chr> <chr>                    <dbl>
## 1 Sub-Saharan Africa         MDG   Madagascar                78.8
## 2 Europe & Central Asia      UZB   Uzbekistan                61.6
## 3 <NA>                       <NA>  Middle Africa             53.8
## 4 East Asia & Pacific        PNG   Papua New Guinea          38  
## 5 Latin America & Caribbean  HTI   Haiti                     24.5
## 6 South Asia                 IND   India                     22.5
## 7 Middle East & North Africa YEM   Yemen                     18.3
## 8 North America              USA   United States of America   1

We see here how sub-Saharan Africa is on top, and how in many regions of the world, including the most populated, there is no single country with poverty rates exceeding 25%. Note also that the 3-letter isocodes are much more intuitive.

Missing data: NA

One more thing to note is that there is a “strange” data with no region name for Middle Africa. Why is this? The pov object does not only contain rates for countries. Also for country aggregates such as middle Africa. These aggregates do not have a regional code or an iso3c code. If we look at the latest2 data object we will see that such data is shown in red and displayed as “NA”, which means not available.

Very often we do not want to keep the data that has NA in some dimensions. This can be filtered out using the is.na() function, or its negation !is.na() which keeps values different from NA. is.na() is an example of a logical function. It can have two value: TRUE or FALSE (also NA is a possibility). Let us redefine the previous table removing the NAs and adding a table caption:

latest2 %>% 
  filter(!is.na(iso3c)) %>% 
  group_by(region) %>% 
  slice_max(Value) %>% 
  select(region,iso3c,GeoAreaName,TimePeriod,Value) %>% 
  arrange(-Value) %>% 
  flextable() %>% 
  set_caption ("Country by region with the highest latest poverty rate (%)")

Country by region with the highest latest poverty rate (%)
region	iso3c	GeoAreaName	TimePeriod	Value
Sub-Saharan Africa	MDG	Madagascar	2,012	78.8
Europe & Central Asia	UZB	Uzbekistan	2,003	61.6
East Asia & Pacific	PNG	Papua New Guinea	2,009	38.0
Latin America & Caribbean	HTI	Haiti	2,012	24.5
South Asia	IND	India	2,011	22.5
Middle East & North Africa	YEM	Yemen	2,014	18.3
North America	USA	United States of America	2,018	1.0

What is left to do?

We have taken a first look at how to work with table data in R. We have covered a lot of the basics we need.

Next week we will learn some other things such as:

How can we define summaries, for instance, by region. We might be interested in the mean poverty rate by region, not in the maximum or minimum.
How to do graphs and maps with data such as this one.

Monitoring global indicators with R and the tidyverse

Global Development Indicators, Global Studies

José Antonio Ortega

15 feb., 2022

Introduction to R

Installing

Objects, functions

Data objects: tibbles and data.frames

Tidyverse: manipulation of data objects

The tidyverse and the pipe, %>% º

filter and slice for selecting rows

select to select variables

Ordering data, arrange

Grouping data for analysis

Joining different datasets: `left_join`

Missing data: NA

What is left to do?

Goal	Target	Indicator	SeriesCode	SeriesDescription	GeoAreaCode	GeoAreaName	TimePeriod	Value	Time_Detail	TimeCoverage	UpperBound	LowerBound	BasePeriod	Source	GeoInfoUrl	FootNote	Age	Location	Nature	Reporting_Type	Sex	Units
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,016	1.2	2,016					World Bank, Development Research Group. Data are based on primary household survey data obtained from government statistical agencies and World Bank country departments. Data for high-income economies are from the Luxembourg Income Study database. For more information and methodology, please see PovcalNet (http://iresearch.worldbank.org/PovcalNet/index.htm).		Based on data from EU-SILC. Estimated from unit-record income data.			E	G		PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,017	0.7	2,017					World Bank, Development Research Group. Data are based on primary household survey data obtained from government statistical agencies and World Bank country departments. Data for high-income economies are from the Luxembourg Income Study database. For more information and methodology, please see PovcalNet (http://iresearch.worldbank.org/PovcalNet/index.htm).		Based on data from EU-SILC. Estimated from unit-record income data.			E	G		PERCENT
1	1.1	1.1.1	SI_POV_DAY1	Proportion of population below international poverty line (%)	724	Spain	2,018	0.9	2,018					World Bank, Development Research Group. Data are based on primary household survey data obtained from government statistical agencies and World Bank country departments. Data for high-income economies are from the Luxembourg Income Study database. For more information and methodology, please see PovcalNet (http://iresearch.worldbank.org/PovcalNet/index.htm).		Based on data from EU-SILC. Estimated from unit-record income data.			E	G		PERCENT

Monitoring global indicators with R and the tidyverse

Global Development Indicators, Global Studies

José Antonio Ortega

15 feb., 2022

Introduction to R

Installing

Objects, functions

Data objects: tibbles and data.frames

Tidyverse: manipulation of data objects

The tidyverse and the pipe, ** %>% **º

filter and slice for selecting rows

select to select variables

Ordering data, arrange

Grouping data for analysis

Joining different datasets: left_join

Missing data: NA

What is left to do?

The tidyverse and the pipe, %>% º

Joining different datasets: `left_join`