The other day we studied basic R and some of the basic functions of the tidyverse to work with data objects: filter and slice to get rows, select to choose variables, arrange to order rows, and, most important: group_by to define groups in the data so that operations take place at the group level. We also saw left_join to put together two data sets using a common variable (by argument).
Today we will get data for global monitoring looks from the World Development Indicators database using wbstats, and learn the few basic tools we did not see yet:
rename to change variable names, mutate to create new columns through calculation.
summarize to create a new data set of summary indicators where the new observations (rows) are the groups
You can again check the references provided in Studium, Timbers, Campbell, and Lee (2023Timbers, Tiffany-Anne, Trevor Campbell, and Melissa Lee. 2023. Data Science : An Introduction. S.l: CRC Press. https://datasciencebook.ca/.), Larsen and Fazekas (2021Larsen, Erik Gahner, and Zoltán Fazekas. 2021. Quantitative Politics with R. https://github.com/erikgahner/qpolr/raw/master/book/_book/qpolr.pdf.), Wickham and Grolemund (2022Wickham, Hadley, and Garrett Grolemund. 2022. R for Data Science. O’Reilly Media, Inc. https://r4ds.had.co.nz/.). We will also use figures from Ismay and Kim (2024Ismay, Chester, and Albert Y. Kim. 2024. ModernDive: An Introduction to Statistical and Data Sciences via r. moderndive.com. https://moderndive.com/.). Note that references are listed in the margin.
We will use today tmap to draw maps. If you have not installed it yet, please do so now:
install.packages("tmap")
It is very simple to use wbstats if you know the names of the indicators from the World Development Indicators database. Today we download the 3 indicators used in gapminder: life expectancy, GDP per capita, and the total population. We also get the database of countries that we can look at.
my_indicators <- c(
life_expectancy = "SP.DYN.LE00.IN",
gdp_capita ="NY.GDP.PCAP.CD",
pop = "SP.POP.TOTL"
)
wbind <- wb_data(my_indicators, start_date = 2021)
countries <- wb_countries()
This is the structure of the two data sets.
str(wbind)
## tibble [217 × 7] (S3: tbl_df/tbl/data.frame)
## $ iso2c : chr [1:217] "AW" "AF" "AO" "AL" ...
## $ iso3c : chr [1:217] "ABW" "AFG" "AGO" "ALB" ...
## $ country : chr [1:217] "Aruba" "Afghanistan" "Angola" "Albania" ...
## $ date : num [1:217] 2021 2021 2021 2021 2021 ...
## $ gdp_capita : num [1:217] 29128 356 1927 6377 42072 ...
## ..- attr(*, "label")= chr "GDP per capita (current US$)"
## $ life_expectancy: num [1:217] 74.6 62 61.6 76.5 NA ...
## ..- attr(*, "label")= chr "Life expectancy at birth, total (years)"
## $ pop : num [1:217] 106537 40099462 34503774 2811666 79034 ...
## ..- attr(*, "label")= chr "Population, total"
str(countries)
## tibble [296 × 18] (S3: tbl_df/tbl/data.frame)
## $ iso3c : chr [1:296] "ABW" "AFE" "AFG" "AFR" ...
## $ iso2c : chr [1:296] "AW" "ZH" "AF" "A9" ...
## $ country : chr [1:296] "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa" ...
## $ capital_city : chr [1:296] "Oranjestad" NA "Kabul" NA ...
## $ longitude : num [1:296] -70 NA 69.2 NA NA ...
## $ latitude : num [1:296] 12.5 NA 34.5 NA NA ...
## $ region_iso3c : chr [1:296] "LCN" NA "SAS" NA ...
## $ region_iso2c : chr [1:296] "ZJ" NA "8S" NA ...
## $ region : chr [1:296] "Latin America & Caribbean" "Aggregates" "South Asia" "Aggregates" ...
## $ admin_region_iso3c: chr [1:296] NA NA "SAS" NA ...
## $ admin_region_iso2c: chr [1:296] NA NA "8S" NA ...
## $ admin_region : chr [1:296] NA NA "South Asia" NA ...
## $ income_level_iso3c: chr [1:296] "HIC" NA "LIC" NA ...
## $ income_level_iso2c: chr [1:296] "XD" NA "XM" NA ...
## $ income_level : chr [1:296] "High income" "Aggregates" "Low income" "Aggregates" ...
## $ lending_type_iso3c: chr [1:296] "LNX" NA "IDX" NA ...
## $ lending_type_iso2c: chr [1:296] "XX" NA "XI" NA ...
## $ lending_type : chr [1:296] "Not classified" "Aggregates" "IDA" "Aggregates" ...
As we saw the other day, there are three conditions to join data sets:
In our case there are three common variables: iso2c, iso3c and country. We can join based on any of them, although it is better to use a code variable.
When there are several common variable we can do any of these things:
by argument.wbind <- wbind |>
left_join(countries, by=c("iso2c","iso3c","country"))
by variables leaving it blank. This can be problematic if not all variables share the content, for instance, if the country names differ among the data sets.wbind <- wbind |>
left_join(countries)
join and use only the selected join variable/s. Otherwise R changes the names of the variables with common name in both objects adding a suffix since otherwise there would be two variables with the same name, something forbidden in a tibble. This strategy takes a bit longer but it is safer.wbind <- wbind |>
left_join(countries |> select(-iso2c,-country), by="iso3c")
Since we know the two data sets are compatible coming from the same source, We opt for the second (shortest) choice.
wbind <- wbind |>
left_join(countries)
## Joining with `by = join_by(iso2c, iso3c, country, capital_city, longitude,
## latitude, region_iso3c, region_iso2c, region, admin_region_iso3c,
## admin_region_iso2c, admin_region, income_level_iso3c, income_level_iso2c,
## income_level, lending_type_iso3c, lending_type_iso2c, lending_type)`
Note the message telling us the variables used for joining.
Now wbind has all the information in country:
str(wbind)
## tibble [217 × 22] (S3: tbl_df/tbl/data.frame)
## $ iso2c : chr [1:217] "AW" "AF" "AO" "AL" ...
## $ iso3c : chr [1:217] "ABW" "AFG" "AGO" "ALB" ...
## $ country : chr [1:217] "Aruba" "Afghanistan" "Angola" "Albania" ...
## $ date : num [1:217] 2021 2021 2021 2021 2021 ...
## $ gdp_capita : num [1:217] 29128 356 1927 6377 42072 ...
## ..- attr(*, "label")= chr "GDP per capita (current US$)"
## $ life_expectancy : num [1:217] 74.6 62 61.6 76.5 NA ...
## ..- attr(*, "label")= chr "Life expectancy at birth, total (years)"
## $ pop : num [1:217] 106537 40099462 34503774 2811666 79034 ...
## ..- attr(*, "label")= chr "Population, total"
## $ capital_city : chr [1:217] "Oranjestad" "Kabul" "Luanda" "Tirane" ...
## $ longitude : num [1:217] -70.02 69.18 13.24 19.82 1.52 ...
## $ latitude : num [1:217] 12.52 34.52 -8.81 41.33 42.51 ...
## $ region_iso3c : chr [1:217] "LCN" "SAS" "SSF" "ECS" ...
## $ region_iso2c : chr [1:217] "ZJ" "8S" "ZG" "Z7" ...
## $ region : chr [1:217] "Latin America & Caribbean" "South Asia" "Sub-Saharan Africa" "Europe & Central Asia" ...
## $ admin_region_iso3c: chr [1:217] NA "SAS" "SSA" "ECA" ...
## $ admin_region_iso2c: chr [1:217] NA "8S" "ZF" "7E" ...
## $ admin_region : chr [1:217] NA "South Asia" "Sub-Saharan Africa (excluding high income)" "Europe & Central Asia (excluding high income)" ...
## $ income_level_iso3c: chr [1:217] "HIC" "LIC" "LMC" "UMC" ...
## $ income_level_iso2c: chr [1:217] "XD" "XM" "XN" "XT" ...
## $ income_level : chr [1:217] "High income" "Low income" "Lower middle income" "Upper middle income" ...
## $ lending_type_iso3c: chr [1:217] "LNX" "IDX" "IBD" "IBD" ...
## $ lending_type_iso2c: chr [1:217] "XX" "XI" "XF" "XF" ...
## $ lending_type : chr [1:217] "Not classified" "IDA" "IBRD" "IBRD" ...
See what these applications from the other day do:
wbind |>
filter(admin_region=="South Asia") |>
select(country,gdp_capita:pop) |>
flextable() |> # Create nice table
colformat_double(digits = 1) # Limit to one the decimal digits.
country | gdp_capita | life_expectancy | pop |
|---|---|---|---|
Afghanistan | 355.8 | 62.0 | 40,099,462.0 |
Bangladesh | 2,457.9 | 72.4 | 169,356,251.0 |
Bhutan | 3,560.2 | 71.8 | 777,486.0 |
India | 2,238.1 | 67.2 | 1,407,563,842.0 |
Sri Lanka | 3,996.6 | 76.4 | 22,156,000.0 |
Maldives | 10,076.3 | 79.9 | 521,457.0 |
Nepal | 1,229.4 | 68.5 | 30,034,989.0 |
Pakistan | 1,506.1 | 66.1 | 231,402,117.0 |
The effects of
group_by
wbind |>
group_by(region) |>
arrange(life_expectancy) |>
slice(1) |>
select(region,country,life_expectancy) |>
flextable() |> # Create nice table
colformat_double(digits = 0) # No decimal digits.
region | country | life_expectancy |
|---|---|---|
East Asia & Pacific | Nauru | 64 |
Europe & Central Asia | Moldova | 69 |
Latin America & Caribbean | Haiti | 63 |
Middle East & North Africa | Djibouti | 62 |
North America | United States | 76 |
South Asia | Afghanistan | 62 |
Sub-Saharan Africa | Chad | 53 |
The effects of
group_by
mutate and renaming variables with renameThe GDP per capita is a measure of standard of living, how much money the average person makes. Sometimes we want to compare the size of the economy. We can create new variables specifying the formula that creates them. In this case, GDP=pop*gdp_capita. We should also pay attention to units. Since gdp_capita was measured in dollars, we get too many figures. We can divide by \(1e9\) to express the size of the economy in billions of dollars.
wbind |>
mutate(GDP=pop*gdp_capita/1e9) |>
arrange(-GDP) |>
select(country,GDP,gdp_capita,pop) |>
slice(1:10) |>
rename(`Per capita GDP`=gdp_capita,
`Population`=pop) |>
flextable() |> # Create nice table
colformat_double(digits = 0) # No decimal digits.
country | GDP | Per capita GDP | Population |
|---|---|---|---|
United States | 23,315 | 70,219 | 332,031,554 |
China | 17,820 | 12,618 | 1,412,360,000 |
Japan | 5,035 | 40,059 | 125,681,593 |
Germany | 4,279 | 51,427 | 83,196,078 |
India | 3,150 | 2,238 | 1,407,563,842 |
United Kingdom | 3,142 | 46,870 | 67,026,292 |
France | 2,959 | 43,671 | 67,764,304 |
Italy | 2,155 | 36,449 | 59,133,173 |
Canada | 2,007 | 52,515 | 38,226,498 |
Korea, Rep. | 1,818 | 35,142 | 51,744,876 |
A very interesting possibility is to combine mutate with a group_by. This results in new columns that are calculated for each subtable separately. For instance, we can define at the region level the maximum life expectancy in the region minus the present one, and then check the countries whose life expectancy is farthest from their region’s leader:
wbind |>
group_by(region) |>
mutate(dif_LE=max(life_expectancy)-life_expectancy) |>
ungroup() |>
arrange(-dif_LE) |>
select(region, country,
LE=life_expectancy,
dif_LE) |>
slice(1:10) |>
flextable() |> # Create nice table
colformat_double(digits = 2) # 2 decimal digits.
region | country | LE | dif_LE |
|---|---|---|---|
Sub-Saharan Africa | Chad | 52.52 | 21.53 |
Sub-Saharan Africa | Nigeria | 52.68 | 21.38 |
Sub-Saharan Africa | Lesotho | 53.06 | 20.99 |
Middle East & North Africa | Djibouti | 62.30 | 20.56 |
Sub-Saharan Africa | Central African Republic | 53.90 | 20.16 |
Middle East & North Africa | Yemen, Rep. | 63.75 | 19.11 |
Sub-Saharan Africa | South Sudan | 54.98 | 19.08 |
Sub-Saharan Africa | Somalia | 55.28 | 18.77 |
South Asia | Afghanistan | 61.98 | 17.94 |
Sub-Saharan Africa | Eswatini | 57.07 | 16.99 |
It is almost all countries in Africa! They are lagging behind Cabo Verde where life expectancy is 74.052.
For a simple table:
For grouped data:
summarizesummarize takes a data table (or subtable, if groups are defined), and defines a summary indicator for the complete table. It is generally used in combination with functions such as n(), for counts, max() or min() (maximum and minimum of a variable in a group), mean() or median(), providing average levels. In the case of average life expectancy at the region level, we should not use the simple average because some countries have more population than others. We should use the weighted average, giving more weight according to population.
## Average (weighted) levels by continent
wbind |>
group_by(region) |> # Analyze each region separately
drop_na(pop) |> # Drop countries with no pop data (Eritrea)
summarize(`Weighted average`=weighted.mean(life_expectancy,pop,na.rm=TRUE),
`Simple average`=mean(life_expectancy,na.rm=TRUE),
Countries=n(),
Missing=sum(is.na(life_expectancy)),
`Pop (millions)`=sum(pop)/1e6
) |> # Calculate summary statistics by group
arrange(`Weighted average`) |> # Order regions according to increasing average level
flextable() |> # Create nice table
colformat_double(digits = 1) |>
set_caption("Average life expectancy by region")
region | Weighted average | Simple average | Countries | Missing | Pop (millions) |
|---|---|---|---|---|---|
Sub-Saharan Africa | 60.2 | 61.8 | 48 | 0 | 1,181.2 |
South Asia | 67.6 | 70.5 | 8 | 0 | 1,901.9 |
Latin America & Caribbean | 72.1 | 73.0 | 42 | 2 | 655.0 |
Middle East & North Africa | 72.7 | 74.4 | 21 | 0 | 486.2 |
East Asia & Pacific | 76.3 | 73.6 | 37 | 3 | 2,346.7 |
Europe & Central Asia | 76.6 | 77.1 | 58 | 3 | 923.6 |
North America | 77.0 | 79.4 | 3 | 0 | 370.3 |
We see the weighted average is generally smaller. This is because small countries (eg: Monaco, Andorra, Mauritius, Hong-Kong), tend to have higher life expectancy.
Basic, necessary, elements:
Optional layers:
# Doing graphs
The next day we will explore in detail how graphs are done using {ggplot2}. The idea is to:
Get the data object, generally a tibble.
Define what is in the paper, the aesthetics: Which are the \(x\) and \(y\) variables, do we want to use color linked to another variable, or size, …
Add the geometrical objects geom_xxx that we want.
Adding things to the graphic is done through + (instead of |>)!
Finally there is the possibility of adding more layers. This is a summary from this source
This is the code producing our graph. Try to make sense of each layer.
wbind |>
ggplot(aes(
x = gdp_capita,
y = life_expectancy
)) +
geom_point(aes(size=pop, color=region)) +
geom_smooth(se = FALSE) +
scale_x_log10(
labels = scales::dollar_format(),
breaks = scales::log_breaks(n = 4)
) +
scale_size_continuous(
labels = scales::number_format(scale = 1/1e6, suffix = "m"),
breaks = seq(1e8,1e9, 2e8),
range = c(1,20)
) +
theme_minimal() +
labs(
title = "Life expectancy according to GDP per capita",
x = "GDP per Capita",
y = "Life expectancy (years)",
size = "Population",
color = NULL,
caption = "Source: World Bank, through `wbstats`"
)
ggplot2Mapping the data is also relatively simple. We only need to specify a base map to which the color will depend on per-capita income. We can do it with ggplot2 replicating the vignette of {wbstats}.
One possibility is the world map in package tmap. We get it in our environment by:
library(tmap)
data(World)
The object is an sf (simple features) object. Basically a data object with one specific variable, geometry, that contains the polygons defining the country map.
str(World)
## Classes 'sf' and 'data.frame': 177 obs. of 16 variables:
## $ iso_a3 : Factor w/ 177 levels "AFG","AGO","ALB",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ name : Factor w/ 177 levels "Afghanistan",..: 1 4 2 166 6 7 5 56 8 9 ...
## $ sovereignt : Factor w/ 171 levels "Afghanistan",..: 1 4 2 159 6 7 5 52 8 9 ...
## $ continent : Factor w/ 8 levels "Africa","Antarctica",..: 3 1 4 3 8 3 2 7 6 4 ...
## $ area : Units: [km^2] num 652860 1246700 27400 71252 2736690 ...
## $ pop_est : num 28400000 12799293 3639453 4798491 40913584 ...
## $ pop_est_dens: num 43.5 10.3 132.8 67.3 15 ...
## $ economy : Factor w/ 7 levels "1. Developed region: G7",..: 7 7 6 6 5 6 6 6 2 2 ...
## $ income_grp : Factor w/ 5 levels "1. High income: OECD",..: 5 3 4 2 3 4 2 2 1 1 ...
## $ gdp_cap_est : num 784 8618 5993 38408 14027 ...
## $ life_exp : num 59.7 NA 77.3 NA 75.9 ...
## $ well_being : num 3.8 NA 5.5 NA 6.5 4.3 NA NA 7.2 7.4 ...
## $ footprint : num 0.79 NA 2.21 NA 3.14 2.23 NA NA 9.31 6.06 ...
## $ inequality : num 0.427 NA 0.165 NA 0.164 ...
## $ HPI : num 20.2 NA 36.8 NA 35.2 ...
## $ geometry :sfc_MULTIPOLYGON of length 177; first list element: List of 1
## ..$ :List of 1
## .. ..$ : num [1:69, 1:2] 61.2 62.2 63 63.2 64 ...
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr [1:15] "iso_a3" "name" "sovereignt" "continent" ...
class(World)
## [1] "sf" "data.frame"
It already has some variables that could be used for maps.
We can also see the map by:
ggplot(data = World) +
geom_sf()
In order to map our data we have to bring it to the World object. Let us see how:
mi_mapa=World |>
filter(iso_a3 != "ATA") |> # remove Antarctica
left_join(wbind |> rename(iso_a3=iso3c) |> select(iso_a3,gdp_capita, pop,life_expectancy),
by="iso_a3")
Now we do the new map
mi_mapa |> ggplot(aes(fill = life_expectancy)) +
geom_sf() +
scale_fill_viridis_c() +
theme(legend.position="bottom") +
labs(
title = "Map of life expectancy in 2021",
fill = NULL,
caption = paste("Source: World Bank")
) +
coord_sf()