Processing math: 100%

Why pivoting?

We have already seen key operations in data management:

We are going to increase our set of instruments with two basic operations that arise when we want to redefine the meaning of our unit of analysis that goes in rows and the variables that go in columns. pivot_longer leads to longer data where the unit of analysis is defined by a combination of current variables, and pivot_wider makes the opposite: observations that appeared in different rows now define different variables. A complete reference on pivoting is Altman et al(2021)

Loading the data

We will work with SDG indicators downloaded with {goalie}, in particular SDG indicator 7.3.1, : “Energy intensity measured in terms of primary energy and GDP”. It measures the energy intensity level of primary energy. We can identify the name based on sdg_overview(): EG_EGY_PRIM. As specified in the metadata, a lower value indicates that less energy is used to produce one unit of output and decreasing trends indicate progress.

We load the packages and download the data with:

library(goalie)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
energy=sdg_data("EG_EGY_PRIM")

[NOTE: You might need to reinstall goalie with remotes::install_github("gpw13/goalie")]

In order to see the distribution of the data in energy we can inspect the data to confirm that for every country we have information for every year from 2000 to 2021. We can also do it by summarizing by country:

energy  |>  group_by(GeoAreaName) |> 
  summarize(first=min(TimePeriod),last=max(TimePeriod),n=n())
## # A tibble: 269 × 4
##    GeoAreaName         first  last     n
##    <chr>               <dbl> <dbl> <int>
##  1 Afghanistan          2000  2021    22
##  2 Africa               2000  2021    22
##  3 Albania              2000  2021    22
##  4 Algeria              2000  2021    22
##  5 American Samoa       2000  2021    22
##  6 Americas             2000  2021    22
##  7 Andorra              2000  2021    22
##  8 Angola               2000  2021    22
##  9 Anguilla             2000  2021    22
## 10 Antigua and Barbuda  2000  2021    22
## # ℹ 259 more rows

We can also observe the structure of the data by hitting the blue arrow in RStudio or using str.

str(energy)
## spc_tbl_ [5,918 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Goal             : num [1:5918] 7 7 7 7 7 7 7 7 7 7 ...
##  $ Target           : num [1:5918] 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 ...
##  $ Indicator        : chr [1:5918] "7.3.1" "7.3.1" "7.3.1" "7.3.1" ...
##  $ SeriesCode       : chr [1:5918] "EG_EGY_PRIM" "EG_EGY_PRIM" "EG_EGY_PRIM" "EG_EGY_PRIM" ...
##  $ SeriesDescription: chr [1:5918] "Energy intensity level of primary energy (megajoules per constant 2017 purchasing power parity GDP)" "Energy intensity level of primary energy (megajoules per constant 2017 purchasing power parity GDP)" "Energy intensity level of primary energy (megajoules per constant 2017 purchasing power parity GDP)" "Energy intensity level of primary energy (megajoules per constant 2017 purchasing power parity GDP)" ...
##  $ GeoAreaCode      : num [1:5918] 1 1 1 1 1 1 1 1 1 1 ...
##  $ GeoAreaName      : chr [1:5918] "World" "World" "World" "World" ...
##  $ TimePeriod       : num [1:5918] 2000 2001 2002 2003 2004 ...
##  $ Value            : chr [1:5918] "6.11" "6.02" "5.99" "5.98" ...
##  $ Time_Detail      : num [1:5918] 2000 2001 2002 2003 2004 ...
##  $ TimeCoverage     : logi [1:5918] NA NA NA NA NA NA ...
##  $ UpperBound       : logi [1:5918] NA NA NA NA NA NA ...
##  $ LowerBound       : logi [1:5918] NA NA NA NA NA NA ...
##  $ BasePeriod       : num [1:5918] 2017 2017 2017 2017 2017 ...
##  $ Source           : chr [1:5918] "IEA (2023), World Energy Balances; Energy Balances, UN Statistics Division (2023)" "IEA (2023), World Energy Balances; Energy Balances, UN Statistics Division (2023)" "IEA (2023), World Energy Balances; Energy Balances, UN Statistics Division (2023)" "IEA (2023), World Energy Balances; Energy Balances, UN Statistics Division (2023)" ...
##  $ GeoInfoUrl       : logi [1:5918] NA NA NA NA NA NA ...
##  $ FootNote         : chr [1:5918] NA NA NA NA ...
##  $ Nature           : chr [1:5918] "N" "N" "N" "N" ...
##  $ Reporting_Type   : chr [1:5918] "G" "G" "G" "G" ...
##  $ Units            : chr [1:5918] "MJ_PER_GDP_CON_PPP_USD" "MJ_PER_GDP_CON_PPP_USD" "MJ_PER_GDP_CON_PPP_USD" "MJ_PER_GDP_CON_PPP_USD" ...
##  - attr(*, "problems")=<externalptr>

There is one problem we have to bear in mind: The values have been imported as text! We will need to convert them to number using as.numeric.

energy = energy |>
       mutate(Value=as.numeric(Value))

Pivot to wider

Let us consider the energy intensity in Spain, France and Portugal. The data is initially in a long format as we see:

myen=energy |> filter(GeoAreaName%in%c("Spain","France","Portugal")) |> select(GeoAreaName,TimePeriod,Value) 
myen
## # A tibble: 66 × 3
##    GeoAreaName TimePeriod Value
##    <chr>            <dbl> <dbl>
##  1 France            2000  4.33
##  2 France            2001  4.4 
##  3 France            2002  4.35
##  4 France            2003  4.43
##  5 France            2004  4.38
##  6 France            2005  4.32
##  7 France            2006  4.14
##  8 France            2007  4   
##  9 France            2008  4.02
## 10 France            2009  3.95
## # ℹ 56 more rows

In order to pivot to wider the two key arguments are names_from, the column containing the names of the new variables, and values_from, the column containing the values. We can make a summary table with data every five years using a previous filter (or a later select). The pivoting to have years as columns, and create a column with overall percent change would be:

en_years=myen |> 
  filter(TimePeriod%in%c(2000,2005,2010,2015,2020,2021)) |> 
  pivot_wider(names_from="TimePeriod",
              values_from="Value") |> 
  mutate(Change=(`2021`-`2000`)/`2000`*100)
en_years
## # A tibble: 3 × 8
##   GeoAreaName `2000` `2005` `2010` `2015` `2020` `2021` Change
##   <chr>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 France        4.33   4.32   4      3.65   3.18   3.23  -25.4
## 2 Portugal      3.29   3.39   2.92   2.85   2.56   2.42  -26.4
## 3 Spain         3.6    3.57   3.07   2.85   2.67   2.66  -26.1

We see that in this case each country has a different line. In terms of interpretation, rates of improvement are similar in the 3 countries. On the other hand, the rankings are preserved with Portugal’s production being more efficient than in Spain, and France being less efficient.

Note that we could also pivot in a different direction, so that countries are columns and observations correspond to different years:

en_countries=myen |>
  pivot_wider(names_from=GeoAreaName,
              values_from=Value) |> 
  mutate(RelFr=France/Portugal,RelSp=Spain/Portugal)
en_countries
## # A tibble: 22 × 6
##    TimePeriod France Portugal Spain RelFr RelSp
##         <dbl>  <dbl>    <dbl> <dbl> <dbl> <dbl>
##  1       2000   4.33     3.29  3.6   1.32 1.09 
##  2       2001   4.4      3.25  3.55  1.35 1.09 
##  3       2002   4.35     3.36  3.56  1.29 1.06 
##  4       2003   4.43     3.3   3.58  1.34 1.08 
##  5       2004   4.38     3.33  3.62  1.32 1.09 
##  6       2005   4.32     3.39  3.57  1.27 1.05 
##  7       2006   4.14     3.17  3.42  1.31 1.08 
##  8       2007   4        3.12  3.35  1.28 1.07 
##  9       2008   4.02     3.03  3.21  1.33 1.06 
## 10       2009   3.95     3.08  3.07  1.28 0.997
## # ℹ 12 more rows

This highlights the differences by country in a particular year and the pace of relative efficiency by dividing by Portugal. We can see that Spain became more efficient than Portugal in 2009, but Portugal has a low advantage afterwards.

Long format with pivot_longer

While for communication purposes a wider table might be preferrable, for plotting it is better to have the data in the long format. The original data is already in long format, but we can transform it back using pivot_longer. The main arguments are, the columns that contain the data to be pivotted, in the case of en_years, 2000:2021 and names_to, the new column with the names.

en_years |>
  pivot_longer(`2000`:`2021`,names_to="Year")
## # A tibble: 18 × 4
##    GeoAreaName Change Year  value
##    <chr>        <dbl> <chr> <dbl>
##  1 France       -25.4 2000   4.33
##  2 France       -25.4 2005   4.32
##  3 France       -25.4 2010   4   
##  4 France       -25.4 2015   3.65
##  5 France       -25.4 2020   3.18
##  6 France       -25.4 2021   3.23
##  7 Portugal     -26.4 2000   3.29
##  8 Portugal     -26.4 2005   3.39
##  9 Portugal     -26.4 2010   2.92
## 10 Portugal     -26.4 2015   2.85
## 11 Portugal     -26.4 2020   2.56
## 12 Portugal     -26.4 2021   2.42
## 13 Spain        -26.1 2000   3.6 
## 14 Spain        -26.1 2005   3.57
## 15 Spain        -26.1 2010   3.07
## 16 Spain        -26.1 2015   2.85
## 17 Spain        -26.1 2020   2.67
## 18 Spain        -26.1 2021   2.66

In this case we keep the column with the percent change from 2000 to 2021. We could remove it with select(-Change).

Long format for ggplot

We can see that ggplot graphs generally require the data in long format. Two examples depending on whether we want to plot in the same graph or in separate graphs. In this case it is more informative to plot together.

myen |> 
  ggplot(aes(x=TimePeriod,y=Value,color=GeoAreaName)) +
  geom_line() +
  geom_point() +
  labs(color="Country",x="Year",y="Energy efficiency")

Note that the graph communicates more clearly what we had found in the previous tables.

The separate graphs can be plotted with facets:

myen |> 
  ggplot(aes(x=TimePeriod,y=Value)) +
  geom_line() +
  facet_grid(GeoAreaName~.) +
  labs(x="Year",y="Energy efficiency")

myen |> 
  ggplot(aes(x=TimePeriod,y=Value)) +
  geom_line() +
  facet_grid(~GeoAreaName) +
  labs(x="Year",y="Energy efficiency")

Pivoting to long format datasets in wide form: HDI database

We have studied HDI data, noting the problems of the data in excel format not including ISO codes. Fortunately, there is another link to the data in csv format. This means “comma separated values”. There is also metadata available containing the naming conventions. We can bring the data into the computer with

download.file("https://hdr.undp.org/sites/default/files/2021-22_HDR/HDR21-22_Composite_indices_complete_time_series.csv",
              "HDI.csv")

We can then import the data using the menu Import dataset when clicking on the HDI.csv icon in the Files window, or using this command:

HDI <- read_csv("HDI.csv")
## Rows: 206 Columns: 1076
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (4): iso3, country, hdicode, region
## dbl (1072): hdi_rank_2022, hdi_1990, hdi_1991, hdi_1992, hdi_1993, hdi_1994,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note there are many columns!, 1076!. This comes from the combinations of variable name and year for all variables included in the main HDI index, but also all the components!

How can we get the data in a long format? First, we need to select the variables we want to transfer. Then we need to specify the names_to argument and other options with care to specify what is a new name for a variable, what are the values. See here a detailed discussion.

HDIlong = HDI %>% select(iso3:gnipc_2022) |> 
  pivot_longer(cols = hdi_1990:gnipc_2022,
               names_to = c(".value","Year"),
               names_sep= "_",
               names_transform = list(Year=as.integer)  )

In this particular case we are using the convention that, if ".value" is used as one of the names_to, the first part of the variable name is kept as the name of the new variable. Note that the data structure has changed now to a long structure:

str(HDIlong)
## tibble [6,798 × 11] (S3: tbl_df/tbl/data.frame)
##  $ iso3         : chr [1:6798] "AFG" "AFG" "AFG" "AFG" ...
##  $ country      : chr [1:6798] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ hdicode      : chr [1:6798] "Low" "Low" "Low" "Low" ...
##  $ region       : chr [1:6798] "SA" "SA" "SA" "SA" ...
##  $ hdi_rank_2022: num [1:6798] 182 182 182 182 182 182 182 182 182 182 ...
##  $ Year         : int [1:6798] 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ hdi          : num [1:6798] 0.284 0.292 0.299 0.307 0.3 0.318 0.326 0.33 0.329 0.337 ...
##  $ le           : num [1:6798] 46 46.7 47.6 51.5 51.5 ...
##  $ eys          : num [1:6798] 2.94 3.23 3.52 3.81 4.1 ...
##  $ mys          : num [1:6798] 0.872 0.915 0.959 1.002 1.045 ...
##  $ gnipc        : num [1:6798] 3116 2817 2475 1723 1202 ...

We can use the new data to create a graph similar to the HDI website:

HDIlong |>
  ggplot(aes(x=Year,y=hdi,group=country,
             color=hdicode)) +
  geom_line()

Or, in interactive form (if {plotly} is available). Note that we had to define a new country name variable that removed non-ascii characters to display correctly.

library(plotly);library(stringi)
p = HDIlong |> mutate(Country=stri_enc_toascii(country)) |> 
  ggplot(aes(x=Year,y=hdi,group=Country,
             color=hdicode)) +
  geom_line()
ggplotly(p)
19902000201020200.20.40.60.81.0
hdicodeHighLowMediumVery HighNAYearhdi

If you are interested in practicing, you can try to add format to the plot, or else, do similar graphs for the components of HDI.