We have already seen key operations in data management:
select
to select variables in our datafilter
to keep a subset of observations according to
logical conditions,group_by
in combination
with mutate
to create new variables,group_by
in combination with arrange
and
slice
to keep the top values of a given variable,group_by
with summarize
to extract summary
information for each group,left_join
to put together two different data sets that
have one or more variables in common.We are going to increase our set of instruments with two basic
operations that arise when we want to redefine the meaning of our unit
of analysis that goes in rows and the variables that go in columns.
pivot_longer
leads to longer data where the unit of
analysis is defined by a combination of current variables, and
pivot_wider
makes the opposite: observations that appeared
in different rows now define different variables. A complete reference
on pivoting is Altman et
al(2021)
Examples when we need to pivot wider: We have observations for a country or region in different years and we want to have only one observation by country with columns for different years, possibly in order to later compute measures of change.
Examples of the need to pivot longer: When
making ggplot
graphs, the data generally needs to be in
long format. For instance, if we want to make a separate graph for each
year, or represent several variables in a plot, we will need to have
different observations for each year or for each of the variables. A
more complex scenario: variables might be defined by a combination of
year and variable (gni_2020
), and we want the observation
to be a country-year combination.
We will work with SDG indicators downloaded with {goalie}, in particular SDG
indicator 7.3.1, : “Energy intensity measured in terms of primary energy
and GDP”. It measures the energy intensity level of primary energy. We
can identify the name based on sdg_overview()
:
EG_EGY_PRIM. As specified in the metadata,
a lower value indicates that less energy is used to produce one unit of
output and decreasing trends indicate progress.
We load the packages and download the data with:
library(goalie)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
energy=sdg_data("EG_EGY_PRIM")
[NOTE: You might need to reinstall goalie
with
remotes::install_github("gpw13/goalie")
]
In order to see the distribution of the data in energy
we can inspect the data to confirm that for every country we have
information for every year from 2000 to 2021. We can also do it by
summarizing by country:
energy |> group_by(GeoAreaName) |>
summarize(first=min(TimePeriod),last=max(TimePeriod),n=n())
## # A tibble: 269 × 4
## GeoAreaName first last n
## <chr> <dbl> <dbl> <int>
## 1 Afghanistan 2000 2021 22
## 2 Africa 2000 2021 22
## 3 Albania 2000 2021 22
## 4 Algeria 2000 2021 22
## 5 American Samoa 2000 2021 22
## 6 Americas 2000 2021 22
## 7 Andorra 2000 2021 22
## 8 Angola 2000 2021 22
## 9 Anguilla 2000 2021 22
## 10 Antigua and Barbuda 2000 2021 22
## # ℹ 259 more rows
We can also observe the structure of the data by hitting the blue
arrow in RStudio or using str
.
str(energy)
## spc_tbl_ [5,918 × 20] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Goal : num [1:5918] 7 7 7 7 7 7 7 7 7 7 ...
## $ Target : num [1:5918] 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 7.3 ...
## $ Indicator : chr [1:5918] "7.3.1" "7.3.1" "7.3.1" "7.3.1" ...
## $ SeriesCode : chr [1:5918] "EG_EGY_PRIM" "EG_EGY_PRIM" "EG_EGY_PRIM" "EG_EGY_PRIM" ...
## $ SeriesDescription: chr [1:5918] "Energy intensity level of primary energy (megajoules per constant 2017 purchasing power parity GDP)" "Energy intensity level of primary energy (megajoules per constant 2017 purchasing power parity GDP)" "Energy intensity level of primary energy (megajoules per constant 2017 purchasing power parity GDP)" "Energy intensity level of primary energy (megajoules per constant 2017 purchasing power parity GDP)" ...
## $ GeoAreaCode : num [1:5918] 1 1 1 1 1 1 1 1 1 1 ...
## $ GeoAreaName : chr [1:5918] "World" "World" "World" "World" ...
## $ TimePeriod : num [1:5918] 2000 2001 2002 2003 2004 ...
## $ Value : chr [1:5918] "6.11" "6.02" "5.99" "5.98" ...
## $ Time_Detail : num [1:5918] 2000 2001 2002 2003 2004 ...
## $ TimeCoverage : logi [1:5918] NA NA NA NA NA NA ...
## $ UpperBound : logi [1:5918] NA NA NA NA NA NA ...
## $ LowerBound : logi [1:5918] NA NA NA NA NA NA ...
## $ BasePeriod : num [1:5918] 2017 2017 2017 2017 2017 ...
## $ Source : chr [1:5918] "IEA (2023), World Energy Balances; Energy Balances, UN Statistics Division (2023)" "IEA (2023), World Energy Balances; Energy Balances, UN Statistics Division (2023)" "IEA (2023), World Energy Balances; Energy Balances, UN Statistics Division (2023)" "IEA (2023), World Energy Balances; Energy Balances, UN Statistics Division (2023)" ...
## $ GeoInfoUrl : logi [1:5918] NA NA NA NA NA NA ...
## $ FootNote : chr [1:5918] NA NA NA NA ...
## $ Nature : chr [1:5918] "N" "N" "N" "N" ...
## $ Reporting_Type : chr [1:5918] "G" "G" "G" "G" ...
## $ Units : chr [1:5918] "MJ_PER_GDP_CON_PPP_USD" "MJ_PER_GDP_CON_PPP_USD" "MJ_PER_GDP_CON_PPP_USD" "MJ_PER_GDP_CON_PPP_USD" ...
## - attr(*, "problems")=<externalptr>
There is one problem we have to bear in mind: The values have been
imported as text! We will need to convert them to number using
as.numeric
.
energy = energy |>
mutate(Value=as.numeric(Value))
Let us consider the energy intensity in Spain, France and Portugal. The data is initially in a long format as we see:
myen=energy |> filter(GeoAreaName%in%c("Spain","France","Portugal")) |> select(GeoAreaName,TimePeriod,Value)
myen
## # A tibble: 66 × 3
## GeoAreaName TimePeriod Value
## <chr> <dbl> <dbl>
## 1 France 2000 4.33
## 2 France 2001 4.4
## 3 France 2002 4.35
## 4 France 2003 4.43
## 5 France 2004 4.38
## 6 France 2005 4.32
## 7 France 2006 4.14
## 8 France 2007 4
## 9 France 2008 4.02
## 10 France 2009 3.95
## # ℹ 56 more rows
In order to pivot to wider the two key arguments are
names_from
, the column containing the names of the new
variables, and values_from
, the column containing the
values. We can make a summary table with data every five years using a
previous filter
(or a later select
). The
pivoting to have years as columns, and create a column with overall
percent change would be:
en_years=myen |>
filter(TimePeriod%in%c(2000,2005,2010,2015,2020,2021)) |>
pivot_wider(names_from="TimePeriod",
values_from="Value") |>
mutate(Change=(`2021`-`2000`)/`2000`*100)
en_years
## # A tibble: 3 × 8
## GeoAreaName `2000` `2005` `2010` `2015` `2020` `2021` Change
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 France 4.33 4.32 4 3.65 3.18 3.23 -25.4
## 2 Portugal 3.29 3.39 2.92 2.85 2.56 2.42 -26.4
## 3 Spain 3.6 3.57 3.07 2.85 2.67 2.66 -26.1
We see that in this case each country has a different line. In terms of interpretation, rates of improvement are similar in the 3 countries. On the other hand, the rankings are preserved with Portugal’s production being more efficient than in Spain, and France being less efficient.
Note that we could also pivot in a different direction, so that countries are columns and observations correspond to different years:
en_countries=myen |>
pivot_wider(names_from=GeoAreaName,
values_from=Value) |>
mutate(RelFr=France/Portugal,RelSp=Spain/Portugal)
en_countries
## # A tibble: 22 × 6
## TimePeriod France Portugal Spain RelFr RelSp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000 4.33 3.29 3.6 1.32 1.09
## 2 2001 4.4 3.25 3.55 1.35 1.09
## 3 2002 4.35 3.36 3.56 1.29 1.06
## 4 2003 4.43 3.3 3.58 1.34 1.08
## 5 2004 4.38 3.33 3.62 1.32 1.09
## 6 2005 4.32 3.39 3.57 1.27 1.05
## 7 2006 4.14 3.17 3.42 1.31 1.08
## 8 2007 4 3.12 3.35 1.28 1.07
## 9 2008 4.02 3.03 3.21 1.33 1.06
## 10 2009 3.95 3.08 3.07 1.28 0.997
## # ℹ 12 more rows
This highlights the differences by country in a particular year and the pace of relative efficiency by dividing by Portugal. We can see that Spain became more efficient than Portugal in 2009, but Portugal has a low advantage afterwards.
pivot_longer
While for communication purposes a wider table might be preferrable,
for plotting it is better to have the data in the long format. The
original data is already in long format, but we can transform it back
using pivot_longer
. The main arguments are, the columns
that contain the data to be pivotted, in the case of
en_years
, 2000
:2021
and
names_to
, the new column with the names.
en_years |>
pivot_longer(`2000`:`2021`,names_to="Year")
## # A tibble: 18 × 4
## GeoAreaName Change Year value
## <chr> <dbl> <chr> <dbl>
## 1 France -25.4 2000 4.33
## 2 France -25.4 2005 4.32
## 3 France -25.4 2010 4
## 4 France -25.4 2015 3.65
## 5 France -25.4 2020 3.18
## 6 France -25.4 2021 3.23
## 7 Portugal -26.4 2000 3.29
## 8 Portugal -26.4 2005 3.39
## 9 Portugal -26.4 2010 2.92
## 10 Portugal -26.4 2015 2.85
## 11 Portugal -26.4 2020 2.56
## 12 Portugal -26.4 2021 2.42
## 13 Spain -26.1 2000 3.6
## 14 Spain -26.1 2005 3.57
## 15 Spain -26.1 2010 3.07
## 16 Spain -26.1 2015 2.85
## 17 Spain -26.1 2020 2.67
## 18 Spain -26.1 2021 2.66
In this case we keep the column with the percent change from 2000 to
2021. We could remove it with select(-Change)
.
We can see that ggplot
graphs generally require the data
in long format. Two examples depending on whether we want to plot in the
same graph or in separate graphs. In this case it is more informative to
plot together.
myen |>
ggplot(aes(x=TimePeriod,y=Value,color=GeoAreaName)) +
geom_line() +
geom_point() +
labs(color="Country",x="Year",y="Energy efficiency")
Note that the graph communicates more clearly what we had found in the
previous tables.
The separate graphs can be plotted with facets
:
myen |>
ggplot(aes(x=TimePeriod,y=Value)) +
geom_line() +
facet_grid(GeoAreaName~.) +
labs(x="Year",y="Energy efficiency")
myen |>
ggplot(aes(x=TimePeriod,y=Value)) +
geom_line() +
facet_grid(~GeoAreaName) +
labs(x="Year",y="Energy efficiency")
We have studied HDI
data, noting the problems of the data in excel format not including
ISO codes. Fortunately, there is another
link to the data in csv
format. This means “comma
separated values”. There is also metadata
available containing the naming conventions. We can bring the data
into the computer with
download.file("https://hdr.undp.org/sites/default/files/2021-22_HDR/HDR21-22_Composite_indices_complete_time_series.csv",
"HDI.csv")
We can then import the data using the menu
Import dataset
when clicking on the HDI.csv
icon in the Files
window, or using this command:
HDI <- read_csv("HDI.csv")
## Rows: 206 Columns: 1076
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): iso3, country, hdicode, region
## dbl (1072): hdi_rank_2022, hdi_1990, hdi_1991, hdi_1992, hdi_1993, hdi_1994,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note there are many columns!, 1076!. This comes from the combinations of variable name and year for all variables included in the main HDI index, but also all the components!
How can we get the data in a long format? First, we need to select
the variables we want to transfer. Then we need to specify the
names_to
argument and other options with care to specify
what is a new name for a variable, what are the values. See here
a detailed discussion.
HDIlong = HDI %>% select(iso3:gnipc_2022) |>
pivot_longer(cols = hdi_1990:gnipc_2022,
names_to = c(".value","Year"),
names_sep= "_",
names_transform = list(Year=as.integer) )
In this particular case we are using the convention that, if
".value"
is used as one of the names_to
, the
first part of the variable name is kept as the name of the new variable.
Note that the data structure has changed now to a long structure:
str(HDIlong)
## tibble [6,798 × 11] (S3: tbl_df/tbl/data.frame)
## $ iso3 : chr [1:6798] "AFG" "AFG" "AFG" "AFG" ...
## $ country : chr [1:6798] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ hdicode : chr [1:6798] "Low" "Low" "Low" "Low" ...
## $ region : chr [1:6798] "SA" "SA" "SA" "SA" ...
## $ hdi_rank_2022: num [1:6798] 182 182 182 182 182 182 182 182 182 182 ...
## $ Year : int [1:6798] 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
## $ hdi : num [1:6798] 0.284 0.292 0.299 0.307 0.3 0.318 0.326 0.33 0.329 0.337 ...
## $ le : num [1:6798] 46 46.7 47.6 51.5 51.5 ...
## $ eys : num [1:6798] 2.94 3.23 3.52 3.81 4.1 ...
## $ mys : num [1:6798] 0.872 0.915 0.959 1.002 1.045 ...
## $ gnipc : num [1:6798] 3116 2817 2475 1723 1202 ...
We can use the new data to create a graph similar to the HDI website:
HDIlong |>
ggplot(aes(x=Year,y=hdi,group=country,
color=hdicode)) +
geom_line()
Or, in interactive form (if {plotly}
is available). Note
that we had to define a new country name variable that removed non-ascii
characters to display correctly.
library(plotly);library(stringi)
p = HDIlong |> mutate(Country=stri_enc_toascii(country)) |>
ggplot(aes(x=Year,y=hdi,group=Country,
color=hdicode)) +
geom_line()
ggplotly(p)
If you are interested in practicing, you can try to add format to the plot, or else, do similar graphs for the components of HDI.