1 The ESdata Package

ESdata is an R package that gives access to statistical information about Spain, structured as tidy data. If you already have the devtools package installed, you can install ESdata from GitHub with:

devtools::install_github("jmsallan/ESdata")

Once the package is installed, we can load it to access the data:

library(ESdata)

The package provides a set of data frames, and its only dependency is a version of R equal to or higher than 3.5.0. The tidyverse packages are a convenient way to explore this data:

library(tidyverse)

In its current version, the package has information related to:

Employment: Active Population Survey, Encuesta de Población Activa (EPA).
Inflation: Consumer Price Index, Índice de precios al consumo (IPC).
Growth: Gross Domestic Product (GDP) at market prices.

In this document we will show the information available on the Active Population Survey (EPA).

2 The Encuesta de Población Activa (EPA)

The Active Population Survey (Encuesta de Población Activa, EPA) is a statistical study designed to capture data on the labor market, following the standards of the International Labour Organization (ILO). Since unemployment is one of the most pressing problems of the Spanish economy, the quarterly publication of the EPA results receives wide media attention.

In Spain, the EPA is carried out by the National Statistics Institute (Instituto Nacional de Estadística, INE). Data is collected quarterly by surveying 160,000 people residing in 65,000 family homes. The standardized methodological sheet of the survey can be found at: https://www.ine.es/dyngs/IOE/es/operacion.htm?numinv=30308

The tables included in the package collect part of the information published by INE in: https://www.ine.es/dynt3/inebase/es/index.htm?padre=982&capsel=983

3 EPA in ESdata

ESdata contains the following tables with information about the EPA:

EPA tables in ESdata
table	content
epa_edad	EPA by age group
epa_form	EPA by education level
epa_nac	EPA by nationality
epa_sector	EPA by sector

The variables of each of the tables are:

Variables in each table
table	periodo	region	edad	form	nac	sector	dato	valor
epa_edad	✓	✓	✓				✓	✓
epa_form	✓	✓		✓			✓	✓
epa_nac	✓	✓			✓		✓	✓
epa_sector	✓	✓	✓			✓		✓

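For all tables, the values are found in the valor column, expressed in thousands of people. Where present, the dato column indicates which magnitude each value measures. The periodo variable indicates the quarter and year the data corresponds to, and the remaining variables are categorical and describe what each row refers to. The epa_edad table also includes a sexo column (total, mujeres, hombres), which is used in the examples of section 4.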

Content of the variables
variable	meaning
periodo	last day of the quarter the data refers to
region	ISO 3166-2 code of the region the data refers to (see table)
edad	age group
form	education level (see table)
nac	nationality (ES: Spanish, EX: foreign, UE: EU member, no_UE: non-EU member)
sector	sector of activity (agriculture, industry, construction, services)
dato	pob: population, act: active population, ocu: employed population, par: unemployed population, ina: inactive population
valor	in thousands of people
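
To make the long layout concrete, here is a minimal sketch that builds a small data frame with the same kind of columns and picks one magnitude through dato. The object name toy_epa is hypothetical and the numbers are invented for illustration; they are not EPA figures.

```r
library(dplyr)

# Toy table mimicking the long layout described above.
# The numbers are invented for illustration; they are NOT EPA data.
toy_epa <- data.frame(
  periodo = as.Date(rep(c("2022-09-30", "2022-12-31"), each = 3)),
  region  = "ES",
  dato    = rep(c("act", "ocu", "par"), times = 2),
  valor   = c(100, 90, 10, 102, 91, 11),
  stringsAsFactors = FALSE
)

# Each row holds one magnitude (dato) and its value (valor).
# Keeping dato == "par" returns the unemployed population per period.
parados <- toy_epa %>%
  filter(dato == "par") %>%
  select(periodo, valor)

parados
```

With the real tables, the same filter on dato should work after replacing toy_epa with, for instance, epa_edad and adding the conditions on the remaining categorical columns.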

The official name of each region (Autonomous Communities and Cities, and Spain as a whole) and its ISO 3166-2 code, which is the value used in the region variable, can be found in the ccaa_iso table:

Autonomous Communities and Cities of Spain
iso	nombres
ES	España
ES-AN	Andalucía
ES-AR	Aragón
ES-AS	Principado de Asturias
ES-IB	Illes Balears
ES-CN	Canarias
ES-CB	Cantabria
ES-CL	Castilla y León
ES-CM	Castilla - La Mancha
ES-CT	Catalunya
ES-VC	Comunidad Valenciana
ES-EX	Extremadura
ES-GA	Galicia
ES-MD	Comunidad de Madrid
ES-MC	Región de Murcia
ES-NC	Comunidad Foral de Navarra
ES-PV	País Vasco
ES-RI	La Rioja
ES-CE	Ceuta
ES-ML	Melilla
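
One possible use of this table is to attach the official names to an EPA table with a join. The sketch below uses two hypothetical stand-ins, toy_ccaa and toy_epa, and assumes that ccaa_iso has the columns iso and nombres shown above; the values in toy_epa are invented.

```r
library(dplyr)

# Hypothetical stand-in for two rows of ccaa_iso (column names as in the table above).
toy_ccaa <- data.frame(
  iso     = c("ES-CT", "ES-MD"),
  nombres = c("Catalunya", "Comunidad de Madrid"),
  stringsAsFactors = FALSE
)

# Invented values, for illustration only (NOT EPA data).
toy_epa <- data.frame(
  region = c("ES-MD", "ES-CT"),
  valor  = c(1, 2),
  stringsAsFactors = FALSE
)

# Match the region code against the iso column to add the official name.
con_nombres <- toy_epa %>%
  left_join(toy_ccaa, by = c("region" = "iso"))

con_nombres
```

left_join keeps every row of the left table in its original order, so regions without a match would simply get a missing name.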

The education levels of the form variable, as defined by the INE, are:

Education levels considered in the EPA
code	meaning
analf	Illiterate
prim_inic	Incomplete primary education
prim	Primary education
sec_1	First stage of secondary education and similar
sec_2_gen	Second stage of secondary education, general track
sec_2_voc	Second stage of secondary education, vocational track (includes post-secondary non-tertiary education)
ed_sup	Higher education
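
When plotting by education level it may help to turn form into a factor whose levels follow the order of the table above, since a character column would otherwise be sorted alphabetically. This is a sketch on an invented vector of codes; niveles_form is a hypothetical name.

```r
# Codes copied from the table above, in the order in which they are listed.
niveles_form <- c("analf", "prim_inic", "prim", "sec_1",
                  "sec_2_gen", "sec_2_voc", "ed_sup")

# Invented sample of codes, for illustration only.
x <- c("ed_sup", "prim", "sec_1", "prim")

# A factor with explicit levels sorts (and is plotted) in that order.
form_f <- factor(x, levels = niveles_form)

sort(form_f)
```

Any code that is not listed in levels becomes NA, so it is worth checking the distinct values of form in the real table before applying this.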

4 Examples of Use

Each table can be queried in many ways: we first filter the rows we need and then carry out the required analysis. This section presents some example visualizations of the EPA data.

4.1 Total Population

For example, to obtain the total population of Spain for each period (dato equal to pob) we can filter the epa_edad table:

epa_edad %>% filter(region=="ES" & edad=="total" & sexo=="total" & dato=="pob")

## # A tibble: 85 × 6
##    periodo    region edad  sexo  dato   valor
##    <date>     <chr>  <chr> <chr> <chr>  <dbl>
##  1 2023-01-31 ES     total total pob   47351.
##  2 2022-12-31 ES     total total pob   47207 
##  3 2022-09-30 ES     total total pob   47054.
##  4 2022-06-30 ES     total total pob   46934.
##  5 2022-01-31 ES     total total pob   46877.
##  6 2021-12-31 ES     total total pob   46849 
##  7 2021-09-30 ES     total total pob   46799.
##  8 2021-06-30 ES     total total pob   46816 
##  9 2021-01-31 ES     total total pob   46878.
## 10 2020-12-31 ES     total total pob   46910.
## # ℹ 75 more rows

We can then use ggplot to plot the evolution of the total population over time:

epa_edad %>% filter(region=="ES" & edad=="total" & sexo=="total" & dato=="pob") %>% 
  ggplot(aes(periodo, valor)) + 
  geom_line()

4.2 Evolution of the Population by Gender

This graph is built with the following operations:

Filtering the data with filter.
Reordering the levels of sexo with mutate, so that the legend lists them in the same order as the lines appear in the graph.
Defining a line plot with ggplot and geom_line().
Setting the chart title and the axis names with labs. The x-axis is self-explanatory, so its name is left empty.
Setting the color of each line with scale_color_manual.
Using a white background with theme_bw().
Placing the legend below the chart with legend.position="bottom".

epa_edad %>% 
  filter(region=="ES" & edad=="total" & dato=="pob" & sexo!="total") %>% 
  mutate(sexo=factor(sexo, levels = c("mujeres", "hombres"))) %>%
  ggplot(aes(periodo, valor, col=sexo)) + 
  geom_line() + 
  labs(title="Evolución de la población (España)", x="", y="pob. (miles)") +
  scale_color_manual(values = c("#FF0000", "#0000FF")) + 
  theme_bw() +
  theme(legend.position = "bottom")
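
The effect of the mutate step can be seen in isolation with base R. The short sketch below, on an invented vector, compares the default level order of a factor with the explicit order used in the code above.

```r
# Invented vector with the two categories used in the plot.
sexo <- c("hombres", "mujeres", "hombres")

# Default: levels are sorted alphabetically.
sexo_default <- factor(sexo)
levels(sexo_default)

# Explicit order, as in the mutate() call above.
sexo_f <- factor(sexo, levels = c("mujeres", "hombres"))
levels(sexo_f)
```

Since the values passed to scale_color_manual are unnamed, they are assigned in level order: with the explicit levels, the first color goes to mujeres and the second to hombres. A named vector such as c(mujeres = "#FF0000", hombres = "#0000FF") would be an alternative that does not depend on the level order.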

In scale_color_manual, the color of each line is defined in values using hexadecimal codes. More information on these codes can be found at:

https://www.rapidtables.com/web/color/RGB_Color.html

4.3 Total, Active and Inactive Population Aged 20 to 24 by Gender

Here we have two categorizations: one by type of population (total, active and inactive) and another by sex, for a specific age range. We use facet_grid to get one panel for each sex:

epa_edad %>% filter(region=="ES" & edad == "20-24" & sexo!="total" & dato %in% c("act", "pob", "ina")) %>% 
  mutate(dato = factor(dato, levels = c("pob", "act", "ina"))) %>%
  ggplot(aes(periodo, valor, col=dato)) + 
  geom_line() + 
  scale_color_manual(name="Población", values=c("#000000", "#0000FF", "#FF0000"), labels=c("total", "activa", "inactiva")) + 
  theme_bw() + 
  labs(title="Evolución actividad personas 20-24 años en España", x = "", y = "pob. (miles)")  + 
  theme(legend.position = "bottom") + 
  ylim(0, 1700) +
  facet_grid(. ~ sexo)

We observe that the active and inactive populations have a seasonal component, reflecting people who become active only in certain seasons of the year. We can remove this effect by keeping only the data for the third quarter of each year. Since periodo holds the last day of the quarter, the third quarter corresponds to month 9, which we select with the month function from the lubridate package.

library(lubridate)
epa_edad %>% filter(region=="ES" & edad == "20-24" & sexo!="total" & dato %in% c("act", "pob", "ina") & month(periodo) == 9) %>% 
  mutate(dato = factor(dato, levels = c("pob", "act", "ina"))) %>%
  ggplot(aes(periodo, valor, col=dato)) + 
  geom_line() + 
  scale_color_manual(name="Población", values=c("#000000", "#0000FF", "#FF0000"), labels=c("total", "activa", "inactiva")) + 
  theme_bw() + 
  labs(title="Evolución actividad personas 20-24 años en España (tercer trimestre)", x = "", y = "pob. (miles)")  + 
  theme(legend.position = "bottom") + 
  ylim(0, 1700) +
  facet_grid(. ~ sexo)
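
If adding lubridate is not desirable, the same filter can be sketched with base R, where format() extracts the month of a date as text. The data frame toy below is hypothetical and its values are invented.

```r
library(dplyr)

# Invented values, for illustration only; the dates are quarter ends.
toy <- data.frame(
  periodo = as.Date(c("2021-09-30", "2021-12-31", "2022-06-30", "2022-09-30")),
  valor   = c(10, 12, 11, 13)
)

# Base R alternative to lubridate::month(): "%m" gives the two-digit month.
q3 <- toy %>%
  filter(format(periodo, "%m") == "09")

q3
```

Note that format() returns a character string, so the comparison is against "09" rather than the number 9.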

4.4 Activity and Unemployment Rates

The package provides population counts, but we may also be interested in rates such as the activity rate and the unemployment rate. We can compute the activity rate with epa_edad as:

\[ \text{activity rate} = \frac{\text{act}}{\text{act} + \text{ina}} \]

We can compute the unemployment rate with epa_edad, epa_form and epa_nac as the unemployed population (par) divided by the active population (act):

\[ \text{unemployment rate} = \frac{\text{par}}{\text{act}}\]

To calculate these rates we use pivot_wider and pivot_longer from tidyr. First we use pivot_wider to generate one column for each value of dato, and then mutate to calculate the rates:

epa_edad %>% filter(region=="ES" & edad=="total" & sexo=="total") %>%
  select(periodo, dato, valor) %>%
  pivot_wider(names_from = "dato", values_from = "valor") %>%
  mutate(t_act=act/(act+ina), t_paro=par/act)

## # A tibble: 85 × 8
##    periodo       pob    act    ocu   par    ina t_act t_paro
##    <date>      <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl>
##  1 2023-01-31 47351. 23580. 20453. 3128. 16694. 0.586  0.133
##  2 2022-12-31 47207  23488. 20464. 3024  16649. 0.585  0.129
##  3 2022-09-30 47054. 23526. 20546. 2980. 16443. 0.589  0.127
##  4 2022-06-30 46934. 23387. 20468  2919. 16446. 0.587  0.125
##  5 2022-01-31 46877. 23259. 20085. 3175. 16502. 0.585  0.136
##  6 2021-12-31 46849  23289. 20185. 3104. 16418. 0.587  0.133
##  7 2021-09-30 46799. 23448. 20031  3417. 16202. 0.591  0.146
##  8 2021-06-30 46816  23216. 19672. 3544. 16418. 0.586  0.153
##  9 2021-01-31 46878. 22861. 19207. 3654. 16767. 0.577  0.160
## 10 2020-12-31 46910. 23064. 19344. 3720. 16571. 0.582  0.161
## # ℹ 75 more rows
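
As a quick check of the t_paro column, the first row of the output above can be reproduced by hand from the displayed values par = 3128 and act = 23580, which the tibble prints rounded:

\[ \text{t\_paro} = \frac{\text{par}}{\text{act}} = \frac{3128}{23580} \approx 0.1327 \]

This agrees with the 0.133 shown in the table, up to the rounding of the printed values.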

If we want to plot the rates, we put the data back in long format and keep only the activity and unemployment rates:

epa_edad %>% filter(region=="ES" & edad=="total" & sexo=="total") %>%
  pivot_wider(names_from = "dato", values_from = "valor") %>%
  mutate(t_act=act/(act+ina), t_paro=par/act) %>% 
  pivot_longer(cols = pob:t_paro, names_to = "dato", values_to = "valor") %>% 
  filter(dato %in% c("t_paro", "t_act"))

## # A tibble: 170 × 6
##    periodo    region edad  sexo  dato   valor
##    <date>     <chr>  <chr> <chr> <chr>  <dbl>
##  1 2023-01-31 ES     total total t_act  0.586
##  2 2023-01-31 ES     total total t_paro 0.133
##  3 2022-12-31 ES     total total t_act  0.585
##  4 2022-12-31 ES     total total t_paro 0.129
##  5 2022-09-30 ES     total total t_act  0.589
##  6 2022-09-30 ES     total total t_paro 0.127
##  7 2022-06-30 ES     total total t_act  0.587
##  8 2022-06-30 ES     total total t_paro 0.125
##  9 2022-01-31 ES     total total t_act  0.585
## 10 2022-01-31 ES     total total t_paro 0.136
## # ℹ 160 more rows
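
Because this pipeline is reused in the following plots, one option is to wrap it in a small function. The sketch below is an assumption about how that could look, not part of the package: tasas_largas is a hypothetical helper, it pivots only the two rate columns so that no later filter is needed, and it is shown on invented values.

```r
library(dplyr)
library(tidyr)

# Invented values in the long layout, for illustration only (NOT EPA data).
toy <- data.frame(
  periodo = as.Date(rep(c("2022-09-30", "2022-12-31"), each = 3)),
  dato    = rep(c("act", "ina", "par"), times = 2),
  valor   = c(60, 40, 6, 50, 50, 10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: long -> wide, compute the rates,
# then back to long keeping only the two rates.
tasas_largas <- function(df) {
  df %>%
    pivot_wider(names_from = "dato", values_from = "valor") %>%
    mutate(t_act = act / (act + ina), t_paro = par / act) %>%
    pivot_longer(cols = c(t_act, t_paro), names_to = "dato", values_to = "valor") %>%
    select(periodo, dato, valor)
}

tasas <- tasas_largas(toy)
tasas
```

The input must already be filtered so that the remaining columns identify one row per period, as done with region, edad and sexo in the code above.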

Now we can plot the data. We use scale_y_continuous to show the values as percentages:

epa_edad %>% filter(region=="ES" & edad=="total" & sexo=="total") %>%
  pivot_wider(names_from = "dato", values_from = "valor") %>%
  mutate(t_act=act/(act+ina), t_paro=par/act) %>% 
  pivot_longer(cols = pob:t_paro, names_to = "dato", values_to = "valor") %>% 
  filter(dato %in% c("t_paro", "t_act")) %>%
  ggplot(aes(periodo, valor, col=dato)) +
  geom_line() +
  scale_color_manual(name="tasas", labels=c("actividad", "paro"), values=c("#009900", "#FF0000")) +
  labs(title = "Tasas de actividad y paro en España", x="", y="tasa") + 
  scale_y_continuous(labels = scales::percent, limits=c(0,0.65)) +
  theme_bw() + 
  theme(legend.position = "bottom")
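
The labels argument accepts any function that turns numbers into text. As a small sketch of what the percent labelling does, the scales package also offers label_percent(), which builds such a function with a chosen precision; here it is applied to two rates taken from the first row of the table in this section.

```r
library(scales)

# label_percent() returns a labelling function;
# accuracy = 0.1 rounds the percentage to one decimal place.
fmt <- label_percent(accuracy = 0.1)

etiquetas <- fmt(c(0.133, 0.586))
etiquetas
```

The same function could be passed to the plot as scale_y_continuous(labels = label_percent(accuracy = 0.1), limits = c(0, 0.65)) if one decimal is wanted on the axis.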

4.5 Annotations

We can add comments to the chart using the annotate function, and mark specific dates with vertical lines using geom_vline:

epa_edad %>% filter(region=="ES" & edad=="total" & sexo=="total") %>%
  pivot_wider(names_from = "dato", values_from = "valor") %>%
  mutate(t_act=act/(act+ina), t_paro=par/act) %>% 
  pivot_longer(cols = pob:t_paro, names_to = "dato", values_to = "valor") %>% 
  filter(dato %in% c("t_paro", "t_act")) %>%
  ggplot(aes(periodo, valor, col=dato)) +
  geom_line() +
  scale_color_manual(name="tasas", labels=c("actividad", "paro"), values=c("#009900", "#FF0000")) +
  labs(title = "Tasas de actividad y paro en España", x="", y="tasa") + 
  geom_vline(xintercept = as.Date("2008-12-31"), linetype="dashed", linewidth=0.25) +
  annotate("text", x=as.Date("2010-06-30"), y=0.4, label="crisis del 2008", hjust="left") +
  annotate("curve", x=as.Date("2010-06-15"), y =0.4, xend=as.Date("2008-12-31"), yend=0.35, curvature=0.2, arrow = arrow(length = unit(2, "mm"))) +
  scale_y_continuous(labels = scales::percent, limits=c(0,0.65)) +
  theme_bw() + 
  theme(legend.position = "bottom")

This graph shows how the 2008 crisis led to an increase in the unemployment rate, while the activity rate remained largely unaffected.

5 Improvement Proposals

If you have any proposals or have detected any errors, you can make a pull request at: https://github.com/jmsallan/ESdata

ESdata Package: the Active Population Survey

Jose M Sallan

04/05/2023