ESdata is an R package that provides access to statistical
information about Spain structured as tidy data.
If you have previously installed the devtools package, you
can install the package from GitHub with:
devtools::install_github("jmsallan/ESdata")
Once the package is installed, we can access the data by doing:
library(ESdata)
The package provides a set of data frames, and its only
requirement is R version 3.5.0 or higher. A convenient way
to explore this data is to use the tidyverse packages:
library(tidyverse)
In its current version, the package has information related to:
In this document we will show the information available on the Active Population Survey (EPA).
The Active Population Survey, in Spanish Encuesta de Población Activa (EPA), is a statistical study designed to capture data on the labor market, following the standards of the International Labor Organization (ILO). Because unemployment is one of the most pressing problems of the Spanish economy, the quarterly publication of the EPA results receives wide media attention.
In Spain, the EPA is carried out by the National Statistics Institute, in Spanish Instituto Nacional de Estadística (INE). Data is collected quarterly, surveying 160,000 people residing in 65,000 family homes. The standardized methodological sheet of the survey can be found at: https://www.ine.es/dyngs/IOE/es/operacion.htm?numinv=30308
The tables included in the package collect part of the information published by INE in: https://www.ine.es/dynt3/inebase/es/index.htm?padre=982&capsel=983
ESdata contains the following tables with information
about the EPA:
| table | content |
|---|---|
| epa_edad | EPA by age group |
| epa_form | EPA by education level |
| epa_nac | EPA by nationality |
| epa_sector | EPA by sector |
The variables of each of the tables are:
| table | periodo | region | edad | form | nac | sector | dato | valor |
|---|---|---|---|---|---|---|---|---|
| epa_edad | ✓ | ✓ | ✓ | | | | ✓ | ✓ |
| epa_form | ✓ | ✓ | | ✓ | | | ✓ | ✓ |
| epa_nac | ✓ | ✓ | | | ✓ | | ✓ | ✓ |
| epa_sector | ✓ | ✓ | | | | ✓ | ✓ | ✓ |
For all tables, the values are found in the valor column,
expressed in thousands of people, and the dato column indicates
which magnitude each value measures. The periodo variable
indicates which quarter and year the data corresponds to, and the rest
of the variables are categorical and describe the group each
row refers to.
| variable | meaning |
|---|---|
| periodo | last day of the quarter of the data point |
| region | ISO 3166-2 code of the region of the data point (see table) |
| edad | age group |
| form | education level (see table) |
| nac | nationality (ES: Spanish, EX: foreign, UE: EU member, no_UE: non-EU member) |
| sector | sector of activity (agriculture, industry, construction, services) |
| dato | pob: population, act: active population, ocu: employed population, par: unemployed population, ina: inactive population |
| valor | in thousands of people |
The official name of each one of the regions (Autonomous Communities
and Cities, and Spain as a whole), and its ISO 3166-2 code, which is the
one that appears in the region variable, can be found in
the ccaa_iso table:
| iso | nombres |
|---|---|
| ES | España |
| ES-AN | Andalucía |
| ES-AR | Aragón |
| ES-AS | Principado de Asturias |
| ES-IB | Illes Balears |
| ES-CN | Canarias |
| ES-CB | Cantabria |
| ES-CL | Castilla y León |
| ES-CM | Castilla - La Mancha |
| ES-CT | Catalunya |
| ES-VC | Comunidad Valenciana |
| ES-EX | Extremadura |
| ES-GA | Galicia |
| ES-MD | Comunidad de Madrid |
| ES-MC | Región de Murcia |
| ES-NC | Comunidad Foral de Navarra |
| ES-PV | País Vasco |
| ES-RI | La Rioja |
| ES-CE | Ceuta |
| ES-ML | Melilla |
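The region codes can be translated into names by joining an EPA table with ccaa_iso. The following is a minimal sketch, not taken from the package documentation: it assumes that the lookup table has the columns iso and nombres shown above, and it uses two small stand-in tibbles (ccaa_demo and codes, both hypothetical names) so that it runs without the package installed.

```r
library(dplyr)

# Stand-in for a few rows of ccaa_iso, copied from the table above
# (assumption: the real table uses the column names iso and nombres)
ccaa_demo <- tibble(
  iso = c("ES", "ES-CT", "ES-MD"),
  nombres = c("España", "Catalunya", "Comunidad de Madrid")
)

# Stand-in for the region column of an EPA table
codes <- tibble(region = c("ES-MD", "ES", "ES-CT"))

# left_join keeps every row of codes and adds the matching name
named <- codes %>%
  left_join(ccaa_demo, by = c("region" = "iso"))

named
```

Under the same assumption about column names, the join on the real data would read epa_edad %>% left_join(ccaa_iso, by = c("region" = "iso")).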
The education levels of the form variable, as defined by
the INE, are:
| code | meaning |
|---|---|
| analf | Illiterate |
| prim_inic | Incomplete primary education |
| prim | Primary education |
| sec_1 | First stage of secondary education and similar |
| sec_2_gen | Second stage of secondary education, general orientation |
| sec_2_voc | Second stage of secondary education, vocational orientation (includes post-secondary non-tertiary education) |
| ed_sup | Higher education |
Each of the tables can be queried in multiple ways: we first filter the rows we need, and then carry out the required analysis. In this section we present some examples of visualizations of the EPA data.
For example, if we want to obtain the total population of Spain for
each period we can use the epa_edad table by doing:
epa_edad %>% filter(region=="ES" & edad=="total" & sexo=="total" & dato=="pob")
## # A tibble: 85 × 6
## periodo region edad sexo dato valor
## <date> <chr> <chr> <chr> <chr> <dbl>
## 1 2023-01-31 ES total total pob 47351.
## 2 2022-12-31 ES total total pob 47207
## 3 2022-09-30 ES total total pob 47054.
## 4 2022-06-30 ES total total pob 46934.
## 5 2022-01-31 ES total total pob 46877.
## 6 2021-12-31 ES total total pob 46849
## 7 2021-09-30 ES total total pob 46799.
## 8 2021-06-30 ES total total pob 46816
## 9 2021-01-31 ES total total pob 46878.
## 10 2020-12-31 ES total total pob 46910.
## # ℹ 75 more rows
We can then use ggplot to plot the evolution of the total population over time:
epa_edad %>% filter(region=="ES" & edad=="total" & sexo=="total" & dato=="pob") %>%
ggplot(aes(periodo, valor)) +
geom_line()
In the next graph we perform the following operations:
- Filter the rows with filter.
- Convert sexo to a factor with mutate to show the levels in the same order that they appear in the graph.
- Draw the lines with ggplot and geom_line().
- Set the title and axis labels with labs. Being self-explanatory, the x-axis is left without a name.
- Set the color of each line with scale_color_manual.
- Apply the theme theme_bw().
- Place the legend below the graph with legend.position="bottom".

epa_edad %>%
filter(region=="ES" & edad=="total" & sexo!="total" & dato=="pob") %>%
mutate(sexo=factor(sexo, levels = c("mujeres", "hombres"))) %>%
ggplot(aes(periodo, valor, col=sexo)) +
geom_line() +
labs(title="Evolución de la población (España)", x="", y="pob. (miles)") +
scale_color_manual(values = c("#FF0000", "#0000FF")) +
theme_bw() +
theme(legend.position = "bottom")
In scale_color_manual the colors of each line are
defined in values using a hexadecimal format. Information
on these hexadecimal codes can be obtained from:
Here we have two categorizations: one by type of
population (total, active and inactive) and another by sex, for a
specific age range. We use facet_grid to get a graph for
each sex:
epa_edad %>% filter(region=="ES" & edad == "20-24" & sexo!="total" & dato %in% c("act", "pob", "ina")) %>%
mutate(dato = factor(dato, levels = c("pob", "act", "ina"))) %>%
ggplot(aes(periodo, valor, col=dato)) +
geom_line() +
scale_color_manual(name="Población", values=c("#000000", "#0000FF", "#FF0000"), labels=c("total", "activa", "inactiva")) +
theme_bw() +
labs(title="Evolución actividad personas 20-24 años en España", x = "", y = "pob. (miles)") +
theme(legend.position = "bottom") +
ylim(0, 1700) +
facet_grid(. ~ sexo)
We observe that the active and inactive populations have a seasonal
component, reflecting the population that becomes active in certain
seasons of the year. We can offset this effect by retaining only the
data for the third quarter of each year. To do this we use the
month function from the lubridate package.
library(lubridate)
epa_edad %>% filter(region=="ES" & edad == "20-24" & sexo!="total" & dato %in% c("act", "pob", "ina") & month(periodo) == 9) %>%
mutate(dato = factor(dato, levels = c("pob", "act", "ina"))) %>%
ggplot(aes(periodo, valor, col=dato)) +
geom_line() +
scale_color_manual(name="Población", values=c("#000000", "#0000FF", "#FF0000"), labels=c("total", "activa", "inactiva")) +
theme_bw() +
labs(title="Evolución actividad personas 20-24 años en España (tercer trimestre)", x = "", y = "pob. (miles)") +
theme(legend.position = "bottom") +
ylim(0, 1700) +
facet_grid(. ~ sexo)
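The third-quarter filter can be tried on a handful of rows without loading the package. The sketch below uses the Spain-wide active population values that appear, rounded, in the epa_edad output shown later in this document (not the 20-24 series plotted above). It also shows an alternative that the text does not use: averaging the four quarters of each year. The names act_es, q3, yearly, yr and act_mean are hypothetical.

```r
library(dplyr)
library(lubridate)

# Spain-wide active population (thousands), rounded as displayed in
# this document; dates are copied as they appear in the output
act_es <- tibble(
  periodo = as.Date(c("2022-12-31", "2022-09-30", "2022-06-30", "2022-01-31",
                      "2021-12-31", "2021-09-30", "2021-06-30", "2021-01-31")),
  act = c(23488, 23526, 23387, 23259, 23289, 23448, 23216, 22861)
)

# Option 1: keep only the rows whose period ends in September
q3 <- act_es %>%
  filter(month(periodo) == 9)

# Option 2 (alternative, not used in the text): yearly average
yearly <- act_es %>%
  group_by(yr = year(periodo)) %>%
  summarise(act_mean = mean(act))

q3
yearly
```

Keeping a single quarter preserves observed values, while the yearly average smooths them; which one is preferable depends on the question being asked.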
The data provided by the package are absolute population counts, and we
may be interested in having data on activity rates and the unemployment
rate. We can compute the activity rate with epa_edad by
doing:
\[ \text{activity rate} = \frac{\text{act}}{\text{act} + \text{ina}} \]
We can evaluate the unemployment rate with epa_edad,
epa_form and epa_nac by doing:
\[ \text{unemployment rate} = \frac{\text{par}}{\text{act}}\]
To calculate these rates we use pivot_wider and
pivot_longer from tidyr. First we use
pivot_wider to generate a column for each of the
dato values and use mutate to calculate the rates:
epa_edad %>% filter(region=="ES" & edad=="total" & sexo=="total") %>%
select(periodo, dato, valor) %>%
pivot_wider(names_from = "dato", values_from = "valor") %>%
mutate(t_act=act/(act+ina), t_paro=par/act)
## # A tibble: 85 × 8
## periodo pob act ocu par ina t_act t_paro
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2023-01-31 47351. 23580. 20453. 3128. 16694. 0.586 0.133
## 2 2022-12-31 47207 23488. 20464. 3024 16649. 0.585 0.129
## 3 2022-09-30 47054. 23526. 20546. 2980. 16443. 0.589 0.127
## 4 2022-06-30 46934. 23387. 20468 2919. 16446. 0.587 0.125
## 5 2022-01-31 46877. 23259. 20085. 3175. 16502. 0.585 0.136
## 6 2021-12-31 46849 23289. 20185. 3104. 16418. 0.587 0.133
## 7 2021-09-30 46799. 23448. 20031 3417. 16202. 0.591 0.146
## 8 2021-06-30 46816 23216. 19672. 3544. 16418. 0.586 0.153
## 9 2021-01-31 46878. 22861. 19207. 3654. 16767. 0.577 0.160
## 10 2020-12-31 46910. 23064. 19344. 3720. 16571. 0.582 0.161
## # ℹ 75 more rows
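As a sanity check, the rates in the second row of the output above (2022-12-31) can be reproduced by hand from the displayed, rounded values:

\[ \text{activity rate} = \frac{23488}{23488 + 16649} = \frac{23488}{40137} \approx 0.585 \]
\[ \text{unemployment rate} = \frac{3024}{23488} \approx 0.129 \]

In the same row, ocu + par = 20464 + 3024 = 23488 matches act up to rounding, while act + ina = 40137 is well below pob = 47207. That gap is presumably the population below working age, which would explain why the activity rate is computed over act + ina and not over pob; this reading is an inference from the figures, not something stated in this document.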
If we want to graph the rates, we put the data in long format again, and select only the activity and unemployment rates:
epa_edad %>% filter(region=="ES" & edad=="total" & sexo=="total") %>%
pivot_wider(names_from = "dato", values_from = "valor") %>%
mutate(t_act=act/(act+ina), t_paro=par/act) %>%
pivot_longer(cols = pob:t_paro, names_to = "dato", values_to = "valor") %>%
filter(dato %in% c("t_paro", "t_act"))
## # A tibble: 170 × 6
## periodo region edad sexo dato valor
## <date> <chr> <chr> <chr> <chr> <dbl>
## 1 2023-01-31 ES total total t_act 0.586
## 2 2023-01-31 ES total total t_paro 0.133
## 3 2022-12-31 ES total total t_act 0.585
## 4 2022-12-31 ES total total t_paro 0.129
## 5 2022-09-30 ES total total t_act 0.589
## 6 2022-09-30 ES total total t_paro 0.127
## 7 2022-06-30 ES total total t_act 0.587
## 8 2022-06-30 ES total total t_paro 0.125
## 9 2022-01-31 ES total total t_act 0.585
## 10 2022-01-31 ES total total t_paro 0.136
## # ℹ 160 more rows
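Because the same pivot_wider, mutate and pivot_longer steps are repeated in several examples of this section, they could be wrapped in a small function. The sketch below is one possible way to do it: epa_rates is a hypothetical helper, not part of ESdata, and the toy tibble reuses the rounded 2022-12-31 values from the earlier output so that the block runs without the package.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not part of ESdata): takes a long table with
# dato and valor columns and returns only the two rates, in long format
epa_rates <- function(df) {
  df %>%
    pivot_wider(names_from = "dato", values_from = "valor") %>%
    mutate(t_act = act / (act + ina), t_paro = par / act) %>%
    # drop the count columns, keeping only identifiers and rates
    select(-any_of(c("pob", "act", "ocu", "par", "ina"))) %>%
    pivot_longer(cols = c(t_act, t_paro), names_to = "dato", values_to = "valor")
}

# Toy input: one period for Spain, values rounded as displayed above
toy <- tibble(
  periodo = as.Date("2022-12-31"),
  region = "ES",
  dato = c("pob", "act", "ocu", "par", "ina"),
  valor = c(47207, 23488, 20464, 3024, 16649)
)

rates <- epa_rates(toy)
rates
```

With the package loaded, a call such as epa_edad %>% filter(region=="ES" & edad=="total" & sexo=="total") %>% epa_rates() would be expected to return the same rows as the pipeline above.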
And now we can graph the data. We use scale_y_continuous
to put the values in percentage:
epa_edad %>% filter(region=="ES" & edad=="total" & sexo=="total") %>%
pivot_wider(names_from = "dato", values_from = "valor") %>%
mutate(t_act=act/(act+ina), t_paro=par/act) %>%
pivot_longer(cols = pob:t_paro, names_to = "dato", values_to = "valor") %>%
filter(dato %in% c("t_paro", "t_act")) %>%
ggplot(aes(periodo, valor, col=dato)) +
geom_line() +
scale_color_manual(name="tasas", labels=c("actividad", "paro"), values=c("#009900", "#FF0000")) +
labs(title = "Tasas de actividad y paro en España", x="", y="tasa") +
scale_y_continuous(labels = scales::percent, limits=c(0,0.65)) +
theme_bw() +
theme(legend.position = "bottom")
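The labels argument of scale_y_continuous accepts a function that turns numbers into strings. As a small sketch of what the percent formatting does, the block below applies scales::label_percent to two of the rates shown earlier. The accuracy value of 0.1 is an arbitrary choice for illustration; the plot above passes scales::percent directly.

```r
library(scales)

# Build a formatter that shows one decimal place
fmt <- label_percent(accuracy = 0.1)

# Activity and unemployment rates for 2022-12-31, as displayed earlier
formatted <- fmt(c(0.585, 0.129))
formatted
```

Passing this formatter as labels = fmt in scale_y_continuous would label the axis with one decimal place.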
We can add annotations to the chart using the annotate
function, and mark values with vertical lines using
geom_vline:
epa_edad %>% filter(region=="ES" & edad=="total" & sexo=="total") %>%
pivot_wider(names_from = "dato", values_from = "valor") %>%
mutate(t_act=act/(act+ina), t_paro=par/act) %>%
pivot_longer(cols = pob:t_paro, names_to = "dato", values_to = "valor") %>%
filter(dato %in% c("t_paro", "t_act")) %>%
ggplot(aes(periodo, valor, col=dato)) +
geom_line() +
scale_color_manual(name="tasas", labels=c("actividad", "paro"), values=c("#009900", "#FF0000")) +
labs(title = "Tasas de actividad y paro en España", x="", y="tasa") +
geom_vline(xintercept = as.Date("2008-12-31"), linetype="dashed", linewidth=0.25) +
annotate("text", x=as.Date("2010-06-30"), y=0.4, label="crisis del 2008", hjust="left") +
annotate("curve", x=as.Date("2010-06-15"), y =0.4, xend=as.Date("2008-12-31"), yend=0.35, curvature=0.2, arrow = arrow(length = unit(2, "mm"))) +
scale_y_continuous(labels = scales::percent, limits=c(0,0.65)) +
theme_bw() +
theme(legend.position = "bottom")
In this graph, we show how the 2008 crisis led to an increase in the unemployment rate, without affecting the activity rate.
If you have any proposals or have detected any errors, you can make a pull request at: https://github.com/jmsallan/ESdata