Retrospective survey harmonization comes with many challenges, as we have shown in the introduction to this tutorial case study. In this example, we will work with Eurobarometer’s data.
Please use the development version of retroharmonize:
devtools::install_github("antaldaniel/retroharmonize")
## Warning: package 'dplyr' was built under R version 4.0.4
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Warning: package 'lubridate' was built under R version 4.0.4
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
retroharmonize
is not associated with Eurobarometer, its creators (Kantar), or its archivists (GESIS). We assume that you have acquired the necessary files from GESIS after carefully reading their terms and that you have placed them on a path that you call gesis_dir. The precise documentation of the data we use can be found in this supporting blogpost. To reproduce this blogpost, you will need ZA5877_v2-0-0.sav
, ZA6595_v3-0-0.sav
, ZA6861_v1-2-0.sav
, ZA7488_v1-0-0.sav
, ZA7572_v1-0-0.sav
in a directory that you have named gesis_dir
.
# Not run in the blogpost. In the repo we have a saved version.
climate_change_files <- c("ZA5877_v2-0-0.sav", "ZA6595_v3-0-0.sav", "ZA6861_v1-2-0.sav",
"ZA7488_v1-0-0.sav", "ZA7572_v1-0-0.sav")
eb_waves <- read_surveys(file.path(gesis_dir, climate_change_files), .f='read_spss')
if (dir.exists("data-raw")) {
save ( eb_waves, file = file.path("data-raw", "eb_climate_change_waves.rda") )
}
if ( file.exists( file.path("data-raw", "eb_climate_change_waves.rda") )) {
load (file.path( "data-raw", "eb_climate_change_waves.rda" ) )
} else {
load (file.path("..", "..", "data-raw", "eb_climate_change_waves.rda") )
}
The eb_waves
nested list contains five surveys imported from SPSS to the survey class of retroharmonize. The survey class is a data.frame that retains important metadata for further harmonization.
document_waves (eb_waves)
## # A tibble: 5 x 5
## id filename ncol nrow object_size
## <chr> <chr> <int> <int> <dbl>
## 1 ZA5877_v2-0-0 ZA5877_v2-0-0.sav 604 27919 139352456
## 2 ZA6595_v3-0-0 ZA6595_v3-0-0.sav 519 27718 119370440
## 3 ZA6861_v1-2-0 ZA6861_v1-2-0.sav 657 27901 151397528
## 4 ZA7488_v1-0-0 ZA7488_v1-0-0.sav 752 27339 169465928
## 5 ZA7572_v1-0-0 ZA7572_v1-0-0.sav 348 27655 80562432
Beware of the object sizes. If you work with many surveys, memory-efficient programming becomes imperative. We will be subsetting whenever possible.
As noted before, be prepared to work with nested lists. Each imported survey is nested as a data frame in the eb_waves
list.
Eurobarometer refers to certain metadata elements, like interviewee cooperation level or the date of a survey interview as protocol variables. Let’s start here. This will be our template to harmonize more and more aspects of the five surveys (which are, in fact, already harmonizations of about 30 surveys conducted in a single ‘wave’ in multiple countries.)
# select variables of interest from the metadata
eb_protocol_metadata <- eb_climate_metadata %>%
filter ( .data$label_orig %in% c("date of interview") |
.data$var_name_orig == "rowid") %>%
suggest_var_names( survey_program = "eurobarometer" )
# subset and harmonize these variables in all nested list items of 'waves' of surveys
interview_dates <- harmonize_var_names(eb_waves,
eb_protocol_metadata )
# apply similar data processing rules to same variables
interview_dates <- lapply (interview_dates,
function (x) x %>% mutate ( date_of_interview = as_character(.data$date_of_interview) )
)
# join the individual survey tables into a single table
interview_dates <- as_tibble ( Reduce (rbind, interview_dates) )
# Check the variable classes.
vapply(interview_dates, function(x) class(x)[1], character(1))
## rowid date_of_interview
## "character" "character"
This is our sample workflow for each block of variables.
tibble
object.Now finish the harmonization. Wednesday, 31st October 2018
should become a Date type 2018-10-31
.
require(lubridate)
harmonize_date <- function(x) {
x <- tolower(as.character(x))
x <- gsub("monday|tuesday|wednesday|thursday|friday|saturday|sunday|\\,|th|nd|rd|st", "", x)
x <- gsub("decemberber", "december", x) # all those annoying real-life data problems!
x <- stringr::str_trim (x, "both")
x <- gsub("^0", "", x )
x <- gsub("\\s\\s", "\\s", x)
lubridate::dmy(x)
}
interview_dates <- interview_dates %>%
mutate ( date_of_interview = harmonize_date(.data$date_of_interview) )
vapply(interview_dates, function(x) class(x)[1], character(1))
## rowid date_of_interview
## "character" "Date"
To avoid duplication of row IDs in surveys that may not be unique in different surveys, we created a simple, sequential ID for each survey, including the ID of the original file.
set.seed(2021)
sample_n(interview_dates, 6)
## # A tibble: 6 x 2
## rowid date_of_interview
## <chr> <date>
## 1 ZA7488_v1-0-0_7016 2018-10-28
## 2 ZA7488_v1-0-0_19187 2018-11-02
## 3 ZA6861_v1-2-0_1218 2017-03-18
## 4 ZA6861_v1-2-0_4142 2017-03-21
## 5 ZA7572_v1-0-0_12363 2019-04-17
## 6 ZA7572_v1-0-0_8071 2019-04-18
After this type-conversion problem let’s see an issue when an original SPSS variable can have two meaningful R representations.
Let’s continue with harmonizing geographical information in the files. In this example, var_name_suggested
will contain the harmonized variable name. It is likely that you will have to make this call after carefully reading the original questionnaires and codebooks.
eb_regional_metadata <- eb_climate_metadata %>%
filter ( grepl( "rowid|isocntry|^nuts$|^p7", .data$var_name_orig) ) %>%
suggest_var_names( survey_program = "eurobarometer" )
harmonize_var_names()
takes all variables in the subsetted, geographical metadata table, and brings them to the harmonized var_name_suggested
name. The function subsets the surveys to avoid the presence of non-harmonized variables. All regional NUTS codes become geo
in our case:
geography <- harmonize_var_names(eb_waves,
eb_regional_metadata)
If you are used to working with single survey files, you are likely to work in a tabular format, which easily converts into a data.frame-like object. In our example, we use tidyverse’s tibble
. However, when working with longitudinal data, it is far simpler to work with nested lists, because tables usually have different dimensions (neither the rows corresponding to observations or the columns are the same across all survey files).
In the nested list, each list element is a single, tabular-format survey. In fact, the surveys are in retroharmonize’s survey class, which is a rich tibble that contains the metadata and the processing history of the survey.
The regional information in the Eurobarometer files is contained in the nuts
variable. We want to keep both the original labels and values. The original values are the region’s codes, and the labels are the names. The easiest and fastest solution is the base R lapply
loop.
# The character representation keeps the original SPSS coding.
geography_char <- lapply ( geography,
function (x) x %>% mutate( across(everything(), as.character))
)
# The label representation keeps the original SPSS labeling.
geography_label <- lapply ( geography,
function (x) x %>% mutate( across(everything(), as_character ))
)
Because each table has exactly the same columns, we can simply use rbind()
and reduce the list to a modern data.frame
, i.e. a tibble
.
## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_finland_nuts_2_2010", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_cyprus_republic", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes")
## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_greece_nuts_1", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_finland_nuts_2_2010", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes", "region_ireland_nuts_3", "region_estonia_nuts_3", "region_croatia_nuts_3", "region_croatia_nuts_2")
## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_greece_nuts_1", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes", "region_ireland_nuts_3", "region_estonia_nuts_3", "region_croatia_nuts_3", "region_croatia_nuts_2", "regions_all_countries", "region_cyprus")
## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes", "region_ireland_nuts_3", "region_greece_nuts_1_mod", "region_estonia_nuts_3", "region_croatia_nuts_3", "region_croatia_nuts_2", "regions_all_countries", "region_cyprus", "region_finland_nuts_2")
Let’s see a dozen cases:
set.seed(2021) # to see the same sample
sample_n(geography_chars, 12)
## # A tibble: 12 x 57
## rowid isocntry region_france_nut~ region_france_nu~ region_belgium_nu~
## <chr> <chr> <chr> <chr> <chr>
## 1 ZA7488_v1-0~ SI 99 99 99
## 2 ZA7488_v1-0~ PL 99 99 99
## 3 ZA6861_v1-2~ DK 99 99 99
## 4 ZA6861_v1-2~ FI 99 99 99
## 5 ZA7572_v1-0~ SE 99 99 99
## 6 ZA7572_v1-0~ IT 99 99 99
## 7 ZA6861_v1-2~ IE 99 99 99
## 8 ZA6861_v1-2~ RO 99 99 99
## 9 ZA7488_v1-0~ CY 99 99 99
## 10 ZA6595_v3-0~ HR 99 99 99
## 11 ZA7572_v1-0~ CZ 99 99 99
## 12 ZA6861_v1-2~ PT 99 99 99
## # ... with 52 more variables: region_belgium_nuts_1 <chr>,
## # region_netherlands_nuts_2 <chr>, region_netherlands_nuts_1 <chr>,
## # region_germany_nuts_1 <chr>, region_italy_nuts_2_mod <chr>,
## # region_italy_nuts_1 <chr>, region_italy_nuts_1_mod <chr>,
## # region_luxembourg <chr>, region_denmark_nuts_2 <chr>, region_ireland <chr>,
## # region_united_kingdom_nuts_1 <chr>, region_united_kingdom_nuts_1_mod <chr>,
## # region_greece_nuts_2 <chr>, region_greece_nuts_1 <chr>,
## # region_spain_nuts_2 <chr>, region_spain_nuts_1 <chr>,
## # region_spain_nuts_1_mod <chr>, region_portugal_nuts_2 <chr>,
## # region_finland_nuts_3 <chr>, region_finland_nuts_2_2003_2006 <chr>,
## # region_finland_nuts_2_2010 <chr>, region_sweden_nuts_2 <chr>,
## # region_sweden_nuts_1 <chr>, region_austria_nuts_2 <chr>,
## # region_austria_nuts_1 <chr>, region_cyprus_republic <chr>,
## # region_czech_republic_nuts_2 <chr>, region_estonia <chr>,
## # region_hungary_nuts_2 <chr>, region_hungary_nuts_1 <chr>,
## # region_latvia_nuts_3 <chr>, region_lithuania_nuts_3 <chr>,
## # region_poland_nuts_2 <chr>, region_poland_nuts_1 <chr>,
## # region_slovakia_nuts_2 <chr>, region_slovenia_nuts_3 <chr>,
## # region_slovenia_nuts_2 <chr>, region_bulgaria_nuts_3 <chr>,
## # region_bulgaria_nuts_2 <chr>, region_romania_nuts_2 <chr>,
## # region_romania_nuts_1 <chr>, region_croatia <chr>, region_nuts_codes <chr>,
## # region_ireland_nuts_3 <chr>, region_ireland_nuts_2 <chr>,
## # region_greece_nuts_1_mod <chr>, region_estonia_nuts_3 <chr>,
## # region_croatia_nuts_3 <chr>, region_croatia_nuts_2 <chr>,
## # regions_all_countries <chr>, region_cyprus <chr>,
## # region_finland_nuts_2 <chr>
The idea is that we do similar variable harmonization block by block, and eventually we will join them together. Next step: socio-demography and weights.
## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_finland_nuts_2_2010", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_cyprus_republic", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes")
## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_greece_nuts_1", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_finland_nuts_2_2010", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes", "region_ireland_nuts_3", "region_estonia_nuts_3", "region_croatia_nuts_3", "region_croatia_nuts_2")
## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_greece_nuts_1", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes", "region_ireland_nuts_3", "region_estonia_nuts_3", "region_croatia_nuts_3", "region_croatia_nuts_2", "regions_all_countries", "region_cyprus")
## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes", "region_ireland_nuts_3", "region_greece_nuts_1_mod", "region_estonia_nuts_3", "region_croatia_nuts_3", "region_croatia_nuts_2", "regions_all_countries", "region_cyprus", "region_finland_nuts_2")
First let’s take all coding information from the contents of the SPSS variables:
require(tidyr)
## Loading required package: tidyr
characters <- geography_chars %>%
separate ( data = ., col = rowid,
into = c("survey_id", "version", "unique_id"),
sep = "_") %>%
select ( -all_of("unique_id")) %>%
distinct_all() %>%
pivot_longer( cols = -all_of(c( "survey_id", "version", "isocntry", "region_nuts_codes")),
names_to = "variable_name",
values_to = "character_value") %>%
filter ( # remove inappropriate values - referring to missing values due to questionnaire filtering, etc.
! grepl( "^Inap|99", character_value) ) %>%
filter ( # remove missing values from the original code, these are empty values from the full_join
! is.na(character_value))
set.seed(2021)
sample_n(characters, 6)
## # A tibble: 6 x 6
## survey_id version isocntry region_nuts_codes variable_name character_value
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ZA6595 v3-0-0 LV LV003 region_latvia_nu~ 4
## 2 ZA7488 v1-0-0 LT LT006 regions_all_coun~ 7
## 3 ZA6861 v1-2-0 DE-E DEE regions_all_coun~ 15
## 4 ZA5877 v2-0-0 HU HU31 region_hungary_n~ 3
## 5 ZA6595 v3-0-0 IT ITG region_italy_nut~ 5
## 6 ZA6595 v3-0-0 LT LT00A region_lithuania~ 1
The character values in fact contain no useful information, so we reduce this table to keep all variations of the region_nuts_codes.
characters %>%
select ( -.data$character_value ) %>%
distinct_all( )
## # A tibble: 2,966 x 5
## survey_id version isocntry region_nuts_codes variable_name
## <chr> <chr> <chr> <chr> <chr>
## 1 ZA5877 v2-0-0 BE BE10 region_belgium_nuts_2
## 2 ZA5877 v2-0-0 BE BE10 region_belgium_nuts_1
## 3 ZA5877 v2-0-0 BE BE33 region_belgium_nuts_2
## 4 ZA5877 v2-0-0 BE BE33 region_belgium_nuts_1
## 5 ZA5877 v2-0-0 BE BE31 region_belgium_nuts_2
## 6 ZA5877 v2-0-0 BE BE31 region_belgium_nuts_1
## 7 ZA5877 v2-0-0 BE BE21 region_belgium_nuts_2
## 8 ZA5877 v2-0-0 BE BE21 region_belgium_nuts_1
## 9 ZA5877 v2-0-0 BE BE22 region_belgium_nuts_2
## 10 ZA5877 v2-0-0 BE BE22 region_belgium_nuts_1
## # ... with 2,956 more rows
The variable name contains misleading information, so we should not rely on it, either.
Then let’s take their labels, when they exist:
labels <- geography_labels %>%
separate ( data = ., col = rowid,
into = c("survey_id", "version", "unique_id"),
sep = "_") %>%
select ( -all_of("unique_id") ) %>%
distinct_all() %>%
pivot_longer( cols = -all_of(c( "survey_id", "version", "isocntry", "region_nuts_codes")),
names_to = "variable_name",
values_to = "value_label") %>%
filter ( ! is.na( value_label) ) %>%
filter ( ! grepl("^Inap", value_label) ) %>%
rename ( # the label representation keeps the variable labels; the labels of the code are the names
region_nuts_names = region_nuts_codes
)
And join this information:
At this point, we see duplications, because sometimes we find coding information in one survey for both NUTS1
and NUTS2
.
consistent_coding_information <- coding_information %>%
filter ( nchar(region_nuts_codes) == 4 & grepl("2", .data$variable_name) |
nchar(region_nuts_codes) == 3 & grepl("1", .data$variable_name) |
nchar(region_nuts_codes) == 5 & grepl("3", .data$variable_name) )
# luckily there is no inconsistent information present
inconsistent_coding_information <- consistent_coding_information %>%
anti_join ( coding_information,
by = c("survey_id", "version", "isocntry", "region_nuts_codes",
"region_nuts_names", "variable_name", "value_label"))
The table has some problems.
While Malta has two NUTS-3 regions, no regional information can be found in the GESIS datasets. We suggest to use MT00
and MT0
technical codes. It is not possible to reconstruct Malta’s data to MT001
and MT002
.
mt_metadata <- coding_information %>% filter ( isocntry == "MT" ) %>%
distinct_all()
mt_metadata
## # A tibble: 2 x 7
## survey_id version isocntry region_nuts_codes region_nuts_nam~ variable_name
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ZA6861 v1-2-0 MT <NA> MALTA regions_all_cou~
## 2 ZA7572 v1-0-0 MT <NA> MALTA regions_all_cou~
## # ... with 1 more variable: value_label <chr>
Luxembourg does not have NUTS divisions, but the GESIS datafile contains regional information. This typology does not correspond with Luxembourgs ISO-3166-2 divisions, i.e. the 12 cantons. We suggest to use LU0
and LU00
and LU000
technical codes.
lu_metadata <- coding_information %>% filter ( isocntry == "LU" ) %>%
distinct_all()
lu_metadata
## # A tibble: 32 x 7
## survey_id version isocntry region_nuts_codes region_nuts_nam~ variable_name
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ZA5877 v2-0-0 LU LU LUXEMBOURG region_luxembo~
## 2 ZA5877 v2-0-0 LU LU LUXEMBOURG region_luxembo~
## 3 ZA5877 v2-0-0 LU LU LUXEMBOURG region_luxembo~
## 4 ZA5877 v2-0-0 LU LU LUXEMBOURG region_luxembo~
## 5 ZA6595 v3-0-0 LU LU LUXEMBOURG region_luxembo~
## 6 ZA6595 v3-0-0 LU LU LUXEMBOURG region_luxembo~
## 7 ZA6595 v3-0-0 LU LU LUXEMBOURG region_luxembo~
## 8 ZA6595 v3-0-0 LU LU LUXEMBOURG region_luxembo~
## 9 ZA6861 v1-2-0 LU LU LUXEMBOURG region_luxembo~
## 10 ZA6861 v1-2-0 LU LU LUXEMBOURG regions_all_co~
## # ... with 22 more rows, and 1 more variable: value_label <chr>
final_coding_information <- coding_information %>%
mutate ( region_nuts_codes = case_when (
.data$isocntry == "LU" ~ "LU000", # we are certain because LU has no NUTS divisions
.data$isocntry == "MT" ~ "MT00", # Kantar does not follow Malta's NUTS3,
TRUE ~ .data$region_nuts_codes
)) %>%
filter ( nchar(region_nuts_codes) == 4 & grepl("2", .data$variable_name) |
nchar(region_nuts_codes) == 3 & grepl("1", .data$variable_name) |
nchar(region_nuts_codes) == 5 & grepl("3", .data$variable_name) |
.data$isocntry %in% c("LU", "CY", "MT", "EE"))
require(regions)
## Loading required package: regions
code_table_validated <- final_coding_information %>%
mutate ( # Whatever nuts codes we have, let's see if they are valid in the last 3 NUTS definitions
# Bug fix for regions: return character vector
valid_2016 = as.character(regions::validate_geo_code(.data$region_nuts_codes, nuts_year = 2016)),
valid_2013 = as.character(regions::validate_geo_code(.data$region_nuts_codes, nuts_year = 2013)),
valid_2010 = as.character(regions::validate_geo_code(.data$region_nuts_codes, nuts_year = 2010))
)
As we can see we have all sorts of problems here. The regional coding is on either NUTS1
, NUTS2
or NUTS3
level, and either follows the NUTS2010
or the NUTS2013
typology’s coding.
set.seed(1997)
sample_n(code_table_validated %>%
select ( all_of(c("region_nuts_codes")),
starts_with("valid")
), 12)
## # A tibble: 12 x 4
## region_nuts_codes valid_2016 valid_2013 valid_2010
## <chr> <chr> <chr> <chr>
## 1 UKC nuts_level_1 nuts_level_1 nuts_level_1
## 2 UKN nuts_level_1 nuts_level_1 nuts_level_1
## 3 FR30 invalid nuts_level_2 nuts_level_2
## 4 ES62 nuts_level_2 nuts_level_2 nuts_level_2
## 5 HU32 nuts_level_2 nuts_level_2 nuts_level_2
## 6 UKG nuts_level_1 nuts_level_1 nuts_level_1
## 7 FR30 invalid nuts_level_2 nuts_level_2
## 8 SI021 invalid invalid nuts_level_3
## 9 BG344 nuts_level_3 nuts_level_3 nuts_level_3
## 10 CZ02 nuts_level_2 nuts_level_2 nuts_level_2
## 11 LT007 invalid nuts_level_3 nuts_level_3
## 12 FR21 invalid nuts_level_2 nuts_level_2
all_labelling <- code_table_validated %>%
select ( -all_of(c("version" )) ) %>%
distinct_all() %>%
#filter ( isocntry == "BE") %>%
unite ( col = survey_var, c("survey_id", "variable_name")) %>%
select ( all_of(c("survey_var", "isocntry", "region_nuts_codes", "value_label")) ) %>%
distinct_all () %>%
tibble::rownames_to_column() %>%
pivot_wider ( names_from = "survey_var",
values_from = "value_label" ) %>%
pivot_longer ( -all_of(c( "rowname", "isocntry", "region_nuts_codes")),
names_to = "survey_var",
values_to = "region_nuts_labels") %>%
filter (!is.na( .data$region_nuts_labels)) %>%
left_join ( code_table_validated %>%
select ( all_of(c("region_nuts_codes")),
starts_with("valid")
), by = "region_nuts_codes"
)
What are the problems? In Estonia we need to recode, because almost only the country code is present. For example,
Laene-Eesti (Western Estonia) is EE004
Lääne-Eesti. When a region cannot be associated with a valid NUTS3 code, create an invalid, technical code. For example, Tallinn should be EE00T
(5 characters). This makes sure that when we aggregate on NUTS0
, NUTS1
or NUTS2
levels, it will be correctly assigned to EE
, EE0
and EE00
.
ee <- all_labelling %>%
filter ( .data$isocntry == "EE")
ee
## # A tibble: 6,161 x 8
## rowname isocntry region_nuts_codes survey_var region_nuts_labels valid_2016
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2914 EE EE ZA5877_reg~ Kirde-Eesti / Viru~ country
## 2 2914 EE EE ZA5877_reg~ Kirde-Eesti / Viru~ country
## 3 2914 EE EE ZA5877_reg~ Kirde-Eesti / Viru~ country
## 4 2914 EE EE ZA5877_reg~ Kirde-Eesti / Viru~ country
## 5 2914 EE EE ZA5877_reg~ Kirde-Eesti / Viru~ country
## 6 2914 EE EE ZA5877_reg~ Kirde-Eesti / Viru~ country
## 7 2915 EE EE ZA5877_reg~ Tallinn country
## 8 2915 EE EE ZA5877_reg~ Tallinn country
## 9 2915 EE EE ZA5877_reg~ Tallinn country
## 10 2915 EE EE ZA5877_reg~ Tallinn country
## # ... with 6,151 more rows, and 2 more variables: valid_2013 <chr>,
## # valid_2010 <chr>
Cyprus probably should be treated similarly to Malta and Luxembourg, for different reasons – because of the country’s division, the NUTS regions were not yet established.
write.csv(all_labelling, file.path('data-raw', 'eurostat-all_labelling.csv'), row.names=FALSE)
saveRDS(all_labelling, file.path('data-raw', 'eurostat-all_labelling.rds'))