Retrospective survey harmonization comes with many challenges, as we have shown in the introduction to this tutorial case study. In this example, we will work with Eurobarometer’s data.

Please use the development version of retroharmonize:

devtools::install_github("antaldaniel/retroharmonize")
## Warning: package 'dplyr' was built under R version 4.0.4
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning: package 'lubridate' was built under R version 4.0.4
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Get the Data

retroharmonize is not associated with Eurobarometer, its creators (Kantar), or its archivists (GESIS). We assume that you have acquired the necessary files from GESIS after carefully reading their terms and that you have placed them on a path that you call gesis_dir. The precise documentation of the data we use can be found in this supporting blogpost. To reproduce this blogpost, you will need ZA5877_v2-0-0.sav, ZA6595_v3-0-0.sav, ZA6861_v1-2-0.sav, ZA7488_v1-0-0.sav, ZA7572_v1-0-0.sav in a directory that you have named gesis_dir.

# Not run in the blogpost. In the repo we have a saved version.
climate_change_files <- c("ZA5877_v2-0-0.sav", "ZA6595_v3-0-0.sav",  "ZA6861_v1-2-0.sav", 
                          "ZA7488_v1-0-0.sav", "ZA7572_v1-0-0.sav")

eb_waves <- read_surveys(file.path(gesis_dir, climate_change_files), .f='read_spss')

if (dir.exists("data-raw")) {
  save ( eb_waves,  file = file.path("data-raw", "eb_climate_change_waves.rda") )
}
if ( file.exists( file.path("data-raw", "eb_climate_change_waves.rda") )) {
  load (file.path( "data-raw", "eb_climate_change_waves.rda" ) )
} else {
  load (file.path("..", "..",  "data-raw", "eb_climate_change_waves.rda") )
}

The eb_waves nested list contains five surveys imported from SPSS to the survey class of retroharmonize. The survey class is a data.frame that retains important metadata for further harmonization.

document_waves (eb_waves)
## # A tibble: 5 x 5
##   id            filename           ncol  nrow object_size
##   <chr>         <chr>             <int> <int>       <dbl>
## 1 ZA5877_v2-0-0 ZA5877_v2-0-0.sav   604 27919   139352456
## 2 ZA6595_v3-0-0 ZA6595_v3-0-0.sav   519 27718   119370440
## 3 ZA6861_v1-2-0 ZA6861_v1-2-0.sav   657 27901   151397528
## 4 ZA7488_v1-0-0 ZA7488_v1-0-0.sav   752 27339   169465928
## 5 ZA7572_v1-0-0 ZA7572_v1-0-0.sav   348 27655    80562432

Beware of the object sizes. If you work with many surveys, memory-efficient programming becomes imperative. We will be subsetting whenever possible.

Metadata analysis

As noted before, be prepared to work with nested lists. Each imported survey is nested as a data frame in the eb_waves list.

Metadata: Protocol Variables

Eurobarometer refers to certain metadata elements, like interviewee cooperation level or the date of a survey interview as protocol variables. Let’s start here. This will be our template to harmonize more and more aspects of the five surveys (which are, in fact, already harmonizations of about 30 surveys conducted in a single ‘wave’ in multiple countries.)

# select variables of interest from the metadata
eb_protocol_metadata <- eb_climate_metadata %>%
  filter ( .data$label_orig %in% c("date of interview") |
             .data$var_name_orig == "rowid")  %>%
  suggest_var_names( survey_program = "eurobarometer" )

# subset and harmonize these variables in all nested list items of 'waves' of surveys
interview_dates <- harmonize_var_names(eb_waves, 
                                       eb_protocol_metadata )

# apply similar data processing rules to same variables
interview_dates <- lapply (interview_dates, 
                      function (x) x %>% mutate ( date_of_interview = as_character(.data$date_of_interview) )
                      )

# join the individual survey tables into a single table 
interview_dates <- as_tibble ( Reduce (rbind, interview_dates) )

# Check the variable classes.

vapply(interview_dates, function(x) class(x)[1], character(1))
##             rowid date_of_interview 
##       "character"       "character"

This is our sample workflow for each block of variables.

  1. Get a unique identifier.
  2. Add other variables.
  3. Harmonize the variable names.
  4. Subset the data, leaving out anything that you do not harmonize in this block.
  5. Apply some normalization in a nested list.
  6. When the variables are harmonized to have the same names and class, merge them into a data.frame-like tibble object.

Now finish the harmonization. Wednesday, 31st October 2018 should become a Date type 2018-10-31.

require(lubridate)
harmonize_date <- function(x) {
  x <- tolower(as.character(x))
  x <- gsub("monday|tuesday|wednesday|thursday|friday|saturday|sunday|\\,|th|nd|rd|st", "", x)
  x <- gsub("decemberber", "december", x) # all those annoying real-life data problems!
  x <- stringr::str_trim (x, "both")
  x <- gsub("^0", "", x )
  x <- gsub("\\s\\s", "\\s", x)
  lubridate::dmy(x) 
}

interview_dates <- interview_dates %>%
  mutate ( date_of_interview = harmonize_date(.data$date_of_interview) )

vapply(interview_dates, function(x) class(x)[1], character(1))
##             rowid date_of_interview 
##       "character"            "Date"

To avoid duplication of row IDs in surveys that may not be unique in different surveys, we created a simple, sequential ID for each survey, including the ID of the original file.

set.seed(2021)
sample_n(interview_dates, 6)
## # A tibble: 6 x 2
##   rowid               date_of_interview
##   <chr>               <date>           
## 1 ZA7488_v1-0-0_7016  2018-10-28       
## 2 ZA7488_v1-0-0_19187 2018-11-02       
## 3 ZA6861_v1-2-0_1218  2017-03-18       
## 4 ZA6861_v1-2-0_4142  2017-03-21       
## 5 ZA7572_v1-0-0_12363 2019-04-17       
## 6 ZA7572_v1-0-0_8071  2019-04-18

After this type-conversion problem let’s see an issue when an original SPSS variable can have two meaningful R representations.

Metadata: Geographical information

Let’s continue with harmonizing geographical information in the files. In this example, var_name_suggested will contain the harmonized variable name. It is likely that you will have to make this call after carefully reading the original questionnaires and codebooks.

eb_regional_metadata <- eb_climate_metadata %>%
  filter ( grepl( "rowid|isocntry|^nuts$|^p7", .data$var_name_orig) ) %>%
  suggest_var_names( survey_program = "eurobarometer" )

harmonize_var_names() takes all variables in the subsetted, geographical metadata table, and brings them to the harmonized var_name_suggested name. The function subsets the surveys to avoid the presence of non-harmonized variables. All regional NUTS codes become geo in our case:

geography <- harmonize_var_names(eb_waves, 
                                 eb_regional_metadata)

If you are used to working with single survey files, you are likely to work in a tabular format, which easily converts into a data.frame-like object. In our example, we use tidyverse’s tibble. However, when working with longitudinal data, it is far simpler to work with nested lists, because tables usually have different dimensions (neither the rows corresponding to observations or the columns are the same across all survey files).

In the nested list, each list element is a single, tabular-format survey. In fact, the surveys are in retroharmonize’s survey class, which is a rich tibble that contains the metadata and the processing history of the survey.

The regional information in the Eurobarometer files is contained in the nuts variable. We want to keep both the original labels and values. The original values are the region’s codes, and the labels are the names. The easiest and fastest solution is the base R lapply loop.

# The character representation keeps the original SPSS coding.
geography_char <- lapply ( geography, 
                      function (x) x %>% mutate( across(everything(), as.character))
                      )  

# The label representation keeps the original SPSS labeling.
geography_label <- lapply ( geography, 
                      function (x) x %>% mutate( across(everything(), as_character ))
                      )  

Because each table has exactly the same columns, we can simply use rbind() and reduce the list to a modern data.frame, i.e. a tibble.

## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_finland_nuts_2_2010", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_cyprus_republic", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes")
## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_greece_nuts_1", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_finland_nuts_2_2010", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes", "region_ireland_nuts_3", "region_estonia_nuts_3", "region_croatia_nuts_3", "region_croatia_nuts_2")
## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_greece_nuts_1", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes", "region_ireland_nuts_3", "region_estonia_nuts_3", "region_croatia_nuts_3", "region_croatia_nuts_2", "regions_all_countries", "region_cyprus")
## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes", "region_ireland_nuts_3", "region_greece_nuts_1_mod", "region_estonia_nuts_3", "region_croatia_nuts_3", "region_croatia_nuts_2", "regions_all_countries", "region_cyprus", "region_finland_nuts_2")

Let’s see a dozen cases:

set.seed(2021)           # to see the same sample
sample_n(geography_chars, 12)
## # A tibble: 12 x 57
##    rowid        isocntry region_france_nut~ region_france_nu~ region_belgium_nu~
##    <chr>        <chr>    <chr>              <chr>             <chr>             
##  1 ZA7488_v1-0~ SI       99                 99                99                
##  2 ZA7488_v1-0~ PL       99                 99                99                
##  3 ZA6861_v1-2~ DK       99                 99                99                
##  4 ZA6861_v1-2~ FI       99                 99                99                
##  5 ZA7572_v1-0~ SE       99                 99                99                
##  6 ZA7572_v1-0~ IT       99                 99                99                
##  7 ZA6861_v1-2~ IE       99                 99                99                
##  8 ZA6861_v1-2~ RO       99                 99                99                
##  9 ZA7488_v1-0~ CY       99                 99                99                
## 10 ZA6595_v3-0~ HR       99                 99                99                
## 11 ZA7572_v1-0~ CZ       99                 99                99                
## 12 ZA6861_v1-2~ PT       99                 99                99                
## # ... with 52 more variables: region_belgium_nuts_1 <chr>,
## #   region_netherlands_nuts_2 <chr>, region_netherlands_nuts_1 <chr>,
## #   region_germany_nuts_1 <chr>, region_italy_nuts_2_mod <chr>,
## #   region_italy_nuts_1 <chr>, region_italy_nuts_1_mod <chr>,
## #   region_luxembourg <chr>, region_denmark_nuts_2 <chr>, region_ireland <chr>,
## #   region_united_kingdom_nuts_1 <chr>, region_united_kingdom_nuts_1_mod <chr>,
## #   region_greece_nuts_2 <chr>, region_greece_nuts_1 <chr>,
## #   region_spain_nuts_2 <chr>, region_spain_nuts_1 <chr>,
## #   region_spain_nuts_1_mod <chr>, region_portugal_nuts_2 <chr>,
## #   region_finland_nuts_3 <chr>, region_finland_nuts_2_2003_2006 <chr>,
## #   region_finland_nuts_2_2010 <chr>, region_sweden_nuts_2 <chr>,
## #   region_sweden_nuts_1 <chr>, region_austria_nuts_2 <chr>,
## #   region_austria_nuts_1 <chr>, region_cyprus_republic <chr>,
## #   region_czech_republic_nuts_2 <chr>, region_estonia <chr>,
## #   region_hungary_nuts_2 <chr>, region_hungary_nuts_1 <chr>,
## #   region_latvia_nuts_3 <chr>, region_lithuania_nuts_3 <chr>,
## #   region_poland_nuts_2 <chr>, region_poland_nuts_1 <chr>,
## #   region_slovakia_nuts_2 <chr>, region_slovenia_nuts_3 <chr>,
## #   region_slovenia_nuts_2 <chr>, region_bulgaria_nuts_3 <chr>,
## #   region_bulgaria_nuts_2 <chr>, region_romania_nuts_2 <chr>,
## #   region_romania_nuts_1 <chr>, region_croatia <chr>, region_nuts_codes <chr>,
## #   region_ireland_nuts_3 <chr>, region_ireland_nuts_2 <chr>,
## #   region_greece_nuts_1_mod <chr>, region_estonia_nuts_3 <chr>,
## #   region_croatia_nuts_3 <chr>, region_croatia_nuts_2 <chr>,
## #   regions_all_countries <chr>, region_cyprus <chr>,
## #   region_finland_nuts_2 <chr>

The idea is that we do similar variable harmonization block by block, and eventually we will join them together. Next step: socio-demography and weights.

## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_finland_nuts_2_2010", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_cyprus_republic", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes")
## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_greece_nuts_1", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_finland_nuts_2_2010", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes", "region_ireland_nuts_3", "region_estonia_nuts_3", "region_croatia_nuts_3", "region_croatia_nuts_2")
## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_greece_nuts_1", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes", "region_ireland_nuts_3", "region_estonia_nuts_3", "region_croatia_nuts_3", "region_croatia_nuts_2", "regions_all_countries", "region_cyprus")
## Joining, by = c("rowid", "isocntry", "region_france_nuts_2", "region_france_nuts_1", "region_belgium_nuts_2", "region_belgium_nuts_1", "region_netherlands_nuts_2", "region_netherlands_nuts_1", "region_germany_nuts_1", "region_italy_nuts_2_mod", "region_italy_nuts_1", "region_italy_nuts_1_mod", "region_luxembourg", "region_denmark_nuts_2", "region_united_kingdom_nuts_1", "region_united_kingdom_nuts_1_mod", "region_greece_nuts_2", "region_spain_nuts_2", "region_spain_nuts_1", "region_spain_nuts_1_mod", "region_portugal_nuts_2", "region_sweden_nuts_2", "region_sweden_nuts_1", "region_austria_nuts_2", "region_austria_nuts_1", "region_czech_republic_nuts_2", "region_hungary_nuts_2", "region_hungary_nuts_1", "region_latvia_nuts_3", "region_lithuania_nuts_3", "region_poland_nuts_2", "region_poland_nuts_1", "region_slovakia_nuts_2", "region_slovenia_nuts_3", "region_slovenia_nuts_2", "region_bulgaria_nuts_2", "region_romania_nuts_2", "region_romania_nuts_1", "region_nuts_codes", "region_ireland_nuts_3", "region_greece_nuts_1_mod", "region_estonia_nuts_3", "region_croatia_nuts_3", "region_croatia_nuts_2", "regions_all_countries", "region_cyprus", "region_finland_nuts_2")

Creating a code table

First let’s take all coding information from the contents of the SPSS variables:

require(tidyr)
## Loading required package: tidyr
characters <- geography_chars %>% 
  separate ( data = ., col = rowid,
             into = c("survey_id", "version", "unique_id"), 
             sep = "_") %>%
  select ( -all_of("unique_id")) %>%
  distinct_all() %>%
  pivot_longer( cols = -all_of(c( "survey_id", "version", "isocntry", "region_nuts_codes")), 
                names_to = "variable_name", 
                values_to = "character_value") %>%
  filter (  # remove inappropriate values - referring to missing values due to questionnaire filtering, etc.
            ! grepl( "^Inap|99", character_value) ) %>%
  filter (  # remove missing values from the original code, these are empty values from the full_join
            ! is.na(character_value))
set.seed(2021)
sample_n(characters, 6)
## # A tibble: 6 x 6
##   survey_id version isocntry region_nuts_codes variable_name     character_value
##   <chr>     <chr>   <chr>    <chr>             <chr>             <chr>          
## 1 ZA6595    v3-0-0  LV       LV003             region_latvia_nu~ 4              
## 2 ZA7488    v1-0-0  LT       LT006             regions_all_coun~ 7              
## 3 ZA6861    v1-2-0  DE-E     DEE               regions_all_coun~ 15             
## 4 ZA5877    v2-0-0  HU       HU31              region_hungary_n~ 3              
## 5 ZA6595    v3-0-0  IT       ITG               region_italy_nut~ 5              
## 6 ZA6595    v3-0-0  LT       LT00A             region_lithuania~ 1

The character values in fact contain no useful information, so we reduce this table to keep all variations of the region_nuts_codes.

characters %>% 
  select ( -.data$character_value ) %>%
  distinct_all(  )
## # A tibble: 2,966 x 5
##    survey_id version isocntry region_nuts_codes variable_name        
##    <chr>     <chr>   <chr>    <chr>             <chr>                
##  1 ZA5877    v2-0-0  BE       BE10              region_belgium_nuts_2
##  2 ZA5877    v2-0-0  BE       BE10              region_belgium_nuts_1
##  3 ZA5877    v2-0-0  BE       BE33              region_belgium_nuts_2
##  4 ZA5877    v2-0-0  BE       BE33              region_belgium_nuts_1
##  5 ZA5877    v2-0-0  BE       BE31              region_belgium_nuts_2
##  6 ZA5877    v2-0-0  BE       BE31              region_belgium_nuts_1
##  7 ZA5877    v2-0-0  BE       BE21              region_belgium_nuts_2
##  8 ZA5877    v2-0-0  BE       BE21              region_belgium_nuts_1
##  9 ZA5877    v2-0-0  BE       BE22              region_belgium_nuts_2
## 10 ZA5877    v2-0-0  BE       BE22              region_belgium_nuts_1
## # ... with 2,956 more rows

The variable name contains misleading information, so we should not rely on it, either.

Then let’s take their labels, when they exist:

labels <- geography_labels %>% 
  separate ( data = ., col = rowid,
             into = c("survey_id", "version", "unique_id"), 
             sep = "_") %>%
  select ( -all_of("unique_id") ) %>%
  distinct_all() %>%
  pivot_longer( cols = -all_of(c( "survey_id", "version", "isocntry", "region_nuts_codes")), 
                names_to = "variable_name", 
                values_to = "value_label") %>%
  filter ( ! is.na(  value_label) ) %>%
  filter ( ! grepl("^Inap", value_label) ) %>%
  rename ( # the label representation keeps the variable labels; the labels of the code are the names 
           region_nuts_names  = region_nuts_codes
           )

And join this information:

At this point, we see duplications, because sometimes we find coding information in one survey for both NUTS1 and NUTS2.

consistent_coding_information <- coding_information %>%
  filter ( nchar(region_nuts_codes) == 4 & grepl("2", .data$variable_name) | 
           nchar(region_nuts_codes) == 3 & grepl("1", .data$variable_name) | 
           nchar(region_nuts_codes) == 5 & grepl("3", .data$variable_name) ) 

# luckily there is no inconsistent information present 
inconsistent_coding_information <- consistent_coding_information %>%
  anti_join ( coding_information, 
              by = c("survey_id", "version", "isocntry", "region_nuts_codes",
                     "region_nuts_names", "variable_name", "value_label"))

The table has some problems.

Duplicate rows

Technical coding is needed

Malta

While Malta has two NUTS-3 regions, no regional information can be found in the GESIS datasets. We suggest to use MT00 and MT0 technical codes. It is not possible to reconstruct Malta’s data to MT001 and MT002.

mt_metadata <- coding_information %>% filter ( isocntry == "MT" ) %>%
  distinct_all()
mt_metadata
## # A tibble: 2 x 7
##   survey_id version isocntry region_nuts_codes region_nuts_nam~ variable_name   
##   <chr>     <chr>   <chr>    <chr>             <chr>            <chr>           
## 1 ZA6861    v1-2-0  MT       <NA>              MALTA            regions_all_cou~
## 2 ZA7572    v1-0-0  MT       <NA>              MALTA            regions_all_cou~
## # ... with 1 more variable: value_label <chr>

Luxembourg

Luxembourg does not have NUTS divisions, but the GESIS datafile contains regional information. This typology does not correspond with Luxembourgs ISO-3166-2 divisions, i.e. the 12 cantons. We suggest to use LU0 and LU00 and LU000 technical codes.

lu_metadata <- coding_information %>% filter ( isocntry == "LU" ) %>%
  distinct_all()

lu_metadata
## # A tibble: 32 x 7
##    survey_id version isocntry region_nuts_codes region_nuts_nam~ variable_name  
##    <chr>     <chr>   <chr>    <chr>             <chr>            <chr>          
##  1 ZA5877    v2-0-0  LU       LU                LUXEMBOURG       region_luxembo~
##  2 ZA5877    v2-0-0  LU       LU                LUXEMBOURG       region_luxembo~
##  3 ZA5877    v2-0-0  LU       LU                LUXEMBOURG       region_luxembo~
##  4 ZA5877    v2-0-0  LU       LU                LUXEMBOURG       region_luxembo~
##  5 ZA6595    v3-0-0  LU       LU                LUXEMBOURG       region_luxembo~
##  6 ZA6595    v3-0-0  LU       LU                LUXEMBOURG       region_luxembo~
##  7 ZA6595    v3-0-0  LU       LU                LUXEMBOURG       region_luxembo~
##  8 ZA6595    v3-0-0  LU       LU                LUXEMBOURG       region_luxembo~
##  9 ZA6861    v1-2-0  LU       LU                LUXEMBOURG       region_luxembo~
## 10 ZA6861    v1-2-0  LU       LU                LUXEMBOURG       regions_all_co~
## # ... with 22 more rows, and 1 more variable: value_label <chr>

Final tables

final_coding_information <- coding_information %>%
  mutate ( region_nuts_codes = case_when (
    .data$isocntry == "LU" ~ "LU000",  # we are certain because LU has no NUTS divisions
    .data$isocntry == "MT" ~ "MT00",    # Kantar does not follow Malta's NUTS3, 
    TRUE ~ .data$region_nuts_codes
  )) %>%
  filter ( nchar(region_nuts_codes) == 4 & grepl("2", .data$variable_name) | 
           nchar(region_nuts_codes) == 3 & grepl("1", .data$variable_name) | 
           nchar(region_nuts_codes) == 5 & grepl("3", .data$variable_name) |
           .data$isocntry %in% c("LU", "CY", "MT", "EE")) 
require(regions)
## Loading required package: regions
code_table_validated <- final_coding_information %>%
  mutate ( # Whatever nuts codes we have, let's see if they are valid in the last 3 NUTS definitions
           # Bug fix for regions: return character vector
           valid_2016 = as.character(regions::validate_geo_code(.data$region_nuts_codes, nuts_year = 2016)),
           valid_2013 = as.character(regions::validate_geo_code(.data$region_nuts_codes, nuts_year = 2013)),
           valid_2010 = as.character(regions::validate_geo_code(.data$region_nuts_codes, nuts_year = 2010))
           )

As we can see we have all sorts of problems here. The regional coding is on either NUTS1, NUTS2 or NUTS3 level, and either follows the NUTS2010 or the NUTS2013 typology’s coding.

set.seed(1997)
sample_n(code_table_validated %>% 
           select ( all_of(c("region_nuts_codes")), 
                    starts_with("valid")
           ), 12)
## # A tibble: 12 x 4
##    region_nuts_codes valid_2016   valid_2013   valid_2010  
##    <chr>             <chr>        <chr>        <chr>       
##  1 UKC               nuts_level_1 nuts_level_1 nuts_level_1
##  2 UKN               nuts_level_1 nuts_level_1 nuts_level_1
##  3 FR30              invalid      nuts_level_2 nuts_level_2
##  4 ES62              nuts_level_2 nuts_level_2 nuts_level_2
##  5 HU32              nuts_level_2 nuts_level_2 nuts_level_2
##  6 UKG               nuts_level_1 nuts_level_1 nuts_level_1
##  7 FR30              invalid      nuts_level_2 nuts_level_2
##  8 SI021             invalid      invalid      nuts_level_3
##  9 BG344             nuts_level_3 nuts_level_3 nuts_level_3
## 10 CZ02              nuts_level_2 nuts_level_2 nuts_level_2
## 11 LT007             invalid      nuts_level_3 nuts_level_3
## 12 FR21              invalid      nuts_level_2 nuts_level_2
all_labelling <- code_table_validated %>%
  select ( -all_of(c("version" )) ) %>%
  distinct_all() %>%
  #filter ( isocntry == "BE") %>%
  unite ( col = survey_var, c("survey_id", "variable_name")) %>%
  select ( all_of(c("survey_var", "isocntry", "region_nuts_codes", "value_label")) ) %>%
  distinct_all () %>%
  tibble::rownames_to_column() %>%
  pivot_wider ( names_from = "survey_var", 
                values_from = "value_label" ) %>%
  pivot_longer ( -all_of(c( "rowname", "isocntry", "region_nuts_codes")), 
                 names_to = "survey_var",
                 values_to = "region_nuts_labels") %>%
  filter (!is.na( .data$region_nuts_labels)) %>%
  left_join ( code_table_validated %>%
                select ( all_of(c("region_nuts_codes")),
                         starts_with("valid")
                         ), by = "region_nuts_codes"
  )

What are the problems? In Estonia we need to recode, because almost only the country code is present. For example,
Laene-Eesti (Western Estonia) is EE004 Lääne-Eesti. When a region cannot be associated with a valid NUTS3 code, create an invalid, technical code. For example, Tallinn should be EE00T (5 characters). This makes sure that when we aggregate on NUTS0, NUTS1 or NUTS2 levels, it will be correctly assigned to EE, EE0 and EE00.

ee <- all_labelling %>%
  filter ( .data$isocntry == "EE")

ee
## # A tibble: 6,161 x 8
##    rowname isocntry region_nuts_codes survey_var  region_nuts_labels  valid_2016
##    <chr>   <chr>    <chr>             <chr>       <chr>               <chr>     
##  1 2914    EE       EE                ZA5877_reg~ Kirde-Eesti / Viru~ country   
##  2 2914    EE       EE                ZA5877_reg~ Kirde-Eesti / Viru~ country   
##  3 2914    EE       EE                ZA5877_reg~ Kirde-Eesti / Viru~ country   
##  4 2914    EE       EE                ZA5877_reg~ Kirde-Eesti / Viru~ country   
##  5 2914    EE       EE                ZA5877_reg~ Kirde-Eesti / Viru~ country   
##  6 2914    EE       EE                ZA5877_reg~ Kirde-Eesti / Viru~ country   
##  7 2915    EE       EE                ZA5877_reg~ Tallinn             country   
##  8 2915    EE       EE                ZA5877_reg~ Tallinn             country   
##  9 2915    EE       EE                ZA5877_reg~ Tallinn             country   
## 10 2915    EE       EE                ZA5877_reg~ Tallinn             country   
## # ... with 6,151 more rows, and 2 more variables: valid_2013 <chr>,
## #   valid_2010 <chr>

Cyprus probably should be treated similarly to Malta and Luxembourg, for different reasons – because of the country’s division, the NUTS regions were not yet established.

write.csv(all_labelling, file.path('data-raw', 'eurostat-all_labelling.csv'), row.names=FALSE)
saveRDS(all_labelling, file.path('data-raw', 'eurostat-all_labelling.rds'))