Tagging variables with keywords

library(eurobarometer)
library(tidyverse)
library(tibble)
library(knitr)
library(kableExtra)
library(tidytext)
library(janitor)
library(stringi)

An important aspect of survey data harmonization - in addition to standardizing response coding and value labels - is the identification of variables that measure the same thing, and standardizing the names of these variables. The process is faciliated by the fact that variables typically have labels that succintly describe the content of the variable, and sometimes also contain information about the coding of responses (e.g., length of scale), the unit (e.g., age in years or categories) and whether the variable contains “original” respondent responses or some transformation of responses (e.g. recoding or index).

Fortunately, in many multi-wave surveys, such as the Eurobarometer, there is a degree of consistency in how variables are labelled in different editions of the survey, and most survey projects try to follow some labelling conventions.

Unfortunately, consistency is limited and conventions are not followed strictly, which makes standardizing labels a semi-manual task. Some of these issues stem from the limits in the number of characters of variable labels in common statistical software packages (e.g., SPSS), which leads to non-standard abbreviations.

This vignette outlines a procedure for arriving at standardized variable names via tokenization of normalized variable labels (var_label_norm), which are created by label_normalize as part of gesis_metadata_create. The logic is similar to text analysis, in that a document-term matrix is created with each variable label treated as a (very short) document, and terms treated as keywords. variables that contain the same sets of keywords can be - with some margin of error - assumed to correspond to the same or equivalent survey questions.

Procedure

  1. Read survey and variable metadata, match variables to variable groups as defined by GESIS-ZACAT, filter out technical and protocol variables,

  2. Standardize the spelling of certain often abbreviated words (e.g., ‘gov’, ‘govnmt’, and ‘govmnt’ for ‘government’),

  3. Tokenize,

  4. Eliminate stop words (but be careful to not eliminate too many, e.g. ‘right’ or ‘working’),

  5. Create a document-term matrix (actually, variable label - keyword matrix),

  6. Sort the columns according to the declining frequency of the keywords,

  7. Create standardized variable labels by concatenating keywords.

0. Survey and variable metadata

Variable metadata

The table contains metadata from 97 Eurobarometer files (each file corresponding to one EB round). The metadata are extracted from SPSS files with gesis_metadata_create.

metadata_database <- readRDS( file.path('../data-raw', 'eb_metadata_database_20200628.rds')) %>%
  mutate(archive_id = substr(filename, 1, 6))

Waves - Archive IDs

Mapping of GESIS Archive IDs (as used in file names) to wave numbers, from EB/44.2bis (ZA2828) to EB/92.1 (ZA7579) - 156 EB waves altogether.

eb_names <- rio::import("../data-raw/eb_waves_filenames.xlsx") %>%
  separate(x, c("archive_id", "x2", "title"), sep = ":") %>%
  separate(x2, c("wave", "dates"), sep = " \\(|\\)") %>%
  mutate(wave = gsub("Eurobarometer | ", "", wave))

eb_names %>%
  select(archive_id, wave, dates) %>%
  head() %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )
archive_id wave dates
ZA7579 92.1 2019
ZA7576 91.5 June-July 2019
ZA7575 91.4 2019
ZA7572 91.3 2019
ZA7562 91.2 2019
ZA7561 91.1 February-March 2019

Variable groups

Mapping of variable labels to variable groups, as defined in GESIS ZACAT, for 51 waves from EB 79.1 (ZA5687) through EB 92.1 (ZA7579).

varlabs <- rio::import("../data-raw/eb_variable_groups_zacat.xlsx", sheet = 4) %>%
  mutate(wave_desc = ifelse(substr(var_label_zacat, 1, 13) == "Eurobarometer", var_label_zacat, NA),
         var_label_zacat = ifelse(substr(var_label_zacat, 1, 13)=="Eurobarometer", NA, var_label_zacat),
         # group names are writte in camel case and != toupper(group names)
         group = ifelse(var_label_zacat != toupper(var_label_zacat), var_label_zacat, NA),
         # exceptions from the rule identifying group names
         group = ifelse(substr(group, 1, 9) == "OWNERSHIP", NA, group), 
         group = ifelse(substr(group, 1, 19) == "POSITIVE EU RESULTS", NA, group),
         group = ifelse(substr(group, 1, 24) == "OPINION LEADERSHIP INDEX", NA, group),
         var_label_zacat = ifelse(!is.na(group), NA, var_label_zacat)) %>%
  fill(wave_desc, group) %>%
  drop_na(group, var_label_zacat) %>%
  mutate(wave = substr(wave_desc, 1, 18),
         wave = gsub("^Eurobarometer ", "", wave),
         var_label_zacat = gsub("^W[0-9]{1,2}[A-Z] |^W3A/W4A ", "", var_label_zacat)) %>%
  left_join(eb_names) %>%
  select(wave, archive_id, group, var_label_zacat, everything())

varlabs %>%
  select(archive_id, wave, group, var_label_zacat) %>%
  head() %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )
archive_id wave group var_label_zacat
ZA7579 92.1 Archive and survey ID variables ARCHIVE STUDY NUMBER - DISTRIBUTOR
ZA7579 92.1 Archive and survey ID variables ARCHIVE STUDY NUMBER - PUBLISHER
ZA7579 92.1 Archive and survey ID variables DIGITAL OBJECT IDENTIFIER
ZA7579 92.1 Archive and survey ID variables GESIS ARCHIVE VERSION & DATE
ZA7579 92.1 Archive and survey ID variables DATASET EDITION
ZA7579 92.1 Archive and survey ID variables SURVEY IDENTIFICATION
# vector of archive IDs
archive_ids <- varlabs %>% distinct(archive_id) %>% pull(archive_id)

length(archive_ids)
#> [1] 51

1. Standardize the spelling

All procedures below applied to 51 waves, for which all metadata are available.

1.1. Translation table …

dict <- rio::import("../data-raw/var_lab_dict.xlsx", 
                    sheet = "dict") %>%
  gather(id, from, 2:6) %>%
  select(from, to) %>%
  drop_na(from)

head(dict) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )
from to
^cap_ common-agricultural-policy_
cap common-agricultural-policy
united_kingdom united-kingdom
czech_republic czechia
great_britain great-britain
united_germany united-germany

1.2. … applied to metadata

# this part now became very fast with stri_replace_all_regex

dat2 <- metadata_database %>%
  # filter only waves for which variable groups are available
  filter(archive_id %in% archive_ids,
         # exclude this strange variable
         var_label_orig != "filename") %>%
  # keep one row per variable
  distinct(filename, qb, var_name_orig, var_name_orig, var_label_orig, var_label_norm, archive_id) %>%
  # original variable labels in the data and in GESIS ZACAT are not exactly the same 
  # --> need some cleaning
  mutate(var_label_orig2 = gsub("^W[0-9]{1,2}", "", var_label_orig),
         var_label_orig2 = gsub("^W[0-9]{1,2}[A-Z]", "", var_label_orig),
         var_label_orig2 = gsub("^[0-9]{1-3} ", "", var_label_orig2),
         var_label_orig2 = gsub("^[A-Z][0-9]{1,2}[A-Z] ", "", var_label_orig2),
         var_label_orig2 = gsub("^[A-Z]{1,2}[0-9]{1,2} ", "", var_label_orig2),
         var_label_orig2 = gsub("^[A-Z]{1,2}[0-9]{1,2}[A-Z] ", "", var_label_orig2),
         var_label_orig2 = gsub("^[0-9]{1,2} ", "", var_label_orig2),
         var_label_orig2 = gsub("^[A-Z][0-9] [0-9]{1,2} ", "", var_label_orig2),
         var_label_orig2 = gsub("^W3A/W4A ", "", var_label_orig2)) %>%
  # join with table with variable groups
  left_join(varlabs, by = c("var_label_orig2" = "var_label_zacat", "archive_id")) %>%
  # remove variables in the following groups
  filter(!group %in% c("Archive and survey ID variables",
                       "Country group variables",
                       "Nationality (Q1)",
                       "Protocol variables (P)",
                       "Standard nation ID variables",
                       "Weight variables (W)",
                       "Nation ID variables",
                       "Protocol variables",
                       "Weight variables",
                       "Archive and ID variables",
                       "Household Weights"),
         # remove recoded variables
         !grepl("recoded|(rec)", var_label_norm)) %>%
  mutate(var_id = paste0(filename, " ", var_name_orig),
         # leftover question numbers at the beginning of value labels
         var_label_norm1 = gsub("^w[0-9]{1,2}", "", var_label_norm),
         var_label_norm1 = gsub("^[0-9]{1-3}_", "", var_label_norm1),
         var_label_norm1 = gsub("^[a-z][0-9]{2}[a-z]_", "", var_label_norm1),
         var_label_norm1 = gsub("^[a-z]{1,2}[0-9]{1,2}_", "", var_label_norm1),
         var_label_norm1 = gsub("^[a-z]{1,2}[0-9]{1,2}[a-z]_", "", var_label_norm1),
         var_label_norm1 = gsub("^[0-9]{1,2}_", "", var_label_norm1),
         var_label_norm1 = gsub("^[a-z]_", "", var_label_norm1),
         var_label_norm1 = gsub("^[a-z][0-9]_[0-9]{1,2}_", "", var_label_norm1)) %>%
  # apply the translation table
  mutate(var_label_norm1 = stri_replace_all_regex(var_label_norm1,
                                                  pattern = dict$from,
                                                  replacement = dict$to,
                                                  vectorize_all = FALSE)) %>%
  # replace underscore with space
  mutate(text = gsub("_", " ", var_label_norm1)) %>%
  ungroup() %>%
  # select the needed variables
  select(filename, qb, var_label_orig, var_label_norm, var_label_norm1, text, var_id, group)

2 & 3. Tokenize & Eliminate stopwords

# list of country codes to be converted to country names
eu_countries <- c(
  "at", "be", "bg", "cy", "cz", "dk", "de", "ee", "es", "fr",
  "fi", "gb", "gr", "hr", "hu", "ie", "it", "lt", "lv", "lu",
  "mt", "nl", "pl", "pt", "ro", "se", "si", "sk", "tr")

# !!! the problem here is that dk sometimes means "don't know" and sometimes Demnark !!!
# this needs to be treated in gesis_metadata_create, where the
# user defined missings are separately analyzed, they dk should 
# be converted there to do_not_know.

# it's not clear hot to do this, because both types of DK appear at the end of labels e.g. in
# mentioned/not mentioned questions, and they look very much the same

# ---- but normally one should be in attr(x, "na_values") 

tidy <- dat2 %>%
  # tokenize
  unnest_tokens(word, text, 
                token = "regex", pattern = " |  ") %>%
  # eliminate stop words, with some exceptions
  anti_join(stop_words %>%
              filter(!word %in% c("working", "right", "other")
                     )) %>%
  # eliminate numbers
  filter(!grepl("\\b\\d+\\b", word)) %>%
  # convert country codes to country names
  mutate(
    country = ifelse(word %in% eu_countries, word, NA),
         country = countrycode::countrycode(
           toupper(country), "iso2c", "country.name"),
    country = tolower(country),
    word = ifelse(!is.na(country), country, word),
    word = ifelse(word == "united kingdom", "united-kingdom", word),
    word = ifelse(word == "tcc", "turkish-cypriot-community", word)
    ) %>%
  select(-country)
tidy %>%
  select(var_id, var_label_norm1, word) %>%
  head() %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )
var_id var_label_norm1 word
ZA5687_v3-0-0.sav d11 age_exact age
ZA5687_v3-0-0.sav d11 age_exact exact
ZA5687_v3-0-0.sav qb9c_9 bribe_amount_in_euro_building_permits bribe
ZA5687_v3-0-0.sav qb9c_9 bribe_amount_in_euro_building_permits amount
ZA5687_v3-0-0.sav qb9c_9 bribe_amount_in_euro_building_permits euro
ZA5687_v3-0-0.sav qb9c_9 bribe_amount_in_euro_building_permits building
tidy %>%
  count(word) %>%
  arrange(desc(n)) %>% 
  head(., 10) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )
word n
eu 2934
sum 920
spontaneous 783
turkish-cypriot-community 703
denmark 700
2nd 697
other 617
nationality 558
country 530
internet 530

Note: this Denmark is probably a “don’t know”

4. Create a document-term matrix

Filter out labels that have words that appear in < 5 different waves. If we want to create a trend file, we’re only interested in variables that are repeated often.

tidy_small_2 <- tidy %>%
  group_by(word) %>%
  # number of distinct file names (waves) per word
  mutate(n_wave = n_distinct(filename)) %>%
  # flag words that appear in < 5 different files
  mutate(to_drop = as.numeric(n_wave < 10)) %>%
  # group by variable ID (file x variable name)
  group_by(var_id) %>%
  # copy to_drop across words in the same variable label
  mutate(to_drop = max(to_drop, na.rm = TRUE)) %>%
  # drop variable labels with the to_drop identifier
  filter(to_drop != 1) %>%
  # remove auxiliary variables
  select(-to_drop, -n_wave) %>%
  ungroup()

dtm <- tidy_small_2 %>%
  distinct(var_id, var_label_norm1, word) %>%
  mutate(n = word) %>%
  spread(word, n)

dim(dtm)
#> [1] 6632  429

5. Sort columns according to the declining frequency of the keywords

# get the order of variables according to declining proportion of missing value
# (declining popularity of keywords)
order <- dtm %>%
  summarise_all(funs(sum(is.na(.))/length(.))) %>%
  t() %>% as.vector()

names(order) <- names(dtm)

6. Create standardized variable labels by concatenating keywords

Standardized variable labels var_label_std are concatenated keywords, where the keywords occur in the order of their frequency in the entire corpus of variable labels. As a consequence, more common - likely broader - terms will come first, and less common - more specific - terms will come later.

dtm2 <- dtm %>%
  select( names(sort(order)) ) %>%
  mutate_at(vars(3:ncol(dtm)), 
            funs(ifelse(is.na(.), "", .))
            ) %>%
  unite("var_label_std", 3:ncol(dtm)) %>%
  mutate(var_label_std = gsub(
    "(\\_)\\1+", "\\1", var_label_std) ) %>%
  mutate(var_label_std = gsub(
    "^_|_$", "", var_label_std) )

dim(dtm2)
#> [1] 6632    3

These standardized variable names still have problems that need to be fixed.

dtm2 %>% 
  head(., 25) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )
var_label_norm1 var_id var_label_std
age_education ZA5687_v3-0-0.sav d8 age_education
age_education ZA5688_v6-0-0.sav d8 age_education
age_education ZA5689_v2-0-0.sav d8 age_education
age_education ZA5876_v2-0-0.sav d8 age_education
age_education ZA5877_v2-0-0.sav d8 age_education
age_education ZA5913_v2-0-0.sav d8 age_education
age_education ZA5928_v3-0-0.sav d8 age_education
age_education ZA5929_v3-0-0.sav d8 age_education
age_education ZA5932_v3-0-0.sav d8 age_education
age_education ZA5964_v2-0-0.sav d8 age_education
age_education ZA5998_v2-0-0.sav d8 age_education
age_education ZA6595_v3-0-0.sav d8 age_education
age_education ZA6642_v4-0-0.sav d8 age_education
age_education ZA6643_v4-0-0.sav d8 age_education
age_education ZA6694_v2-0-0.sav d8 age_education
age_education ZA6695_v1-1-0.sav d8 age_education
age_education ZA6697_v1-1-0.sav d8 age_education
age_education ZA6788_v1-5-0.sav d8 age_education
age_education ZA6861_v1-2-0.sav d8 age_education
age_education ZA6863_v1-0-0.sav d8 age_education
age_education ZA6925_v1-0-0.sav d8 age_education
age_education ZA6927_v1-0-0.sav d8 age_education
age_education ZA6928_v1-0-0.sav d8 age_education
age_education ZA6939_v2-0-0.sav d8 age_education
age_education ZA6963_v1-0-0.sav d8 age_education

Print the most common standardized variable labels.

dtm2 %>%
  count(var_label_std) %>%
  arrange(desc(n)) %>% 
  head(., 25) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )
var_label_std n
generation 132
internet_frequency 74
generation_age 63
age_education 38
age_exact 38
community_type 38
gender 38
household-composition_aged 38
household-composition_aged_gt 38
household-composition_aged_lithuania 38
occupation_respondent 38
occupation_respondent_last-job 38
paying_bills_difficulties 38
status_marital 38
internet_frequency_home 37
occupation_respondent_c14_scale 36
life_satisfaction 35
political-discussion_matters_european 35
political-discussion_matters_local 35
political-discussion_matters_national 35
nationality_denmark 34
left-right_placement 32
country_voice_counts 31
cat_assessment_social-class 30
eu_voice_counts 30

Application: political attitudes

Filter on substrings of standardized variable labels var_label_std, and count the repetitions across datasets.

# filter on var_label_std
dtm2 %>%
  filter(grepl("trust|satisf|left-right|interest|class", var_label_std)) %>% 
  count(var_label_std) %>%
  arrange(desc(n)) %>%
  head(., 25) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )
var_label_std n
life_satisfaction 35
left-right_placement 32
cat_assessment_social-class 30
country_democracy_satisfaction 18
democracy_satisfaction_european-union 18
trust_european-union 16
trust_national-government 16
trust_national-parliament 16
trust_political-parties 16
trust_regional-local-authorities 15
trust_european-central-bank 14
trust_european-commission 14
trust_european-parliament 14
trust_united-nations 13
turkish-cypriot-community_trust_european-union 10
turkish-cypriot-community_trust_united-nations 10
trust_army 9
trust_police 9
turkish-cypriot-community_trust_political-parties 9
assessment_social-class 8
trust_public_administration 7
internet_trust 6
trust_radio 6
trust_television 6
trust_social_online_networks 5