Tagging variables with keywords

An important aspect of survey data harmonization - in addition to standardizing response coding and value labels - is the identification of variables that measure the same thing, and standardizing the names of these variables. The process is faciliated by the fact that variables typically have labels that succintly describe the content of the variable, and sometimes also contain information about the coding of responses (e.g., length of scale), the unit (e.g., age in years or categories) and whether the variable contains “original” respondent responses or some transformation of responses (e.g. recoding or index).

Fortunately, in many multi-wave surveys, such as the Eurobarometer, there is a degree of consistency in how variables are labelled in different editions of the survey, and most survey projects try to follow some labelling conventions.

Unfortunately, consistency is limited and conventions are not followed strictly, which makes standardizing labels a semi-manual task. Some of these issues stem from the limits in the number of characters of variable labels in common statistical software packages (e.g., SPSS), which leads to non-standard abbreviations.

This vignette outlines a procedure for arriving at standardized variable names via tokenization of normalized variable labels (var_label_norm), which are created by label_normalize as part of gesis_metadata_create. The logic is similar to text analysis, in that a document-term matrix is created with each variable label treated as a (very short) document, and terms treated as keywords. variables that contain the same sets of keywords can be - with some margin of error - assumed to correspond to the same or equivalent survey questions.

Procedure

Read survey and variable metadata, match variables to variable groups as defined by GESIS-ZACAT, filter out technical and protocol variables,
Standardize the spelling of certain often abbreviated words (e.g., ‘gov’, ‘govnmt’, and ‘govmnt’ for ‘government’),
Tokenize,
Eliminate stop words (but be careful to not eliminate too many, e.g. ‘right’ or ‘working’),
Create a document-term matrix (actually, variable label - keyword matrix),
Sort the columns according to the declining frequency of the keywords,
Create standardized variable labels by concatenating keywords.

0. Survey and variable metadata

Variable metadata

The table contains metadata from 97 Eurobarometer files (each file corresponding to one EB round). The metadata are extracted from SPSS files with gesis_metadata_create.

metadata_database <- readRDS( file.path('../data-raw', 'eb_metadata_database_20200628.rds')) %>%
  mutate(archive_id = substr(filename, 1, 6))

Waves - Archive IDs

Mapping of GESIS Archive IDs (as used in file names) to wave numbers, from EB/44.2bis (ZA2828) to EB/92.1 (ZA7579) - 156 EB waves altogether.

eb_names <- rio::import("../data-raw/eb_waves_filenames.xlsx") %>%
  separate(x, c("archive_id", "x2", "title"), sep = ":") %>%
  separate(x2, c("wave", "dates"), sep = " \\(|\\)") %>%
  mutate(wave = gsub("Eurobarometer | ", "", wave))

eb_names %>%
  select(archive_id, wave, dates) %>%
  head() %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

archive_id	wave	dates
ZA7579	92.1	2019
ZA7576	91.5	June-July 2019
ZA7575	91.4	2019
ZA7572	91.3	2019
ZA7562	91.2	2019
ZA7561	91.1	February-March 2019

Variable groups

Mapping of variable labels to variable groups, as defined in GESIS ZACAT, for 51 waves from EB 79.1 (ZA5687) through EB 92.1 (ZA7579).

varlabs <- rio::import("../data-raw/eb_variable_groups_zacat.xlsx", sheet = 4) %>%
  mutate(wave_desc = ifelse(substr(var_label_zacat, 1, 13) == "Eurobarometer", var_label_zacat, NA),
         var_label_zacat = ifelse(substr(var_label_zacat, 1, 13)=="Eurobarometer", NA, var_label_zacat),
         # group names are writte in camel case and != toupper(group names)
         group = ifelse(var_label_zacat != toupper(var_label_zacat), var_label_zacat, NA),
         # exceptions from the rule identifying group names
         group = ifelse(substr(group, 1, 9) == "OWNERSHIP", NA, group), 
         group = ifelse(substr(group, 1, 19) == "POSITIVE EU RESULTS", NA, group),
         group = ifelse(substr(group, 1, 24) == "OPINION LEADERSHIP INDEX", NA, group),
         var_label_zacat = ifelse(!is.na(group), NA, var_label_zacat)) %>%
  fill(wave_desc, group) %>%
  drop_na(group, var_label_zacat) %>%
  mutate(wave = substr(wave_desc, 1, 18),
         wave = gsub("^Eurobarometer ", "", wave),
         var_label_zacat = gsub("^W[0-9]{1,2}[A-Z] |^W3A/W4A ", "", var_label_zacat)) %>%
  left_join(eb_names) %>%
  select(wave, archive_id, group, var_label_zacat, everything())

varlabs %>%
  select(archive_id, wave, group, var_label_zacat) %>%
  head() %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

archive_id	wave	group	var_label_zacat
ZA7579	92.1	Archive and survey ID variables	ARCHIVE STUDY NUMBER - DISTRIBUTOR
ZA7579	92.1	Archive and survey ID variables	ARCHIVE STUDY NUMBER - PUBLISHER
ZA7579	92.1	Archive and survey ID variables	DIGITAL OBJECT IDENTIFIER
ZA7579	92.1	Archive and survey ID variables	GESIS ARCHIVE VERSION & DATE
ZA7579	92.1	Archive and survey ID variables	DATASET EDITION
ZA7579	92.1	Archive and survey ID variables	SURVEY IDENTIFICATION

# vector of archive IDs
archive_ids <- varlabs %>% distinct(archive_id) %>% pull(archive_id)

length(archive_ids)
#> [1] 51

1. Standardize the spelling

All procedures below applied to 51 waves, for which all metadata are available.

1.1. Translation table …

dict <- rio::import("../data-raw/var_lab_dict.xlsx", 
                    sheet = "dict") %>%
  gather(id, from, 2:6) %>%
  select(from, to) %>%
  drop_na(from)

head(dict) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

from	to
^cap_	common-agricultural-policy_
cap	common-agricultural-policy
united_kingdom	united-kingdom
czech_republic	czechia
great_britain	great-britain
united_germany	united-germany

1.2. … applied to metadata

# this part now became very fast with stri_replace_all_regex

dat2 <- metadata_database %>%
  # filter only waves for which variable groups are available
  filter(archive_id %in% archive_ids,
         # exclude this strange variable
         var_label_orig != "filename") %>%
  # keep one row per variable
  distinct(filename, qb, var_name_orig, var_name_orig, var_label_orig, var_label_norm, archive_id) %>%
  # original variable labels in the data and in GESIS ZACAT are not exactly the same 
  # --> need some cleaning
  mutate(var_label_orig2 = gsub("^W[0-9]{1,2}", "", var_label_orig),
         var_label_orig2 = gsub("^W[0-9]{1,2}[A-Z]", "", var_label_orig),
         var_label_orig2 = gsub("^[0-9]{1-3} ", "", var_label_orig2),
         var_label_orig2 = gsub("^[A-Z][0-9]{1,2}[A-Z] ", "", var_label_orig2),
         var_label_orig2 = gsub("^[A-Z]{1,2}[0-9]{1,2} ", "", var_label_orig2),
         var_label_orig2 = gsub("^[A-Z]{1,2}[0-9]{1,2}[A-Z] ", "", var_label_orig2),
         var_label_orig2 = gsub("^[0-9]{1,2} ", "", var_label_orig2),
         var_label_orig2 = gsub("^[A-Z][0-9] [0-9]{1,2} ", "", var_label_orig2),
         var_label_orig2 = gsub("^W3A/W4A ", "", var_label_orig2)) %>%
  # join with table with variable groups
  left_join(varlabs, by = c("var_label_orig2" = "var_label_zacat", "archive_id")) %>%
  # remove variables in the following groups
  filter(!group %in% c("Archive and survey ID variables",
                       "Country group variables",
                       "Nationality (Q1)",
                       "Protocol variables (P)",
                       "Standard nation ID variables",
                       "Weight variables (W)",
                       "Nation ID variables",
                       "Protocol variables",
                       "Weight variables",
                       "Archive and ID variables",
                       "Household Weights"),
         # remove recoded variables
         !grepl("recoded|(rec)", var_label_norm)) %>%
  mutate(var_id = paste0(filename, " ", var_name_orig),
         # leftover question numbers at the beginning of value labels
         var_label_norm1 = gsub("^w[0-9]{1,2}", "", var_label_norm),
         var_label_norm1 = gsub("^[0-9]{1-3}_", "", var_label_norm1),
         var_label_norm1 = gsub("^[a-z][0-9]{2}[a-z]_", "", var_label_norm1),
         var_label_norm1 = gsub("^[a-z]{1,2}[0-9]{1,2}_", "", var_label_norm1),
         var_label_norm1 = gsub("^[a-z]{1,2}[0-9]{1,2}[a-z]_", "", var_label_norm1),
         var_label_norm1 = gsub("^[0-9]{1,2}_", "", var_label_norm1),
         var_label_norm1 = gsub("^[a-z]_", "", var_label_norm1),
         var_label_norm1 = gsub("^[a-z][0-9]_[0-9]{1,2}_", "", var_label_norm1)) %>%
  # apply the translation table
  mutate(var_label_norm1 = stri_replace_all_regex(var_label_norm1,
                                                  pattern = dict$from,
                                                  replacement = dict$to,
                                                  vectorize_all = FALSE)) %>%
  # replace underscore with space
  mutate(text = gsub("_", " ", var_label_norm1)) %>%
  ungroup() %>%
  # select the needed variables
  select(filename, qb, var_label_orig, var_label_norm, var_label_norm1, text, var_id, group)

2 & 3. Tokenize & Eliminate stopwords

# list of country codes to be converted to country names
eu_countries <- c(
  "at", "be", "bg", "cy", "cz", "dk", "de", "ee", "es", "fr",
  "fi", "gb", "gr", "hr", "hu", "ie", "it", "lt", "lv", "lu",
  "mt", "nl", "pl", "pt", "ro", "se", "si", "sk", "tr")

# !!! the problem here is that dk sometimes means "don't know" and sometimes Demnark !!!
# this needs to be treated in gesis_metadata_create, where the
# user defined missings are separately analyzed, they dk should 
# be converted there to do_not_know.

# it's not clear hot to do this, because both types of DK appear at the end of labels e.g. in
# mentioned/not mentioned questions, and they look very much the same

# ---- but normally one should be in attr(x, "na_values") 

tidy <- dat2 %>%
  # tokenize
  unnest_tokens(word, text, 
                token = "regex", pattern = " |  ") %>%
  # eliminate stop words, with some exceptions
  anti_join(stop_words %>%
              filter(!word %in% c("working", "right", "other")
                     )) %>%
  # eliminate numbers
  filter(!grepl("\\b\\d+\\b", word)) %>%
  # convert country codes to country names
  mutate(
    country = ifelse(word %in% eu_countries, word, NA),
         country = countrycode::countrycode(
           toupper(country), "iso2c", "country.name"),
    country = tolower(country),
    word = ifelse(!is.na(country), country, word),
    word = ifelse(word == "united kingdom", "united-kingdom", word),
    word = ifelse(word == "tcc", "turkish-cypriot-community", word)
    ) %>%
  select(-country)

tidy %>%
  select(var_id, var_label_norm1, word) %>%
  head() %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

var_id	var_label_norm1	word
ZA5687_v3-0-0.sav d11	age_exact	age
ZA5687_v3-0-0.sav d11	age_exact	exact
ZA5687_v3-0-0.sav qb9c_9	bribe_amount_in_euro_building_permits	bribe
ZA5687_v3-0-0.sav qb9c_9	bribe_amount_in_euro_building_permits	amount
ZA5687_v3-0-0.sav qb9c_9	bribe_amount_in_euro_building_permits	euro
ZA5687_v3-0-0.sav qb9c_9	bribe_amount_in_euro_building_permits	building

tidy %>%
  count(word) %>%
  arrange(desc(n)) %>% 
  head(., 10) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

word	n
eu	2934
sum	920
spontaneous	783
turkish-cypriot-community	703
denmark	700
2nd	697
other	617
nationality	558
country	530
internet	530

Note: this Denmark is probably a “don’t know”

4. Create a document-term matrix

Filter out labels that have words that appear in < 5 different waves. If we want to create a trend file, we’re only interested in variables that are repeated often.

tidy_small_2 <- tidy %>%
  group_by(word) %>%
  # number of distinct file names (waves) per word
  mutate(n_wave = n_distinct(filename)) %>%
  # flag words that appear in < 5 different files
  mutate(to_drop = as.numeric(n_wave < 10)) %>%
  # group by variable ID (file x variable name)
  group_by(var_id) %>%
  # copy to_drop across words in the same variable label
  mutate(to_drop = max(to_drop, na.rm = TRUE)) %>%
  # drop variable labels with the to_drop identifier
  filter(to_drop != 1) %>%
  # remove auxiliary variables
  select(-to_drop, -n_wave) %>%
  ungroup()

dtm <- tidy_small_2 %>%
  distinct(var_id, var_label_norm1, word) %>%
  mutate(n = word) %>%
  spread(word, n)

dim(dtm)
#> [1] 6632  429

5. Sort columns according to the declining frequency of the keywords

# get the order of variables according to declining proportion of missing value
# (declining popularity of keywords)
order <- dtm %>%
  summarise_all(funs(sum(is.na(.))/length(.))) %>%
  t() %>% as.vector()

names(order) <- names(dtm)

6. Create standardized variable labels by concatenating keywords

Standardized variable labels var_label_std are concatenated keywords, where the keywords occur in the order of their frequency in the entire corpus of variable labels. As a consequence, more common - likely broader - terms will come first, and less common - more specific - terms will come later.

dtm2 <- dtm %>%
  select( names(sort(order)) ) %>%
  mutate_at(vars(3:ncol(dtm)), 
            funs(ifelse(is.na(.), "", .))
            ) %>%
  unite("var_label_std", 3:ncol(dtm)) %>%
  mutate(var_label_std = gsub(
    "(\\_)\\1+", "\\1", var_label_std) ) %>%
  mutate(var_label_std = gsub(
    "^_|_$", "", var_label_std) )

dim(dtm2)
#> [1] 6632    3

These standardized variable names still have problems that need to be fixed.

dtm2 %>% 
  head(., 25) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

var_label_norm1	var_id	var_label_std
age_education	ZA5687_v3-0-0.sav d8	age_education
age_education	ZA5688_v6-0-0.sav d8	age_education
age_education	ZA5689_v2-0-0.sav d8	age_education
age_education	ZA5876_v2-0-0.sav d8	age_education
age_education	ZA5877_v2-0-0.sav d8	age_education
age_education	ZA5913_v2-0-0.sav d8	age_education
age_education	ZA5928_v3-0-0.sav d8	age_education
age_education	ZA5929_v3-0-0.sav d8	age_education
age_education	ZA5932_v3-0-0.sav d8	age_education
age_education	ZA5964_v2-0-0.sav d8	age_education
age_education	ZA5998_v2-0-0.sav d8	age_education
age_education	ZA6595_v3-0-0.sav d8	age_education
age_education	ZA6642_v4-0-0.sav d8	age_education
age_education	ZA6643_v4-0-0.sav d8	age_education
age_education	ZA6694_v2-0-0.sav d8	age_education
age_education	ZA6695_v1-1-0.sav d8	age_education
age_education	ZA6697_v1-1-0.sav d8	age_education
age_education	ZA6788_v1-5-0.sav d8	age_education
age_education	ZA6861_v1-2-0.sav d8	age_education
age_education	ZA6863_v1-0-0.sav d8	age_education
age_education	ZA6925_v1-0-0.sav d8	age_education
age_education	ZA6927_v1-0-0.sav d8	age_education
age_education	ZA6928_v1-0-0.sav d8	age_education
age_education	ZA6939_v2-0-0.sav d8	age_education
age_education	ZA6963_v1-0-0.sav d8	age_education

Print the most common standardized variable labels.

dtm2 %>%
  count(var_label_std) %>%
  arrange(desc(n)) %>% 
  head(., 25) %>%
  kable %>%
  kable_styling(bootstrap_options =
                  c("striped", "hover", "condensed"),
                  fixed_thead = T,
                  font_size = 10 )

var_label_std	n
generation	132
internet_frequency	74
generation_age	63
age_education	38
age_exact	38
community_type	38
gender	38
household-composition_aged	38
household-composition_aged_gt	38
household-composition_aged_lithuania	38
occupation_respondent	38
occupation_respondent_last-job	38
paying_bills_difficulties	38
status_marital	38
internet_frequency_home	37
occupation_respondent_c14_scale	36
life_satisfaction	35
political-discussion_matters_european	35
political-discussion_matters_local	35
political-discussion_matters_national	35
nationality_denmark	34
left-right_placement	32
country_voice_counts	31
cat_assessment_social-class	30
eu_voice_counts	30

var_label_std	n
life_satisfaction	35
left-right_placement	32
cat_assessment_social-class	30
country_democracy_satisfaction	18
democracy_satisfaction_european-union	18
trust_european-union	16
trust_national-government	16
trust_national-parliament	16
trust_political-parties	16
trust_regional-local-authorities	15
trust_european-central-bank	14
trust_european-commission	14
trust_european-parliament	14
trust_united-nations	13
turkish-cypriot-community_trust_european-union	10
turkish-cypriot-community_trust_united-nations	10
trust_army	9
trust_police	9
turkish-cypriot-community_trust_political-parties	9
assessment_social-class	8
trust_public_administration	7
internet_trust	6
trust_radio	6
trust_television	6
trust_social_online_networks	5