library(eurobarometer)
library(tidyverse)
library(tibble)
library(knitr)
library(kableExtra)
library(tidytext)
library(janitor)
library(stringi)
An important aspect of survey data harmonization - in addition to standardizing response coding and value labels - is the identification of variables that measure the same thing, and standardizing the names of these variables. The process is faciliated by the fact that variables typically have labels that succintly describe the content of the variable, and sometimes also contain information about the coding of responses (e.g., length of scale), the unit (e.g., age in years or categories) and whether the variable contains “original” respondent responses or some transformation of responses (e.g. recoding or index).
Fortunately, in many multi-wave surveys, such as the Eurobarometer, there is a degree of consistency in how variables are labelled in different editions of the survey, and most survey projects try to follow some labelling conventions.
Unfortunately, consistency is limited and conventions are not followed strictly, which makes standardizing labels a semi-manual task. Some of these issues stem from the limits in the number of characters of variable labels in common statistical software packages (e.g., SPSS), which leads to non-standard abbreviations.
This vignette outlines a procedure for arriving at standardized variable names via tokenization of normalized variable labels (var_label_norm
), which are created by label_normalize
as part of gesis_metadata_create
. The logic is similar to text analysis, in that a document-term matrix is created with each variable label treated as a (very short) document, and terms treated as keywords. variables that contain the same sets of keywords can be - with some margin of error - assumed to correspond to the same or equivalent survey questions.
Read survey and variable metadata, match variables to variable groups as defined by GESIS-ZACAT, filter out technical and protocol variables,
Standardize the spelling of certain often abbreviated words (e.g., ‘gov’, ‘govnmt’, and ‘govmnt’ for ‘government’),
Tokenize,
Eliminate stop words (but be careful to not eliminate too many, e.g. ‘right’ or ‘working’),
Create a document-term matrix (actually, variable label - keyword matrix),
Sort the columns according to the declining frequency of the keywords,
Create standardized variable labels by concatenating keywords.
The table contains metadata from 97 Eurobarometer files (each file corresponding to one EB round). The metadata are extracted from SPSS files with gesis_metadata_create
.
Mapping of GESIS Archive IDs (as used in file names) to wave numbers, from EB/44.2bis (ZA2828) to EB/92.1 (ZA7579) - 156 EB waves altogether.
eb_names <- rio::import("../data-raw/eb_waves_filenames.xlsx") %>%
separate(x, c("archive_id", "x2", "title"), sep = ":") %>%
separate(x2, c("wave", "dates"), sep = " \\(|\\)") %>%
mutate(wave = gsub("Eurobarometer | ", "", wave))
eb_names %>%
select(archive_id, wave, dates) %>%
head() %>%
kable %>%
kable_styling(bootstrap_options =
c("striped", "hover", "condensed"),
fixed_thead = T,
font_size = 10 )
archive_id | wave | dates |
---|---|---|
ZA7579 | 92.1 | 2019 |
ZA7576 | 91.5 | June-July 2019 |
ZA7575 | 91.4 | 2019 |
ZA7572 | 91.3 | 2019 |
ZA7562 | 91.2 | 2019 |
ZA7561 | 91.1 | February-March 2019 |
Mapping of variable labels to variable groups, as defined in GESIS ZACAT, for 51 waves from EB 79.1 (ZA5687) through EB 92.1 (ZA7579).
varlabs <- rio::import("../data-raw/eb_variable_groups_zacat.xlsx", sheet = 4) %>%
mutate(wave_desc = ifelse(substr(var_label_zacat, 1, 13) == "Eurobarometer", var_label_zacat, NA),
var_label_zacat = ifelse(substr(var_label_zacat, 1, 13)=="Eurobarometer", NA, var_label_zacat),
# group names are writte in camel case and != toupper(group names)
group = ifelse(var_label_zacat != toupper(var_label_zacat), var_label_zacat, NA),
# exceptions from the rule identifying group names
group = ifelse(substr(group, 1, 9) == "OWNERSHIP", NA, group),
group = ifelse(substr(group, 1, 19) == "POSITIVE EU RESULTS", NA, group),
group = ifelse(substr(group, 1, 24) == "OPINION LEADERSHIP INDEX", NA, group),
var_label_zacat = ifelse(!is.na(group), NA, var_label_zacat)) %>%
fill(wave_desc, group) %>%
drop_na(group, var_label_zacat) %>%
mutate(wave = substr(wave_desc, 1, 18),
wave = gsub("^Eurobarometer ", "", wave),
var_label_zacat = gsub("^W[0-9]{1,2}[A-Z] |^W3A/W4A ", "", var_label_zacat)) %>%
left_join(eb_names) %>%
select(wave, archive_id, group, var_label_zacat, everything())
varlabs %>%
select(archive_id, wave, group, var_label_zacat) %>%
head() %>%
kable %>%
kable_styling(bootstrap_options =
c("striped", "hover", "condensed"),
fixed_thead = T,
font_size = 10 )
archive_id | wave | group | var_label_zacat |
---|---|---|---|
ZA7579 | 92.1 | Archive and survey ID variables | ARCHIVE STUDY NUMBER - DISTRIBUTOR |
ZA7579 | 92.1 | Archive and survey ID variables | ARCHIVE STUDY NUMBER - PUBLISHER |
ZA7579 | 92.1 | Archive and survey ID variables | DIGITAL OBJECT IDENTIFIER |
ZA7579 | 92.1 | Archive and survey ID variables | GESIS ARCHIVE VERSION & DATE |
ZA7579 | 92.1 | Archive and survey ID variables | DATASET EDITION |
ZA7579 | 92.1 | Archive and survey ID variables | SURVEY IDENTIFICATION |
All procedures below applied to 51 waves, for which all metadata are available.
dict <- rio::import("../data-raw/var_lab_dict.xlsx",
sheet = "dict") %>%
gather(id, from, 2:6) %>%
select(from, to) %>%
drop_na(from)
head(dict) %>%
kable %>%
kable_styling(bootstrap_options =
c("striped", "hover", "condensed"),
fixed_thead = T,
font_size = 10 )
from | to |
---|---|
^cap_ | common-agricultural-policy_ |
cap | common-agricultural-policy |
united_kingdom | united-kingdom |
czech_republic | czechia |
great_britain | great-britain |
united_germany | united-germany |
# this part now became very fast with stri_replace_all_regex
dat2 <- metadata_database %>%
# filter only waves for which variable groups are available
filter(archive_id %in% archive_ids,
# exclude this strange variable
var_label_orig != "filename") %>%
# keep one row per variable
distinct(filename, qb, var_name_orig, var_name_orig, var_label_orig, var_label_norm, archive_id) %>%
# original variable labels in the data and in GESIS ZACAT are not exactly the same
# --> need some cleaning
mutate(var_label_orig2 = gsub("^W[0-9]{1,2}", "", var_label_orig),
var_label_orig2 = gsub("^W[0-9]{1,2}[A-Z]", "", var_label_orig),
var_label_orig2 = gsub("^[0-9]{1-3} ", "", var_label_orig2),
var_label_orig2 = gsub("^[A-Z][0-9]{1,2}[A-Z] ", "", var_label_orig2),
var_label_orig2 = gsub("^[A-Z]{1,2}[0-9]{1,2} ", "", var_label_orig2),
var_label_orig2 = gsub("^[A-Z]{1,2}[0-9]{1,2}[A-Z] ", "", var_label_orig2),
var_label_orig2 = gsub("^[0-9]{1,2} ", "", var_label_orig2),
var_label_orig2 = gsub("^[A-Z][0-9] [0-9]{1,2} ", "", var_label_orig2),
var_label_orig2 = gsub("^W3A/W4A ", "", var_label_orig2)) %>%
# join with table with variable groups
left_join(varlabs, by = c("var_label_orig2" = "var_label_zacat", "archive_id")) %>%
# remove variables in the following groups
filter(!group %in% c("Archive and survey ID variables",
"Country group variables",
"Nationality (Q1)",
"Protocol variables (P)",
"Standard nation ID variables",
"Weight variables (W)",
"Nation ID variables",
"Protocol variables",
"Weight variables",
"Archive and ID variables",
"Household Weights"),
# remove recoded variables
!grepl("recoded|(rec)", var_label_norm)) %>%
mutate(var_id = paste0(filename, " ", var_name_orig),
# leftover question numbers at the beginning of value labels
var_label_norm1 = gsub("^w[0-9]{1,2}", "", var_label_norm),
var_label_norm1 = gsub("^[0-9]{1-3}_", "", var_label_norm1),
var_label_norm1 = gsub("^[a-z][0-9]{2}[a-z]_", "", var_label_norm1),
var_label_norm1 = gsub("^[a-z]{1,2}[0-9]{1,2}_", "", var_label_norm1),
var_label_norm1 = gsub("^[a-z]{1,2}[0-9]{1,2}[a-z]_", "", var_label_norm1),
var_label_norm1 = gsub("^[0-9]{1,2}_", "", var_label_norm1),
var_label_norm1 = gsub("^[a-z]_", "", var_label_norm1),
var_label_norm1 = gsub("^[a-z][0-9]_[0-9]{1,2}_", "", var_label_norm1)) %>%
# apply the translation table
mutate(var_label_norm1 = stri_replace_all_regex(var_label_norm1,
pattern = dict$from,
replacement = dict$to,
vectorize_all = FALSE)) %>%
# replace underscore with space
mutate(text = gsub("_", " ", var_label_norm1)) %>%
ungroup() %>%
# select the needed variables
select(filename, qb, var_label_orig, var_label_norm, var_label_norm1, text, var_id, group)
# list of country codes to be converted to country names
eu_countries <- c(
"at", "be", "bg", "cy", "cz", "dk", "de", "ee", "es", "fr",
"fi", "gb", "gr", "hr", "hu", "ie", "it", "lt", "lv", "lu",
"mt", "nl", "pl", "pt", "ro", "se", "si", "sk", "tr")
# !!! the problem here is that dk sometimes means "don't know" and sometimes Demnark !!!
# this needs to be treated in gesis_metadata_create, where the
# user defined missings are separately analyzed, they dk should
# be converted there to do_not_know.
# it's not clear hot to do this, because both types of DK appear at the end of labels e.g. in
# mentioned/not mentioned questions, and they look very much the same
# ---- but normally one should be in attr(x, "na_values")
tidy <- dat2 %>%
# tokenize
unnest_tokens(word, text,
token = "regex", pattern = " | ") %>%
# eliminate stop words, with some exceptions
anti_join(stop_words %>%
filter(!word %in% c("working", "right", "other")
)) %>%
# eliminate numbers
filter(!grepl("\\b\\d+\\b", word)) %>%
# convert country codes to country names
mutate(
country = ifelse(word %in% eu_countries, word, NA),
country = countrycode::countrycode(
toupper(country), "iso2c", "country.name"),
country = tolower(country),
word = ifelse(!is.na(country), country, word),
word = ifelse(word == "united kingdom", "united-kingdom", word),
word = ifelse(word == "tcc", "turkish-cypriot-community", word)
) %>%
select(-country)
tidy %>%
select(var_id, var_label_norm1, word) %>%
head() %>%
kable %>%
kable_styling(bootstrap_options =
c("striped", "hover", "condensed"),
fixed_thead = T,
font_size = 10 )
var_id | var_label_norm1 | word |
---|---|---|
ZA5687_v3-0-0.sav d11 | age_exact | age |
ZA5687_v3-0-0.sav d11 | age_exact | exact |
ZA5687_v3-0-0.sav qb9c_9 | bribe_amount_in_euro_building_permits | bribe |
ZA5687_v3-0-0.sav qb9c_9 | bribe_amount_in_euro_building_permits | amount |
ZA5687_v3-0-0.sav qb9c_9 | bribe_amount_in_euro_building_permits | euro |
ZA5687_v3-0-0.sav qb9c_9 | bribe_amount_in_euro_building_permits | building |
tidy %>%
count(word) %>%
arrange(desc(n)) %>%
head(., 10) %>%
kable %>%
kable_styling(bootstrap_options =
c("striped", "hover", "condensed"),
fixed_thead = T,
font_size = 10 )
word | n |
---|---|
eu | 2934 |
sum | 920 |
spontaneous | 783 |
turkish-cypriot-community | 703 |
denmark | 700 |
2nd | 697 |
other | 617 |
nationality | 558 |
country | 530 |
internet | 530 |
Note: this Denmark is probably a “don’t know”
Filter out labels that have words that appear in < 5 different waves. If we want to create a trend file, we’re only interested in variables that are repeated often.
tidy_small_2 <- tidy %>%
group_by(word) %>%
# number of distinct file names (waves) per word
mutate(n_wave = n_distinct(filename)) %>%
# flag words that appear in < 5 different files
mutate(to_drop = as.numeric(n_wave < 10)) %>%
# group by variable ID (file x variable name)
group_by(var_id) %>%
# copy to_drop across words in the same variable label
mutate(to_drop = max(to_drop, na.rm = TRUE)) %>%
# drop variable labels with the to_drop identifier
filter(to_drop != 1) %>%
# remove auxiliary variables
select(-to_drop, -n_wave) %>%
ungroup()
dtm <- tidy_small_2 %>%
distinct(var_id, var_label_norm1, word) %>%
mutate(n = word) %>%
spread(word, n)
dim(dtm)
#> [1] 6632 429
Standardized variable labels var_label_std
are concatenated keywords, where the keywords occur in the order of their frequency in the entire corpus of variable labels. As a consequence, more common - likely broader - terms will come first, and less common - more specific - terms will come later.
dtm2 <- dtm %>%
select( names(sort(order)) ) %>%
mutate_at(vars(3:ncol(dtm)),
funs(ifelse(is.na(.), "", .))
) %>%
unite("var_label_std", 3:ncol(dtm)) %>%
mutate(var_label_std = gsub(
"(\\_)\\1+", "\\1", var_label_std) ) %>%
mutate(var_label_std = gsub(
"^_|_$", "", var_label_std) )
dim(dtm2)
#> [1] 6632 3
These standardized variable names still have problems that need to be fixed.
dtm2 %>%
head(., 25) %>%
kable %>%
kable_styling(bootstrap_options =
c("striped", "hover", "condensed"),
fixed_thead = T,
font_size = 10 )
var_label_norm1 | var_id | var_label_std |
---|---|---|
age_education | ZA5687_v3-0-0.sav d8 | age_education |
age_education | ZA5688_v6-0-0.sav d8 | age_education |
age_education | ZA5689_v2-0-0.sav d8 | age_education |
age_education | ZA5876_v2-0-0.sav d8 | age_education |
age_education | ZA5877_v2-0-0.sav d8 | age_education |
age_education | ZA5913_v2-0-0.sav d8 | age_education |
age_education | ZA5928_v3-0-0.sav d8 | age_education |
age_education | ZA5929_v3-0-0.sav d8 | age_education |
age_education | ZA5932_v3-0-0.sav d8 | age_education |
age_education | ZA5964_v2-0-0.sav d8 | age_education |
age_education | ZA5998_v2-0-0.sav d8 | age_education |
age_education | ZA6595_v3-0-0.sav d8 | age_education |
age_education | ZA6642_v4-0-0.sav d8 | age_education |
age_education | ZA6643_v4-0-0.sav d8 | age_education |
age_education | ZA6694_v2-0-0.sav d8 | age_education |
age_education | ZA6695_v1-1-0.sav d8 | age_education |
age_education | ZA6697_v1-1-0.sav d8 | age_education |
age_education | ZA6788_v1-5-0.sav d8 | age_education |
age_education | ZA6861_v1-2-0.sav d8 | age_education |
age_education | ZA6863_v1-0-0.sav d8 | age_education |
age_education | ZA6925_v1-0-0.sav d8 | age_education |
age_education | ZA6927_v1-0-0.sav d8 | age_education |
age_education | ZA6928_v1-0-0.sav d8 | age_education |
age_education | ZA6939_v2-0-0.sav d8 | age_education |
age_education | ZA6963_v1-0-0.sav d8 | age_education |
Print the most common standardized variable labels.
dtm2 %>%
count(var_label_std) %>%
arrange(desc(n)) %>%
head(., 25) %>%
kable %>%
kable_styling(bootstrap_options =
c("striped", "hover", "condensed"),
fixed_thead = T,
font_size = 10 )
var_label_std | n |
---|---|
generation | 132 |
internet_frequency | 74 |
generation_age | 63 |
age_education | 38 |
age_exact | 38 |
community_type | 38 |
gender | 38 |
household-composition_aged | 38 |
household-composition_aged_gt | 38 |
household-composition_aged_lithuania | 38 |
occupation_respondent | 38 |
occupation_respondent_last-job | 38 |
paying_bills_difficulties | 38 |
status_marital | 38 |
internet_frequency_home | 37 |
occupation_respondent_c14_scale | 36 |
life_satisfaction | 35 |
political-discussion_matters_european | 35 |
political-discussion_matters_local | 35 |
political-discussion_matters_national | 35 |
nationality_denmark | 34 |
left-right_placement | 32 |
country_voice_counts | 31 |
cat_assessment_social-class | 30 |
eu_voice_counts | 30 |
Filter on substrings of standardized variable labels var_label_std
, and count the repetitions across datasets.
# filter on var_label_std
dtm2 %>%
filter(grepl("trust|satisf|left-right|interest|class", var_label_std)) %>%
count(var_label_std) %>%
arrange(desc(n)) %>%
head(., 25) %>%
kable %>%
kable_styling(bootstrap_options =
c("striped", "hover", "condensed"),
fixed_thead = T,
font_size = 10 )
var_label_std | n |
---|---|
life_satisfaction | 35 |
left-right_placement | 32 |
cat_assessment_social-class | 30 |
country_democracy_satisfaction | 18 |
democracy_satisfaction_european-union | 18 |
trust_european-union | 16 |
trust_national-government | 16 |
trust_national-parliament | 16 |
trust_political-parties | 16 |
trust_regional-local-authorities | 15 |
trust_european-central-bank | 14 |
trust_european-commission | 14 |
trust_european-parliament | 14 |
trust_united-nations | 13 |
turkish-cypriot-community_trust_european-union | 10 |
turkish-cypriot-community_trust_united-nations | 10 |
trust_army | 9 |
trust_police | 9 |
turkish-cypriot-community_trust_political-parties | 9 |
assessment_social-class | 8 |
trust_public_administration | 7 |
internet_trust | 6 |
trust_radio | 6 |
trust_television | 6 |
trust_social_online_networks | 5 |