Harmonizing Eurobarometer files

Daniel Antal, CFA

2018-05-02

Import a GESIS SPSS File

Convert SPSS files to a less structured file type

SPSS structures survey information in a peculiar way that can be modelled in R with the labelled variable class. Each variable is described by a short variable name (code) and a long variable name (label). Furthermore, categorical variables have a numeric code and a label, for example, 3 = few.

The aim is to concert this data file without loss of information into a flat, less structured, tabular format. This is achieved with the creation of two files, a metadata file and a data file. The data files is a flat and unambiguous translation of the SPSS file, while the metadata file contains all information to make the process reproducible (which variable was converted to what) and to provide information for other, unambiguous conversions.

Convert variable names to machine readable names

When working in a programmatic data analysis, the problem of the SPSS variable names is that they are almost never useful in program code. The short codes are often fully meaningless, and in the large files with 700-800 variables names keeping track of them is very difficult. Sometimes the short codes, and the long names almost always contain special characters and regular expressions that cannot be used in programming.

The function variable_name_suggest is suggesting a machine-readable name that is derived from the long SPSS variable name label. For example, the ARCHIVE STUDY NUMBER – DISTRIBUTOR is converted to archive_study_number_distributor.

Representation of data in R

The SPSS structured filed are most faithfully represented by the labelled class created for this purpose in the haven package. Further statistical tools are available to work with them directly in the labelled package. The problem with the labelled class is that it is not a standard object class, and for using the wide arrange of R statistical packages, the data should be converted to base R atomic types.

A faithful representation of the SPSS file is of course only a first step, because SPSS files can be best analyzed in SPSS. For taking advantage of R, they need to be converted into a data representation that can be directly used with base R and its package extensions.

The package translates SPSS unlabeled, clear numerical variables into numeric, and unrecognized other variables into factor, which is a faithful representation of the SPSS data. However, it is not necessarily useful for harmonization. New, intermediary classes are introduced for typical variables, for example, categorical variables with 3 positive, 4 positive, or 1 negative, 1 neutral and 1 positive values. These classes contain the numeric, factor and character representation of the SPSS data.

The numeric representation is usually very useful for data harmonization, because it contains no natural language issues such as spelling, casing, or character set problems. It also allows the fast execution of many statistical operations. For harmonization currently only the numeric representation is used, but later this will be further developed for the more nuanced factor representation.

The numeric representation is prone to erroneous interpretation, because it makes mainly ordinal categorical variables appear to be nominal variables. I suggest to convert numerical representations with as.factor() in modelling [and not using the as_factor () method described below.]

Currently all intermediary questionnaire objects, which cover about 75% of the questions in the SPSS data files of the last decade have an as_numeric method that correctly, consistently converts the SPSS categorical values to a numeric format. For example, yes, agree, approve consistently gets an 1 value while no, disagree, not agree, disapprove, not approve a 0 value. Beware that the native as.numeric conversion will represent the original raw data of the SPSS file without the labels, which an almost random number, so do not use it.

require (eurobarometer)
survey_answers <- eurobarometer::as_factor_pos_neg(
    c("Better", "DK", "Worse",
      "Same", "The Same", "Inap. not")
      )

as_numeric   (survey_answers)
#> [1]  1 NA -1  0  0 NA
as_factor    (survey_answers)
#> [1] positive <NA>     negative neutral  neutral  <NA>    
#> Levels: negative neutral positive
as_character (survey_answers)
#> [1] "better"    "dk"        "worse"     "same"      "the same"  "inap. not"

The other two representations are a bit ‘under the hood’ explanations and are intented for future developers of the package, or future creators of harmonized files, such as trend files, and can be ignored by researchers.

Factor representation

The factor representation is more adequate for the actual data analysis, and it is probably the only practical representation for descriptive statistics. The problem with factors is that they need two levels of harmonization: harmonization of the ordering, and harmonization of the labels (levels in R). Currently a comprehensive vocabulary is being built for this purpose with the package. Currently the vocabulary covers about 75% of or categorical questionnaire items from the last decade. For a detailed harmonization more use cases and discussions with GESIS and users would be preferred, because factor levels are not natural-language independent.

The as_factor () method convert the intermediary questionnaire item objects into a factor representation in R. As a method, it takes into consideration of 3, 4 or 5 level categorical variables, for example, i.e. it has offers different variations of the generic as.factor for different type of questions. At this stage, harmonization stops at the most frequently used categories, and the harmonization, if necessary, correction of the ordering. About a quarter of rarely, or only once used questions are not harmonized.

A current, dirty factor representation can be achieved with the combination of the native type conversion and the harmonized numeric method: as.factor(as_numeric('foo')). For example, this will make sure that all harmonized categories with values yes, approve, agree, support will get a label 1 and all no, disapprove, not approve, disagree, not support will get a label 0. This is also a convenient way to integrate, for example, English and Slovak questionnaire items, where different character coding and multiple values would make integration rather difficult directly via factor levels (labels in SPSS).

Character representation

The character representation in R is a simplification of the factor representation. While it has little use in harmonizing, processing and analyzing the data, it is often very practical in documentation and visualization.
The as_character () method provides an English character representation of the individual answers without special characters. Full harmonization at this level will be the last step when a comprehensive vocabulary is available. Currently harmonization partly avoids different abbreviations and special characters, but it is not consistent at the character string level.

Representing missing values

R has a special object for a missing value, NA, which explicitly states that a value is missing, and not just not read. Missingness can have many sources, for example, the question was not asked from a person (such as certain questions in the Turkish Cypriot community in Eurobarometer surveys) or the persona declined to answer.

GESIS usually uses some abbreviation of Inappropriate for the first one, and Decline or DK for the other. Recall the earlier output:

require (eurobarometer)
survey_answers <- eurobarometer::as_factor_pos_neg(
    c("Better", "DK", "Worse", "Same", "The Same", "Inap. not")
    )
as_numeric   (survey_answers)
#> [1]  1 NA -1  0  0 NA
as_factor    (survey_answers)
#> [1] positive <NA>     negative neutral  neutral  <NA>    
#> Levels: negative neutral positive
as_character (survey_answers)
#> [1] "better"    "dk"        "worse"     "same"      "the same"  "inap. not"

summary(as_factor(survey_answers))
#> negative  neutral positive     NA's 
#>        1        2        1        2
summary(as_numeric(survey_answers))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>   -1.00   -0.25    0.00    0.00    0.25    1.00       2

The factor harmonization currently is not resoldved, and all missingness is represented by NA, so if you want to examine the reason of missingness, you can create dirty factors with as.character(as_factor(var_name)).

In my view, sometimes the inap. label has typos that should be corrected in the GESIS files. A comprehensive list will be provided when the vocabulary harmonization is more developed, with more users and consultations with GESIS.

Harmonizing variable names across files

“There are only two hard things in Computer Science: cache invalidation and naming things.”

—————- Phil Karlton

When working with GESIS files, the webcat and the thoughtful labelling makes working with a single survey. The data can be analyzed with the help of the generic (English and French) survey questionnaire. However, when working with multiple files, it become apparent that the GESIS files are not coded consistently.

Short names are questionnaire IDs

Sometimes the short variable name is the questionnaire item. For example, the generic questionnaire identifies a question as QA11, which is the 11the question in the question block A. Or QB6_3 is the 3rd item of the 6th structured in the question block B. Furthermore, TNS uses some consistent questionnaire item IDs for repeating items, such a demography questions. However GESIS sometimes uses these IDs in the SPSS file, sometimes not, and sometime only in the demography block. When possible, get_eb_questionnaire_item() retrieves this information and records it in the metadata file. However, this information (and a programmatic connection between the questionnaire PDF files and the SPSS or flat files) is not possible because often the questionnaire ID is not present in the SPSS file.

Short names contain important metadata

Sometimes there is a split in the data file, because certain questions were not asked, or differently asked outside of the EU, for example in the accession countries, or very often in the special territory of the Turkish community of Cyprus. There is also a consistent naming of weights. Whenever possible, this information is taken into consideration in the creation of harmonized long variable names.

Short names contain special characters or typos

We do not deal with these issues, because the short variable names are not consistent and generally we only record them in the metadata file but do not use them.

Long names contain the questionnaire ID

In some cases, the questionnaire ID is added to the long variable label, for example, QA1 xx. This is clearly disturbing the variable name harmonization. In this case a regex removes the questionnaire ID and records it in the metadata, similarly to such items if they are found in the short variable names. They are easier to detect in the long variable labels, and this conversion works very well. The harmonized variable name is derived from the rest of the label, after removing the questionnaire ID and the subsequent space or underscore.

Long names contain the special characters

Because we want to create variable names that can be used programmatically, we remove all special characters. In most of the cases this means the removal of a hypen. The % sign is changed to _pct, the + sign to _p and the - sign to _m. For example, AGED 15+ becomes aged_15p.

You can review all current changes in the source code of the var_name_suggest() function.

Long name contains whitespace

Unnecessary whitespace, such as accidental double space is removed from the long variables labels before creating the new variable name.

Long name contains typos and other errors

This is the most problematic part, and in my view, such errors should be corrected, whenever not fully archived, in the GESIS SPSS files. These errors can be detected with hard work and they almost always require human judgement, and often a lot of comparison across large SPSS files. These are rare and easy-to-correct errors, and there is no programmatic solution offered to them. However, very soon almost all of them will be detected at least in the archive of since the Mannheim Trend Files.

Harmonizing identical questionnaire items

Identical questionnaire item has several long names

Technically speaking this is as problematic as the previous problem, but only in the context of integrated multiple files or creating trend files. These alternative naming will be collected and added as a data file to the R package. Some often used name variations in weights and demography are already discovered and harmonized.

The current vocabulary can be found in the github repository of the package as data-raw/Vocabulary.xlsx which is converted into the vocabulary file by data-raw/create_vocabulary_rda.R The author welcomes all suggestions and improvements to the vocabulary file. You can access the vocabulary in R directly with eurobarometer::vocabulary.R. In the context of 4 non-negative categorical variables you can review the current vocabulary with vocabulary_items_get( context = "factor_4").

name context neg_2 neg_1 neutral pos_1 pos_2 pos_3 36 12 3 0 6
attached_4_1 factor_4 NA NA Not at all attached Not very attached Fairly attached Very attached NA NA NA NA NA
confident_4_1 factor_4 NA NA Not at all confident Not very confident Fairly confident Very confident NA NA NA NA NA
frequency_4_1 factor_4 NA NA Never Rarely From time to time Often NA NA NA NA NA
good_4_1 factor_4 NA NA Bad Average Fair Excellent NA NA NA NA NA
important_4_1 factor_4 NA NA Not at all important Not very important Fairly important Very important NA NA NA NA NA
important_4_2 factor_4 NA NA Not at all important Not very important Important Very important NA NA NA NA NA

These factors have only postive values, or rather, meanings, which are coded as 0, 1, 2, 3.

name context neg_2 neg_1 neutral pos_1 pos_2 pos_3 36 12 3 0 6
amount_3_1 factor_pos_neg NA Doing too much Doing about the right amount Not doing enough NA NA NA NA NA NA NA
amount_3_2 factor_pos_neg NA Too much About the right amount Not enough NA NA NA NA NA NA NA
applies_3_1 factor_pos_neg NA Applies fairly badly Neither Applies fairly well NA NA NA NA NA NA NA
better_3_1 factor_pos_neg NA Worse Same Better NA NA NA NA NA NA NA
change_3_1 factor_pos_neg NA Less important No change / As it is now More important NA NA NA NA NA NA NA
direction_3_1 factor_pos_neg NA Wrong direction Neither Right direction NA NA NA NA NA NA NA

These factors are coded as -1, 0, 1. This coding is meaningful with logical operations, for example.

Creating a vocabulary of questionnaire items

The next obstacle of integrating data files is that identical questionnaire answer options are coded differently. For example, women respondents are labeled as woman or female. The solution here is the creation of a vocabulary file that harmonizes identical answers across files.

Same answer options are abbreviated differently

This is far the most common obstacle, and it can be resolved with the creation of the vocabulary file. For example, in the type of community demography questions the answer option Small or medium-sized town is sometimes coded as is, and sometimes as Small/middle town. This variable can be converted into a character or a factor type. Character vectors have no ordinal values, i.e. they are not ordered, and factors are ordered. Soon there will be a common factor and character representation for these variables, but currently a different route is taken, they are unambiguously converted to numbers. Numerical conversion has several advantages. For example, it allows easy integration with non-GESIS data files that use different natural languages or character sets. They can also facilitate the harmonization of character and factor values.

Different answer options have the same meaning

These issues require judgement, and eventually may lead to different interpretations. The vocabulary in this respect was rather cautiously harmonized.

The current Eurobarometer questionnaires have two generic versions, English and French. It appears that sometimes the true master language is English, sometimes French, and in the translation of the questionnaire item synonyms are used interchangeably. I believe that in the future the company making the surveys should try to avoid these divergences, but in many cases, they are hardly problematic, because the generic questions are translated to other natural languages, anyway.

Consider the following categorical ranges

  1. Not at all informed, Not very informed, Fairly well informed, Very well informed

  2. Not at all informed, Not very well informed, Fairly well informed, Very well informed

In my view these items are identical, and they are most likely translated identically to other languages. What is important that Not very informed and Not very well informed are consistently placed in the ordering of 2nd (ascending) or 3rd (descending). The package uses the ascending sorting, which is, if necessary, can be easily reverted programatically.

Creating trend files

The ultimate aim of the package is to create panel data or trend files, i.e. files that contain the same survey parts for all countries in different points of time. With some questions, this is already possible with the package, and some trend files will be published soon.

Considerable judgement is needed until when two questions are identical, given the two original source languages and the more than 20 language versions. The package can help this in two ways:

  1. Creation of combined metadata files and highlighting the repeating questions, which are obvious in the case of questions created for trending purposes and demography, but not obvious in other repeated questions. The metadata files created by gesis_file_read()and analyze_file_metadata() with its helper functions solve this problem. While question identification can be further developed, it also raises the question of variable name harmonization, where the inputs of GESIS and other users would be helpful to create an optimal solution.

  2. Creation of the comprehensive vocabulary of answer items. class_conversion_suggest() in fact helps the creation fo the vocabulary because it analyzes the content of variables in individual SPSS files to classify them to a common type, for example, the factor_binomial class, which contains all the repeating binary questionnaire items, such as agree-disagree, female-male, for-against. Finding new patterns and aligning them with the current classes factor_3, factor_4, factor_5, factor_pos_neg, factor_pos_neg_4, factor_frequency, and the almost numerical and date formats help the development of the vocabulary.

  3. Easy-to-join, harmonized, flat data files, which are after or what this package creates. I believe that in this respect the package already contributes a lot research efforts, because it appears to be working with few problems, and at the moment no recognizable errors. But the first two problems would require consultations with other users and GESIS.

Currently the package correctly harmonizes about 75% of the last decades questions, but still struggles with identifying identical questions. The best approach is to realize the current version of the package, and allow other researchers to contribute to the vocabulary, and to discuss naming conventions.

The vocabulary should be centralized, and frequently redistributed in new versions of the package, so any suggestions are welcome. I created an Excel version of the vocabulary so that it is easy to make suggestions. This part of the package is very easy to maintain, so in the first year the vocabulary file can be updated in every few weeks.

Variable naming would need a consensus of users, and some consultation with GESIS. The metadata files created by this package can greatly facilitate the correction of small errors, and the adoption of more comprehensive practices that will eventually lead to easier to use SPSS and STATA files, too.

Longitudinal integration issues

The programmatic longitudinal integration of the data should be based on the date of interview variable. This is an important metadata, because the time difference between repeated questions can be measured by the typical response days. Another problem that in rare occasions certain country survey waves for whatever practical reason are held in a different month than the bulk of the Eurobarometer survey.

As far as I see, some SPSS files do not have a date variable at all (or they are coded very differently from most files.) In my view, this may be a file conversion error that happened in the stage of creating the GESIS archive files, and, if the data is available, must be corrected in all cases.

Comparison of languages

My original motivation is to integrate Eurobarometer surveys via standardized questions with national surveys. International comparison just raises issues with comparability a step further than researchers of the Eurobarometer archives. Some questions had been similar, but slightly altered over the five decades of surveying, and translations offered other deviations.

I am very much interested in any contribution to this issue, in the creation of multi -language vocabularies and question data banks.

One serious problem now is although all national language questionnaires are available, they are raw pdf files, and any compilation programmatically appears to be a very great task. But eventually the creation of a standardized translation vocabulary would help the creation of new comparable surveys, data products and analytical work. I have experience with using the standardized questions of the European Cultural Access and Participation surveys, and I will include some examples in a separate vignette.

Sub-national aggregation

While the Eurobarometer surveys have a very well developed and useful demography that allows international comparisons, the sub-national level of information varies country-by-country, partly following local standards, and partly due to the different sizes of countries.

Technically it possible to create integration via urbanization, or at sub-national level with some limits, similarly to the alternative demographical organization of cohorts, that allow meaningful time-wise comparisons.

It would be very interesting to find collaborators in this field, because the creation of sub-national statistics is not extremely challenging from a programmatic point of view, but requires a thorough understanding of the statistical regions and administrative systems of various countries.

An interesting applicaton could be the creation of trend files for larger urban areas, such as London, Paris, Berlin or Budapest that are significantly big to have large subsamples in the national samples.