SPSS structures survey information in a peculiar way that can be modelled in R with the `labelled` variable class. Each variable is described by a short variable name (code) and a long variable name (label). Furthermore, categorical variables have a numeric code and a label, for example, `3 = few`.
The aim is to convert this data file, without loss of information, into a flat, less structured, tabular format. This is achieved by creating two files, a metadata file and a data file. The data file is a flat and unambiguous translation of the SPSS file, while the metadata file contains all the information needed to make the process reproducible (which variable was converted to what) and to support other unambiguous conversions.
In programmatic data analysis, the problem with SPSS variable names is that they are almost never useful in program code. The short codes are often entirely meaningless, and in large files with 700-800 variables keeping track of them is very difficult. The short codes sometimes, and the long names almost always, contain special characters, including regular expression metacharacters, that cannot be used in programming.
The function `variable_name_suggest` suggests a machine-readable name derived from the long SPSS variable label. For example, `ARCHIVE STUDY NUMBER – DISTRIBUTOR` is converted to `archive_study_number_distributor`.
SPSS structured files are most faithfully represented by the `labelled` class created for this purpose in the haven package. Further statistical tools for working with them directly are available in the labelled package. The problem with the `labelled` class is that it is not a standard object class, and to use the wide range of R statistical packages, the data should be converted to base R atomic types.
A faithful representation of the SPSS file is of course only a first step, because SPSS files can be best analyzed in SPSS. To take advantage of R, they need to be converted into a data representation that can be used directly with base R and its package extensions.
The package translates unlabelled, clearly numerical SPSS variables into `numeric`, and other, unrecognized variables into `factor`, which is a faithful representation of the SPSS data. However, it is not necessarily useful for harmonization. New, intermediary classes are introduced for typical variables, for example, categorical variables with 3 positive, 4 positive, or 1 negative, 1 neutral and 1 positive values. These classes contain the numeric, factor and character representations of the SPSS data.
The numeric representation is usually very useful for data harmonization, because it contains no natural language issues such as spelling, casing, or character set problems. It also allows the fast execution of many statistical operations. For harmonization currently only the numeric representation is used, but later this will be further developed for the more nuanced factor representation.
The numeric representation is prone to erroneous interpretation, because it makes mainly ordinal categorical variables appear to be nominal variables. I suggest converting numerical representations with `as.factor()` in modelling [and not using the `as_factor()` method described below].
Currently all intermediary questionnaire objects, which cover about 75% of the questions in the SPSS data files of the last decade, have an `as_numeric` method that correctly and consistently converts the SPSS categorical values to a numeric format. For example, `yes`, `agree`, `approve` consistently get the value `1`, while `no`, `disagree`, `not agree`, `disapprove`, `not approve` get the value `0`. Beware that the native `as.numeric` conversion will return the original raw codes of the SPSS file without the labels, which are almost arbitrary numbers, so do not use it.
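library(eurobarometer)
# Convert raw answers to the intermediary positive-negative class
survey_answers <- eurobarometer::as_factor_pos_neg(
  c("Better", "DK", "Worse",
    "Same", "The Same", "Inap. not")
)
# Harmonized numeric codes: 1 positive, 0 neutral, -1 negative
as_numeric(survey_answers)
#> [1] 1 NA -1 0 0 NA
# Harmonized factor levels
as_factor(survey_answers)
#> [1] positive <NA> negative neutral neutral <NA>
#> Levels: negative neutral positive
# Lower-case character representation of the original answers
as_character(survey_answers)
#> [1] "better" "dk" "worse" "same" "the same" "inap. not"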
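The idea behind the harmonized numeric conversion can be imitated in base R with a small lookup table. This is only a minimal sketch, not the package's implementation: `pos_neg_lookup` and `to_numeric_pos_neg()` are hypothetical names, and the lookup holds only the labels used in the example above.

```r
# Hypothetical lookup: harmonized (lower-case) label -> numeric code.
pos_neg_lookup <- c("better" = 1, "worse" = -1, "same" = 0, "the same" = 0)

# Normalize case and whitespace, then look the label up.
# Labels that are not in the lookup (e.g. "DK") become NA.
to_numeric_pos_neg <- function(x) {
  unname(pos_neg_lookup[tolower(trimws(x))])
}

answers <- c("Better", "DK", "Worse", "Same", "The Same", "Inap. not")
to_numeric_pos_neg(answers)
# values: 1 NA -1 0 0 NA
```

Because the lookup is keyed on normalized labels, spelling and casing variants only need to be added to the table once.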
The explanations of the other two representations are somewhat ‘under the hood’: they are intended for future developers of the package, or future creators of harmonized files such as trend files, and can be ignored by researchers.
The factor representation is more adequate for the actual data analysis, and it is probably the only practical representation for descriptive statistics. The problem with factors is that they need two levels of harmonization: harmonization of the ordering, and harmonization of the labels (levels in R). A comprehensive vocabulary is currently being built for this purpose with the package. The vocabulary currently covers about 75% of the categorical questionnaire items from the last decade. For a detailed harmonization, more use cases and discussions with GESIS and users would be preferred, because factor levels are not natural-language independent.
The `as_factor()` method converts the intermediary questionnaire item objects into a factor representation in R. As a method, it takes into account whether a categorical variable has, for example, 3, 4 or 5 levels, i.e. it offers different variations of the generic `as.factor` for different types of questions. At this stage, harmonization stops at the most frequently used categories and covers the harmonization and, if necessary, the correction of the ordering. About a quarter of the questions, those used rarely or only once, are not harmonized.
A current, dirty factor representation can be achieved with the combination of the native type conversion and the harmonized numeric method: `as.factor(as_numeric('foo'))`. For example, this will make sure that all harmonized categories with values `yes`, `approve`, `agree`, `support` get the label `1` and all `no`, `disapprove`, `not approve`, `disagree`, `not support` get the label `0`. This is also a convenient way to integrate, for example, English and Slovak questionnaire items, where different character encodings and multiple values would make integration directly via factor levels (labels in SPSS) rather difficult.
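The cross-language integration described here can be sketched in base R. Everything below is illustrative: `yes_no_lookup` and `to_numeric_yes_no()` are hypothetical names, and the Slovak labels (`áno` for yes, `nie` for no) are examples chosen for this sketch, not entries taken from the package vocabulary.

```r
# Hypothetical lookup shared by both languages.
# "\u00e1no" is the Slovak word for "yes" written with a Unicode escape.
yes_no_lookup <- setNames(
  c(1, 0, 1, 0),
  c("yes", "no", "\u00e1no", "nie")
)

to_numeric_yes_no <- function(x) {
  unname(yes_no_lookup[tolower(trimws(x))])
}

english <- c("Yes", "No", "No")
slovak  <- c("\u00e1no", "nie", "\u00e1no")

# Integrate via the numeric codes, then convert with the native as.factor().
combined <- as.factor(c(to_numeric_yes_no(english), to_numeric_yes_no(slovak)))
levels(combined)
```

The resulting factor has only the language-independent levels `0` and `1`, so no natural-language label has to be matched across the two files.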
The character representation in R is a simplification of the factor representation. While it has little use in harmonizing, processing and analyzing the data, it is often very practical in documentation and visualization.
The `as_character()` method provides an English character representation of the individual answers without special characters. Full harmonization at this level will be the last step when a comprehensive vocabulary is available. Currently harmonization partly avoids different abbreviations and special characters, but it is not consistent at the character string level.
R has a special object for a missing value, `NA`, which explicitly states that a value is missing, and not just not read. Missingness can have many sources, for example, the question was not asked of a person (such as certain questions in the Turkish Cypriot community in Eurobarometer surveys) or the person declined to answer.
GESIS usually uses some abbreviation of `Inappropriate` for the first one, and `Decline` or `DK` for the other. Recall the earlier output:
library(eurobarometer)
survey_answers <- eurobarometer::as_factor_pos_neg(
  c("Better", "DK", "Worse", "Same", "The Same", "Inap. not")
)
as_numeric(survey_answers)
#> [1] 1 NA -1 0 0 NA
as_factor(survey_answers)
#> [1] positive <NA> negative neutral neutral <NA>
#> Levels: negative neutral positive
as_character(survey_answers)
#> [1] "better" "dk" "worse" "same" "the same" "inap. not"
summary(as_factor(survey_answers))
#> negative neutral positive NA's
#> 1 2 1 2
summary(as_numeric(survey_answers))
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> -1.00 -0.25 0.00 0.00 0.25 1.00 2
In the numeric representation every kind of missingness is given the `NA` value. `Decline` and `DK` are treated similarly, but differently from `Inap.` or `Inappropriate`. The factor harmonization is currently not resolved, and all missingness is represented by `NA`, so if you want to examine the reason for missingness, you can create dirty factors with `as.factor(as_character(var_name))`.
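One way to keep the reason for missingness next to the harmonized values is a separate character vector. The following is a minimal base R sketch under stated assumptions: the two regular expressions are hypothetical and only cover the labels mentioned in the text (`Inap.`, `Decline`, `DK`); real GESIS labels vary between files.

```r
answers <- c("Better", "DK", "Worse", "Same", "The Same", "Inap. not")
label <- tolower(answers)

# Start with "no reason recorded" for every answer.
missing_reason <- rep(NA_character_, length(label))

# Hypothetical patterns for the two sources of missingness.
missing_reason[grepl("^inap", label)] <- "inappropriate"
missing_reason[grepl("^(dk|decline)", label)] <- "declined"

# Count how often each reason occurs; NA means a substantive answer.
table(missing_reason, useNA = "ifany")
```

The numeric or factor representation can then carry plain `NA` values while `missing_reason` preserves why each value is missing.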
In my view, sometimes the `inap.` label has typos that should be corrected in the GESIS files. A comprehensive list will be provided when the vocabulary harmonization is more developed, with more users and consultations with GESIS.
“There are only two hard things in Computer Science: cache invalidation and naming things.”
- Phil Karlton
When working with GESIS files, the webcat and the thoughtful labelling make working with a single survey easy. The data can be analyzed with the help of the generic (English and French) survey questionnaire. However, when working with multiple files, it becomes apparent that the GESIS files are not coded consistently.
Sometimes the short variable name is the questionnaire item. For example, the generic questionnaire identifies a question as `QA11`, which is the 11th question in the question block `A`. Or `QB6_3` is the 3rd item of the 6th structured question in the question block `B`. Furthermore, TNS uses some consistent questionnaire item IDs for repeating items, such as demography questions. However, GESIS sometimes uses these IDs in the SPSS file, sometimes not, and sometimes only in the demography block. When possible, `get_eb_questionnaire_item()` retrieves this information and records it in the metadata file. However, recovering this information (and a programmatic connection between the questionnaire PDF files and the SPSS or flat files) is not always possible, because the questionnaire ID is often not present in the SPSS file.
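The structure of such questionnaire item IDs can be taken apart with regular expressions. This is a minimal sketch, not the logic of `get_eb_questionnaire_item()`: the pattern (a `Q`, a block letter, a question number and an optional `_item` suffix) is an assumption based only on the two examples above, and `parse_item_id()` is a hypothetical name.

```r
# Hypothetical parser for IDs shaped like "QA11" or "QB6_3".
parse_item_id <- function(ids) {
  has_item <- grepl("_[0-9]+$", ids)
  data.frame(
    id       = ids,
    block    = sub("^Q([A-Z]).*$", "\\1", ids),
    question = as.integer(sub("^Q[A-Z]([0-9]+).*$", "\\1", ids)),
    # The item number exists only for structured questions.
    item     = as.integer(ifelse(has_item, sub("^.*_([0-9]+)$", "\\1", ids), NA)),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_item_id(c("QA11", "QB6_3"))
parsed
```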
Sometimes there is a `split` in the data file, because certain questions were not asked, or were asked differently, outside of the EU, for example in the accession countries, or very often in the special territory of the Turkish community of Cyprus. There is also a consistent naming of weights. Whenever possible, this information is taken into consideration in the creation of harmonized long variable names.
We do not deal with these issues, because the short variable names are not consistent and generally we only record them in the metadata file but do not use them.
In some cases, the questionnaire ID is added to the long variable label, for example, `QA1 xx`. This clearly disturbs the variable name harmonization. In this case a regex removes the questionnaire ID and records it in the metadata, as is done with such items when they are found in the short variable names. They are easier to detect in the long variable labels, and this conversion works very well. The harmonized variable name is derived from the rest of the label, after removing the questionnaire ID and the subsequent space or underscore.
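The removal of a leading questionnaire ID can be sketched with base R regular expressions. The pattern and the labels below are assumptions made for illustration (the labels are invented, and `id_pattern` is not the regex used by the package); the sketch only shows the two outputs described above, the recorded ID and the remaining label.

```r
# Invented example labels; the last one has no questionnaire ID.
labels <- c("QA1 LIFE SATISFACTION", "QB6_3_TRUST IN MEDIA", "AGE EXACT")

# Hypothetical pattern: ID at the start, followed by spaces or underscores.
id_pattern <- "^(Q[A-Z][0-9]+(_[0-9]+)?)[ _]+"

has_id <- grepl(id_pattern, labels)

# Record the ID for the metadata, NA where no ID is present.
questionnaire_id <- ifelse(
  has_id,
  sub(paste0(id_pattern, ".*$"), "\\1", labels),
  NA_character_
)

# The rest of the label is the basis of the harmonized variable name.
clean_label <- sub(id_pattern, "", labels)

data.frame(labels, questionnaire_id, clean_label, stringsAsFactors = FALSE)
```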
Because we want to create variable names that can be used programmatically, we remove all special characters. In most cases this means the removal of a hyphen. The `%` sign is changed to `_pct`, the `+` sign to `_p` and the `-` sign to `_m`. For example, `AGED 15+` becomes `aged_15p`.
You can review all current changes in the source code of the `var_name_suggest()` function.
Unnecessary whitespace, such as an accidental double space, is removed from the long variable labels before creating the new variable name.
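The cleaning steps described above can be approximated in a few lines of base R. This is a hedged sketch and not the package's function: `suggest_name_sketch()` is a hypothetical name, the `+` replacement follows the worked example (`aged_15p`) rather than the `_p` rule, and the `-` to `_m` rule is left out because the text gives no example of it.

```r
# Hypothetical helper; NOT the package's own name suggestion function.
suggest_name_sketch <- function(label) {
  x <- tolower(trimws(label))
  x <- gsub("%", " pct ", x, fixed = TRUE)  # "%" becomes a "pct" token
  x <- gsub("+", "p", x, fixed = TRUE)      # follows the "aged_15p" example
  x <- gsub("[^a-z0-9]+", "_", x)           # special characters and repeated spaces
  gsub("^_+|_+$", "", x)                    # drop leading and trailing underscores
}

suggest_name_sketch("AGED 15+")
suggest_name_sketch("ARCHIVE STUDY  NUMBER \u2013 DISTRIBUTOR")
```

The second call contains an accidental double space and a dash, and both collapse into single underscores.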
This is the most problematic part, and in my view, such errors should be corrected in the GESIS SPSS files whenever they are not fully archived. These errors can be detected only with hard work; they almost always require human judgement, and often a lot of comparison across large SPSS files. They are rare and easy to correct, and no programmatic solution is offered for them. However, very soon almost all of them will be detected, at least in the archive since the Mannheim Trend Files.
Technically speaking this is as problematic as the previous problem, but only in the context of integrating multiple files or creating trend files. These alternative names will be collected and added as a data file to the R package. Some frequently used name variations in weights and demography have already been discovered and harmonized.
The current vocabulary can be found in the GitHub repository of the package as `data-raw/Vocabulary.xlsx`, which is converted into the vocabulary file by `data-raw/create_vocabulary_rda.R`. The author welcomes all suggestions and improvements to the vocabulary file. You can access the vocabulary in R directly with `eurobarometer::vocabulary.R`. In the context of 4 non-negative categorical variables you can review the current vocabulary with `vocabulary_items_get(context = "factor_4")`.
name | context | neg_2 | neg_1 | neutral | pos_1 | pos_2 | pos_3 | 36 | 12 | 3 | 0 | 6 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
attached_4_1 | factor_4 | NA | NA | Not at all attached | Not very attached | Fairly attached | Very attached | NA | NA | NA | NA | NA |
confident_4_1 | factor_4 | NA | NA | Not at all confident | Not very confident | Fairly confident | Very confident | NA | NA | NA | NA | NA |
frequency_4_1 | factor_4 | NA | NA | Never | Rarely | From time to time | Often | NA | NA | NA | NA | NA |
good_4_1 | factor_4 | NA | NA | Bad | Average | Fair | Excellent | NA | NA | NA | NA | NA |
important_4_1 | factor_4 | NA | NA | Not at all important | Not very important | Fairly important | Very important | NA | NA | NA | NA | NA |
important_4_2 | factor_4 | NA | NA | Not at all important | Not very important | Important | Very important | NA | NA | NA | NA | NA |
These factors have only positive values, or rather, meanings, which are coded as `0`, `1`, `2`, `3`.
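The `0` to `3` coding can be reproduced in base R from one row of the table. This is a minimal sketch under the assumption that the row's labels are stored in ascending order; it uses the `important_4_1` row shown above and does not call the package.

```r
# Labels of the important_4_1 row, in ascending order.
important_4_1 <- c("Not at all important", "Not very important",
                   "Fairly important", "Very important")

answers <- c("Very important", "Not very important", "Fairly important", "DK")

# Position in the vocabulary row minus one gives the codes 0, 1, 2, 3;
# answers outside the row (here "DK") become NA.
codes <- match(answers, important_4_1) - 1L
codes

# The same information as an ordered factor.
ordered_answers <- factor(answers, levels = important_4_1, ordered = TRUE)
```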
name | context | neg_2 | neg_1 | neutral | pos_1 | pos_2 | pos_3 | 36 | 12 | 3 | 0 | 6 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
amount_3_1 | factor_pos_neg | NA | Doing too much | Doing about the right amount | Not doing enough | NA | NA | NA | NA | NA | NA | NA |
amount_3_2 | factor_pos_neg | NA | Too much | About the right amount | Not enough | NA | NA | NA | NA | NA | NA | NA |
applies_3_1 | factor_pos_neg | NA | Applies fairly badly | Neither | Applies fairly well | NA | NA | NA | NA | NA | NA | NA |
better_3_1 | factor_pos_neg | NA | Worse | Same | Better | NA | NA | NA | NA | NA | NA | NA |
change_3_1 | factor_pos_neg | NA | Less important | No change / As it is now | More important | NA | NA | NA | NA | NA | NA | NA |
direction_3_1 | factor_pos_neg | NA | Wrong direction | Neither | Right direction | NA | NA | NA | NA | NA | NA | NA |
These factors are coded as `-1`, `0`, `1`. This coding is meaningful with logical operations, for example.
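A short base R illustration of what the `-1`, `0`, `1` coding allows. The vector repeats the numeric codes from the `survey_answers` example; the summary measures (share of positive answers, net balance) are only examples of such operations, not functions of the package.

```r
# Numeric codes of: Better, DK, Worse, Same, The Same, Inap. not
x <- c(1, NA, -1, 0, 0, NA)

# Logical tests follow directly from the sign of the code.
is_positive <- x > 0
is_negative <- x < 0

# Share of positive answers among the valid answers.
share_positive <- mean(is_positive, na.rm = TRUE)

# Net balance: positive minus negative answers, per valid answer.
net_balance <- mean(x, na.rm = TRUE)
```

With one positive, one negative and two neutral answers, `share_positive` is 0.25 and `net_balance` is 0.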
The next obstacle to integrating data files is that identical questionnaire answer options are coded differently. For example, women respondents are labelled as `woman` or `female`. The solution here is the creation of a vocabulary file that harmonizes identical answers across files.
This is by far the most common obstacle, and it can be resolved with the creation of the vocabulary file. For example, in the `type of community` demography questions the answer option `Small or medium-sized town` is sometimes coded as is, and sometimes as `Small/middle town`. This variable can be converted into a character or a factor type. Character vectors have no ordinal values, i.e. they are not ordered, while factors carry an ordering of their levels. Soon there will be a common factor and character representation for these variables, but currently a different route is taken: they are unambiguously converted to numbers. Numerical conversion has several advantages. For example, it allows easy integration with non-GESIS data files that use different natural languages or character sets. It can also facilitate the harmonization of character and factor values.
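How a vocabulary table resolves such label variants can be sketched with two small data frames standing in for two survey files. All names and numeric codes here are hypothetical (`community_vocabulary`, `harmonize_community()`, and the codes 2 and 3); only the two town labels come from the text.

```r
# Hypothetical vocabulary: label variants mapped to one numeric code.
community_vocabulary <- data.frame(
  label = c("small or medium-sized town", "small/middle town", "large town"),
  code  = c(2, 2, 3),
  stringsAsFactors = FALSE
)

harmonize_community <- function(x) {
  community_vocabulary$code[match(tolower(trimws(x)), community_vocabulary$label)]
}

# Two invented "files" that label the same answer option differently.
file_a <- data.frame(type_of_community = c("Small or medium-sized town", "Large town"),
                     stringsAsFactors = FALSE)
file_b <- data.frame(type_of_community = "Small/middle town",
                     stringsAsFactors = FALSE)

combined <- rbind(file_a, file_b)
combined$community_code <- harmonize_community(combined$type_of_community)
combined
```

After the conversion, the two spellings of the town category share one code and the files can be analyzed together.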
These issues require judgement, and eventually may lead to different interpretations. The vocabulary in this respect was rather cautiously harmonized.
The current Eurobarometer questionnaires have two generic versions, English and French. It appears that sometimes the true master language is English, sometimes French, and in the translation of the questionnaire item synonyms are used interchangeably. I believe that in the future the company making the surveys should try to avoid these divergences, but in many cases, they are hardly problematic, because the generic questions are translated to other natural languages, anyway.
Consider the following categorical ranges:
`Not at all informed`, `Not very informed`, `Fairly well informed`, `Very well informed`
`Not at all informed`, `Not very well informed`, `Fairly well informed`, `Very well informed`
In my view these items are identical, and they are most likely translated identically to other languages. What is important is that `Not very informed` and `Not very well informed` are consistently placed in the ordering as 2nd (ascending) or 3rd (descending). The package uses the ascending sorting, which, if necessary, can easily be reversed programmatically.
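Reversing an ascending coding takes one line in base R. The sketch below assumes the `0` to `3` coding used for four-level items and uses the first of the two ranges above as labels; the variable names are illustrative.

```r
informed <- c("Not at all informed", "Not very informed",
              "Fairly well informed", "Very well informed")

x <- c(0, 1, 3, NA, 2)  # ascending numeric codes

# Descending numeric codes: subtract from the highest code.
x_descending <- (length(informed) - 1) - x

# Ascending ordered factor, then the same answers with reversed level order.
f_ascending  <- factor(x, levels = 0:3, labels = informed, ordered = TRUE)
f_descending <- factor(f_ascending, levels = rev(levels(f_ascending)), ordered = TRUE)

levels(f_descending)
```

The answers themselves are unchanged; only the direction of the ordering differs.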
The ultimate aim of the package is to create panel data or trend files, i.e. files that contain the same survey parts for all countries at different points in time. With some questions, this is already possible with the package, and some trend files will be published soon.
Considerable judgement is needed to decide when two questions are identical, given the two original source languages and the more than 20 language versions. The package can help with this in the following ways:
Creation of combined metadata files and highlighting the repeating questions, which are obvious in the case of questions created for trending purposes and demography, but not obvious in other repeated questions. The metadata files created by `gesis_file_read()` and `analyze_file_metadata()` with its helper functions solve this problem. While question identification can be further developed, it also raises the question of variable name harmonization, where the inputs of GESIS and other users would be helpful to create an optimal solution.
Creation of the comprehensive vocabulary of answer items. `class_conversion_suggest()` in fact helps the creation of the vocabulary because it analyzes the content of variables in individual SPSS files to classify them into a common type, for example, the `factor_binomial` class, which contains all the repeating binary questionnaire items, such as agree-disagree, female-male, for-against. Finding new patterns and aligning them with the current classes `factor_3`, `factor_4`, `factor_5`, `factor_pos_neg`, `factor_pos_neg_4`, `factor_frequency`, and the almost numerical and date formats helps the development of the vocabulary.
Easy-to-join, harmonized, flat data files, which are, after all, what this package creates. I believe that in this respect the package already contributes a lot to research efforts, because it appears to be working with few problems, and at the moment with no recognizable errors. But the first two problems would require consultations with other users and GESIS.
Currently the package correctly harmonizes about 75% of the last decade's questions, but still struggles with identifying identical questions. The best approach is to release the current version of the package, and allow other researchers to contribute to the vocabulary, and to discuss naming conventions.
The vocabulary should be centralized, and frequently redistributed in new versions of the package, so any suggestions are welcome. I created an Excel version of the vocabulary so that it is easy to make suggestions. This part of the package is very easy to maintain, so in the first year the vocabulary file can be updated every few weeks.
Variable naming would need a consensus of users, and some consultation with GESIS. The metadata files created by this package can greatly facilitate the correction of small errors, and the adoption of more comprehensive practices that will eventually lead to easier to use SPSS and STATA files, too.
The programmatic longitudinal integration of the data should be based on the date of interview variable. This is important metadata, because the time difference between repeated questions can be measured by the typical response days. Another problem is that on rare occasions certain country survey waves are, for whatever practical reason, held in a different month than the bulk of the Eurobarometer survey.
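The use of the interview date can be sketched in base R. This is only an illustration under stated assumptions: the data frame, its column names and all dates are invented, and the median interview date is used here as one possible reading of the "typical response day" of a wave.

```r
# Invented interview dates for two survey waves.
interviews <- data.frame(
  wave = c("wave_1", "wave_1", "wave_1", "wave_2", "wave_2", "wave_2"),
  date_of_interview = as.Date(c("2016-05-21", "2016-05-23", "2016-05-28",
                                "2016-11-03", "2016-11-05", "2016-11-09")),
  stringsAsFactors = FALSE
)

# Typical (median) interview day per wave, as days since 1970-01-01.
typical_num <- sapply(
  split(as.numeric(interviews$date_of_interview), interviews$wave),
  median
)
typical_day <- as.Date(typical_num, origin = "1970-01-01")

# Time difference between the two waves in days.
gap_days <- typical_num[["wave_2"]] - typical_num[["wave_1"]]

# Flag interviews held in a different month than the wave's typical day.
typical_month <- format(typical_day[interviews$wave], "%Y-%m")
off_month <- format(interviews$date_of_interview, "%Y-%m") != typical_month
```

A wave that is partly held in another month shows up as `TRUE` values in `off_month`.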
As far as I see, some SPSS files do not have a date variable at all (or they are coded very differently from most files.) In my view, this may be a file conversion error that happened in the stage of creating the GESIS archive files, and, if the data is available, must be corrected in all cases.
My original motivation is to integrate Eurobarometer surveys with national surveys via standardized questions. International comparison just takes the comparability issues a step further than those faced by researchers of the Eurobarometer archives. Some questions have remained similar but were slightly altered over the five decades of surveying, and translations introduced other deviations.
I am very much interested in any contribution to this issue, in the creation of multi-language vocabularies and question data banks.
One serious problem now is that, although all national language questionnaires are available, they are raw PDF files, and compiling them programmatically appears to be a very great task. But eventually the creation of a standardized translation vocabulary would help the creation of new comparable surveys, data products and analytical work. I have experience with using the standardized questions of the European Cultural Access and Participation surveys, and I will include some examples in a separate vignette.
While the Eurobarometer surveys have a very well developed and useful demography that allows international comparisons, the sub-national level of information varies country-by-country, partly following local standards, and partly due to the different sizes of countries.
Technically it is possible to create integration via urbanization, or at the sub-national level with some limits, similarly to the alternative demographical organization of cohorts, which allows meaningful comparisons over time.
It would be very interesting to find collaborators in this field, because the creation of sub-national statistics is not extremely challenging from a programmatic point of view, but requires a thorough understanding of the statistical regions and administrative systems of various countries.
An interesting application could be the creation of trend files for larger urban areas, such as London, Paris, Berlin or Budapest, that are sufficiently big to have large subsamples in the national samples.