In the last decade, ongoing survey programs have opened up their microdata to researchers and data enthusiasts. Millions of interview records have become available about how people live and how they perceive the world. Combining these data provides insights into the experiences of the world’s societies in the late 20th and early 21st centuries.
Ex post or retrospective harmonization of surveys refers to various data processing steps that combine existing - often incompatible - data collections to create integrated data sets that can be used to generate meaningful statistical indicators or models. Our new package, retroharmonize, provides tools for these special tasks. In a series of blog posts, we introduce the possibilities and challenges of joining survey data, first, from the same survey programs but from different years, and then from different survey programs.
library(retroharmonize)
library(plyr)
library(tidyverse)
library(scales)
Our first example shows how the internet is taking over from television as a news source in Africa. We are going to harmonize answers from the seven rounds of Afrobarometer surveys conducted between 2000 and 2018. Afrobarometer is a cross-national survey project that measures attitudes and opinions about democracy, the economy and society in African countries, based on general population samples.
Provided that you have downloaded the Afrobarometer files from the Afrobarometer website, you can follow our post. We do not include every single step. The original data files are copyrighted by Afrobarometer, and you must use them as instructed on the website. The authors of this post are not affiliated with Afrobarometer.
The read_surveys function imports the data files into a list.
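# 'afrobarometer_dir' is assumed to point to the local folder that contains the
# downloaded .sav files, e.g. afrobarometer_dir <- file.path("data", "afrobarometer")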
# get a vector of file names in the selected directory, with .sav extensions
ab <- dir ( afrobarometer_dir, pattern = ".sav$" )
# reorder the list elements to match wave order:
ab <- ab[c(2:5, 1, 6:7)]
# create path for the file names
afrobarometer_rounds <- file.path(afrobarometer_dir, ab)
# read data files to memory
ab_waves <- read_surveys(afrobarometer_rounds, .f='read_spss')
## add a meaningful id for data
for (i in 1:7) {
attr(ab_waves[[i]], "id") <- paste0("Afrobarometer_R", i)
}
if ( attr(ab_waves[[1]], "id") == "Afrobarometer_R1" ) {
## date is missing from R1, it causes problems if you work with it.
ab_waves[[1]]$dateintr <- as.Date ( "2000-06-30")
attr(ab_waves[[1]]$dateintr, "label") <- "Date of interview"
}
Let’s take a look at key characteristics of the Afrobarometer data. The document_waves function returns a table with the file ID, original file name, the number of columns and rows, and the size of the object in memory. What immediately catches the eye is that Afrobarometer has grown rapidly since its start in 2000. The data file from the first round contains 146 variables and just over 21 thousand records, while the file from the last round contains 367 variables and over 45 thousand records, more than a fivefold increase in data volume.
document_waves(ab_waves)
Now let’s check out the metadata of the seven survey files. The metadata_create function extracts the variable metadata - primarily variable and value labels - from the labelled data files and puts them into a tidy table, where each variable is represented by a single row.
Afrobarometer uses two generic questionnaire languages, English and French. The “merged” data files, which are cross-sectional merges of the national surveys and which we use in this blog post, contain the variable information in English, so comparison with the English-language questionnaires is easier.
ab_metadata <- lapply ( X = ab_waves, FUN = metadata_create )
ab_metadata <- do.call(rbind, ab_metadata)
For example, here is a slice of the metadata with variables on corruption and trust in Afrobarometer Round 1.
slice (ab_metadata, 100:110)
## filename id var_name_orig
## 1 merged_r1_data.sav Afrobarometer_R1 pfpcr1
## 2 merged_r1_data.sav Afrobarometer_R1 pfpcr2
## 3 merged_r1_data.sav Afrobarometer_R1 pfpcr3
## 4 merged_r1_data.sav Afrobarometer_R1 pfpcr4
## 5 merged_r1_data.sav Afrobarometer_R1 pfpcr5
## 6 merged_r1_data.sav Afrobarometer_R1 trspre
## 7 merged_r1_data.sav Afrobarometer_R1 trspol
## 8 merged_r1_data.sav Afrobarometer_R1 trscts
## 9 merged_r1_data.sav Afrobarometer_R1 trsarm
## 10 merged_r1_data.sav Afrobarometer_R1 trsnec
## 11 merged_r1_data.sav Afrobarometer_R1 trsbrd
## class_orig
## 1 retroharmonize_labelled_spss_survey
## 2 retroharmonize_labelled_spss_survey
## 3 retroharmonize_labelled_spss_survey
## 4 retroharmonize_labelled_spss_survey
## 5 retroharmonize_labelled_spss_survey
## 6 retroharmonize_labelled_spss_survey
## 7 retroharmonize_labelled_spss_survey
## 8 retroharmonize_labelled_spss_survey
## 9 retroharmonize_labelled_spss_survey
## 10 retroharmonize_labelled_spss_survey
## 11 retroharmonize_labelled_spss_survey
## label_orig
## 1 handcorr handling corruption
## 2 corrupti bribery not common how many involved in corruption
## 3 correlec corruption among elected leaders
## 4 corrserv corruption among civil servants
## 5 corrwrse corruption worse under old military gov
## 6 trpres trust the president to do what is right
## 7 trpolice trust police
## 8 trcourts trust courts of law
## 9 trarmy trust army
## 10 trelcco trust electoral commission
## 11 trbroad trust national broadcasting
## labels valid_labels na_labels na_range
## 1 1, 2, 3, 4, 9, 98, 99 1, 2, 3, 4, 9, 98, 99 NA
## 2 1, 2, 3, 4, 5, 9, 98, 99 1, 2, 3, 4, 5, 9, 98, 99 NA
## 3 1, 2, 3, 4, 9, 98, 99 1, 2, 3, 4, 9, 98, 99 NA
## 4 1, 2, 3, 4, 9, 98, 99 1, 2, 3, 4, 9, 98, 99 NA
## 5 1, 2, 3, 4, 5, 9, 98, 99 1, 2, 3, 4, 5, 9, 98, 99 NA
## 6 1, 2, 3, 4, 9, 98, 99 1, 2, 3, 4, 9, 98, 99 NA
## 7 1, 2, 3, 4, 9, 98, 99 1, 2, 3, 4, 9, 98, 99 NA
## 8 1, 2, 3, 4, 9, 98, 99 1, 2, 3, 4, 9, 98, 99 NA
## 9 1, 2, 3, 4, 9, 98, 99 1, 2, 3, 4, 9, 98, 99 NA
## 10 1, 2, 3, 4, 9, 98, 99 1, 2, 3, 4, 9, 98, 99 NA
## 11 1, 2, 3, 4, 9, 98, 99 1, 2, 3, 4, 9, 98, 99 NA
## n_labels n_valid_labels n_na_labels
## 1 7 7 0
## 2 8 8 0
## 3 7 7 0
## 4 7 7 0
## 5 8 8 0
## 6 7 7 0
## 7 7 7 0
## 8 7 7 0
## 9 7 7 0
## 10 7 7 0
## 11 7 7 0
The first step of ex post harmonization is to select the variables that have the same meaning and to give them a harmonized variable name. We do this separately for each concept we are interested in. We start with watching news on TV (on the English version of the questionnaire: How often do you get news from the following sources?).
to_harmonize_tv <- ab_metadata %>%
filter ( tolower(var_name_orig) %in%
c("rowid", "dateintr", "country", "region", "withinwt") |
grepl("(tv|television) news", label_orig ) ) %>%
mutate ( var_label = var_label_normalize(label_orig),
# some variable labels require extra adjustments
var_label = ifelse( grepl( "tv\\s*news", var_label), "television news", var_label),
var_label = ifelse( grepl( "region province" , var_label), "province or region", var_label),
var_label = ifelse( grepl( "weight(ing)*" , var_label), "within country weight", var_label)) %>%
mutate ( var_label = case_when (
grepl("^unique identifier", var_label) ~ "unique_id",
TRUE ~ var_label)) %>%
mutate ( var_name = val_label_normalize(var_label))
Let’s check out a few rows only:
slice ( to_harmonize_tv %>%
select ( all_of(c("id", "var_name", "var_label"))), 1:10)
## id var_name var_label
## 1 Afrobarometer_R1 unique_id unique_id
## 2 Afrobarometer_R1 country country
## 3 Afrobarometer_R1 television_news television news
## 4 Afrobarometer_R1 within_country_weight within country weight
## 5 Afrobarometer_R2 unique_id unique_id
## 6 Afrobarometer_R2 country country
## 7 Afrobarometer_R2 province_or_region province or region
## 8 Afrobarometer_R2 date_of_interview date of interview
## 9 Afrobarometer_R2 television_news television news
## 10 Afrobarometer_R2 within_country_weight within country weight
Now let’s select and subset the internet variables (on the English version of the questionnaire: How often do you use the internet?).
to_harmonize_internet <- ab_metadata %>%
filter (
id %in% c("Afrobarometer_R4", "Afrobarometer_R5", "Afrobarometer_R6", "Afrobarometer_R7"),
tolower(var_name_orig) %in%
c("rowid", "dateintr", "country", "region", "withinwt") |
grepl("use( the )*internet", label_orig ) ) %>%
mutate (
var_label = var_label_normalize(label_orig),
# some variable labels require extra adjustments
var_label = ifelse(
grepl( "region\\sprovince" , var_label),
"province or region", var_label),
var_label = ifelse( grepl( "weight(ing)*" , var_label),
"within country weight", var_label)
) %>%
mutate (
var_label = case_when (
grepl("^unique identifier", var_label) ~ "unique_id",
TRUE ~ var_label)
) %>%
mutate ( var_name = val_label_normalize(var_label))
Variables about internet use appear only from Afrobarometer Round 4 onwards.
slice ( to_harmonize_internet %>%
select ( all_of(c("id", "var_name", "var_label"))), 1:12)
From each data set, we subset only the variables that we selected for harmonization in the previous step. This will make our data small and manageable.
merged_ab_tv <- merge_waves (
waves = ab_waves,
var_harmonization = to_harmonize_tv )
merged_ab_tv <- lapply ( merged_ab_tv,
FUN = function(x) x %>%
mutate_at (
vars(any_of(c("country", "province_or_region"))),
as_character ) )
Applying document_waves shows that we are left with 5-6 variables in each data set.
document_waves(merged_ab_tv)
## # A tibble: 7 x 5
## id filename ncol nrow object_size
## <chr> <chr> <int> <int> <dbl>
## 1 Afrobaromet~ merged_r1_data.sav 5 21531 2592136
## 2 Afrobaromet~ merged_r2_data.sav 6 24301 3127944
## 3 Afrobaromet~ merged_r3_data.sav 6 25397 3270904
## 4 Afrobaromet~ merged_r4_data.sav 6 27713 3570560
## 5 Afrobaromet~ merged-round-5-data-34-countries-2011-20~ 6 51587 11592672
## 6 Afrobaromet~ merged_r6_data_2016_36countries2.sav 6 53935 7803432
## 7 Afrobaromet~ r7_merged_data_34ctry.release.sav 6 45823 6632056
Similarly for the internet variables:
merged_ab_internet <- merge_waves (
# only rounds 4-7 contain variables on internet use
waves = ab_waves[4:7],
var_harmonization = to_harmonize_internet )
merged_ab_internet <- lapply (
merged_ab_internet,
FUN = function(x) x %>%
mutate_at (
vars(any_of(c("country", "province_or_region"))),
as_character ) )
The next step is to bring the data that represent the same answer options of the questionnaire into exactly the same format. Given that we work with SPSS files, and we want to keep the possibility of re-exporting to SPSS, we created an SPSS-compatible vector type, labelled_spss_survey(). This class is an augmented version of haven::labelled_spss(). It contains a numeric code for each answer to the questionnaire item and a corresponding label, for example yes = 1. Furthermore, some answer options, such as “refused to say”, may be labelled as missing values for numeric manipulations of the data.
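As an illustration, a similar vector can be mocked up with haven::labelled_spss(), on which labelled_spss_survey() builds; the item, codes and labels below are invented for the example:
library(haven)
# a made-up trust item: 0-3 are valid answers, 9 is declared a user-defined missing value
trust_courts <- labelled_spss(
  x         = c(3, 1, 9, 0, 2),
  labels    = c("Not at all" = 0, "Just a little" = 1,
                "Somewhat" = 2, "A lot" = 3, "Don't know" = 9),
  na_values = 9,
  label     = "Trust courts of law"
)
mean(as.numeric(trust_courts))                             # naive mean treats 9 as a valid answer
mean(as.numeric(zap_missing(trust_courts)), na.rm = TRUE)  # user-defined missing recoded to NA first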
First let’s check these missing values:
collect_na_labels( to_harmonize_tv )
## character(0)
Afrobarometer’s SPSS files do not contain user-defined missing values, so we will have to set them ourselves later. This information may not have been saved when the SPSS files were created, or may not have been available in earlier file formats.
However, there are several labels that indicate missing data, and they are not harmonized: for example, “Missing data”, “Missing”, “Refused”, and “Don’t know”.
collect_val_labels (to_harmonize_tv %>%
filter ( grepl( "television", var_name) ))
## [1] "Never"
## [2] "Less than once a month"
## [3] "About once a month"
## [4] "About once a week/few times a month"
## [5] "Several times a week/a few times a week"
## [6] "Every day"
## [7] "Don't know"
## [8] "Refused to answer"
## [9] "Missing data"
## [10] "Missing"
## [11] "A few times a month"
## [12] "A few times a week"
## [13] "Refused"
The harmonize_values() function carries out the strict harmonization of labelled_spss_survey() vectors. While we could loop over a well-parametrized call to the generic function, our code is far more readable if we create a wrapper that fixes the parameters of harmonize_values() for the television variables:
harmonize_ab_tv <- function(x) {
label_list <- list(
from = c("missing",
"refused",
"don't know",
"every(\\s)*day",
"few times a week|several times a week",
"few times a month|about once a month",
"less than once a month", "never"),
to = c("missing",
"refused",
"do_not_know",
"every_day",
"a_few_times_a_week",
"a_few_times_a_month",
"less_than_once_a_month", "never"),
numeric_values = c(99999,
99998,
99997,
4,
3,
2,
1,
0)
)
harmonize_values(
x,
harmonize_labels = label_list,
na_values = c( "do_not_know" = 99997,
"refused" = 99998,
"missing" = 99999 )
)
}
The label list is a chain of regular expressions that makes sure that all versions of the same category, such as “About once a week/few times a month” and “A few times a month”, get the same numeric code and label pair.
We also tag the various missing categories to distinguish them from the normal value range and give them unambiguous labels.
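To see the recoding in isolation, harmonize_ab_tv() can be applied to a small, made-up labelled vector; the labels mimic the categories listed above, and the example is only an illustration, not Afrobarometer data:
demo_tv <- haven::labelled_spss(
  x      = c(4, 1, 0, 9, 3),
  labels = c("Never" = 0, "Less than once a month" = 1,
             "A few times a month" = 2, "A few times a week" = 3,
             "Every day" = 4, "Don't know" = 9),
  label  = "television news"
)
# the values get the harmonized numeric codes and labels,
# and "Don't know" is tagged as a user-defined missing value
harmonize_ab_tv(demo_tv)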
The harmonize_waves() function creates a natural full join of the data. In this example, we have only one variable to harmonize. In our next blog post we will show an example in which certain variables are present only in certain waves. In that case harmonize_waves() makes sure that each survey’s table has the same number of columns, and a variable that is not present in a wave gets a column of appropriately labelled missing values.
harmonized_ab_tv <- harmonize_waves (
waves = merged_ab_tv,
.f = harmonize_ab_tv )
We are looping over a list of survey tables and carry out the harmonization on each of them. You can harmonize in one go as many labelled_spss_survey() vectors as the regular expressions in label_list can unambiguously handle.
The structure of the television part of our longitudinal data:
h_ab_structure_tv <- attributes(harmonized_ab_tv)
h_ab_structure_tv$row.names <- NULL # We have over 100K row names
h_ab_structure_tv
Let’s repeat these steps with the internet variables. In this case, the same regular expressions would work for both the internet and the television variables, but it is unnecessary to loop over the first survey waves for the internet variables, because they are not present in Rounds 1, 2 and 3.
harmonize_ab_internet <- function(x) {
label_list <- list(
from = c("missing",
"refused",
"don't know",
"every(\\s)*day",
"few times a week|several times a week",
"few times a month|about once a month",
"less than once a month",
"never"),
to = c("missing",
"refused",
"do_not_know",
"every_day",
"a_few_times_a_week",
"a_few_times_a_month",
"less_than_once_a_month",
"never"),
numeric_values = c(99999,
99998,
99997,
4,
3,
2,
1,
0)
)
harmonize_values(
x,
harmonize_labels = label_list,
na_values = c( "do_not_know" = 99997,
"refused" = 99998,
"missing" = 99999 )
)
}
harmonized_ab_internet <- harmonize_waves (
waves = merged_ab_internet,
.f = harmonize_ab_internet )
h_ab_structure_internet <- attributes(harmonized_ab_internet)
h_ab_structure_internet$row.names <- NULL
h_ab_structure_internet
## $names
## [1] "date_of_interview" "unique_id"
## [3] "country" "province_or_region"
## [5] "within_country_weight" "how_often_use_the_internet"
##
## $class
## [1] "tbl_df" "tbl" "data.frame"
##
## $id
## [1] "Waves: Afrobarometer_R4; Afrobarometer_R5; Afrobarometer_R6; Afrobarometer_R7"
##
## $filename
## [1] "Original files: merged_r4_data.sav; merged-round-5-data-34-countries-2011-2013-last-update-july-2015.sav; merged_r6_data_2016_36countries2.sav; r7_merged_data_34ctry.release.sav"
And let’s join the two tables:
harmonized_ab <- full_join(
harmonized_ab_tv,
harmonized_ab_internet,
by = c("date_of_interview", "unique_id", "country",
"province_or_region", "within_country_weight"))
harmonized_ab <- harmonized_ab %>%
mutate ( year = as.integer(substr(as.character(
date_of_interview),1,4)),
wave = plyr::mapvalues(
year,
from = c(2000, 2002:2018, 1582), # year 1582 for Senegal
to = c(1, 2,2,2,3,3,NA,4,4,NA,5,5,6,6,7,7,8,8, 5)))
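Before turning to the chart, here is a minimal, unweighted sketch of the kind of indicator behind it: the share of respondents who watch television news every day, by country and wave. The value 4 is the every_day code we defined in harmonize_ab_tv(); the sketch assumes that as_numeric() turns the tagged missing categories into NA, and it ignores the within-country weights for brevity.
tv_daily_share <- harmonized_ab %>%
  # as_numeric() converts the labelled_spss_survey vector to numbers;
  # the tagged missing categories (do_not_know, refused, missing) become NA
  mutate ( tv_every_day = as_numeric(television_news) == 4 ) %>%
  filter ( !is.na(tv_every_day), !is.na(wave) ) %>%
  group_by ( country, wave ) %>%
  summarize ( share_tv_daily = mean(tv_every_day), .groups = "drop" )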
[Chart: daily television and internet use across Afrobarometer rounds.]
The internet as a primary news source is on the rise all over Africa. In countries with a high penetration of television sets, the internet is partly replacing television news: over time, fewer people watch television news every day. In countries with a lower penetration of telecommunication devices, the internet is leapfrogging television, bringing instantaneous information (and disinformation) to large parts of the population.