R training: Analysis of survey data with survey and srvyr

IMPACT Initiatives - Iraq (Mar 2021)

0. Pretext

This training focuses on how to analyze survey data in R. It will teach you step by step how to get population means and proportions from your quantitative datasets, for both unweighted and weighted (clustered/stratified) samples.

This is not a training on HQ’s hypegrammaR package - instead, it uses the popular R packages survey and srvyr. srvyr is built on the well documented survey, but has the added benefit of using syntax that can easily be combined with the workhorse package dplyr and all its functions. In addition, srvyr returns outputs in a standardized format, which may be easier to work with than survey.

1. Setup

Install package: Before you can use survey or srvyr, you need to need to install the packages (only once per computer). Run the following code (in R):

install.packages("survey")
install.packages("srvyr")

1.1 Load packages

Every time you run an R session using the survey or srvyr packages, you will have to call (“activate”) whichever you are using so you can use the respective formulas. If you use srvyr, you also need activate dplyr as srvyr uses its syntax.

library(survey)
library(srvyr)
library(dplyr)

1.2 Load data

Next, load in your data. Define a new object (here called “data”) and refer to the dataset using the base R function read.csv(). Define your NA strings (denoting empty cells) as needed, and specify which character the csv uses as separator.

data <- read.csv("data/example_data.csv", na.strings = c("NA", ""), sep = ",")

If your sample needs to be weighted, you also have to load in a sampling frame, which defines the population estimates of your strata/clusters. Make sure that the spelling of the strata names matches the strata variable, which we will define in the next step.

sampling_frame <- read.csv("data/sampling_frame.csv", sep = ",")

Here is what our example dataset and sampling frame look like:

Dataset

start end today geographic_region governorate district nationality camp_out_of_camp age_respondent gender_respondent size_hh calc_total_income calc_total_expenditure sources_food food_security_index accomodation_type primary_livelihood primary_livelihood.savings primary_livelihood.employment primary_livelihood.remittances primary_livelihood.retirement_pension primary_livelihood.selling_assets primary_livelihood.selling_assistance_received primary_livelihood.loans_debts primary_livelihood.MODM_cash_assistance primary_livelihood.support_from_community primary_livelihood.NGO_charity_assistance primary_livelihood.social_service primary_livelihood.mosque_church primary_livelihood.illegal_activity primary_livelihood.other primary_livelihood_other X_uuid strata population
2020-09-04T02:10:03.782+03:00 2020-09-04T14:43:12.323+03:00 04/09/2020 NA al_sulaymaniyah al_sulaymaniyah iran NA 60 female 4 350000 375000 credit 3 House employment loans_debts 0 1 0 0 0 0 1 0 0 0 0 0 0 0 NA 0009c45d-2940-45ff-aa1e-5fce6ac6bf57 iran_NA_NA 3915
2020-09-16T13:18:42.651+03:00 2020-09-16T13:59:56.989+03:00 16/09/2020 al_sulaymaniyah al_sulaymaniyah al_sulaymaniyah syria camp 35 male 5 510000 310000 credit 2 NA employment loans_debts NGO_charity_assistance 0 1 0 0 0 0 1 0 0 1 0 0 0 0 NA 000fc018-03d2-4e57-9c0c-21b27fb929cd syria_al_sulaymaniyah_camp 2163
2020-09-13T09:45:35.867+03 2020-09-13T10:24:14.564+03 13/09/2020 erbil erbil erbil syria camp 33 male 6 232000 257000 credit 2 NA employment loans_debts NGO_charity_assistance 0 1 0 0 0 0 1 0 0 1 0 0 0 0 NA 0027aa1c-d305-4d6e-8134-6b654e817d4a syria_erbil_camp 6901
2020-09-28T09:19:25.797+03:00 2020-09-28T10:04:22.460+03:00 28/09/2020 duhok duhok duhok syria out_of_camp 32 male 3 200000 610000 credit 1 Apartment employment 0 1 0 0 0 0 0 0 0 0 0 0 0 0 NA 00a23f2f-7b8b-412a-8e98-e69eee0c7d2b syria_duhok_out_of_camp 8819
2020-09-03T15:16:30.371+03:00 2020-09-03T15:57:11.240+03:00 03/09/2020 NA erbil erbil turkey NA 51 male 4 420000 395000 gift_family 2 House employment loans_debts NGO_charity_assistance 0 1 0 0 0 0 1 0 0 1 0 0 0 0 NA 00a3ac66-e7c1-4576-8288-501e4f2d9d53 turkey_NA_NA 2470
2020-09-17T10:05:20.360+03:00 2020-09-17T10:49:10.425+03:00 17/09/2020 al_sulaymaniyah al_sulaymaniyah al_sulaymaniyah syria out_of_camp 24 male 1 300000 235000 cash_own 1 House employment 0 1 0 0 0 0 0 0 0 0 0 0 0 0 NA 00fc88ac-864f-4e9e-9bb9-55a25e8f86ca syria_al_sulaymaniyah_out_of_camp 7525

Sampling frame

strata.names population
iran_NA_NA 3915
palestinian_territories_NA_NA 2916
turkey_NA_NA 2470
syria_al_sulaymaniyah_camp 2163
syria_al_sulaymaniyah_out_of_camp 7525
syria_centre_south_out_of_camp 1127

1.3 Define strata variable

Next, you need to define a variable specifying your strata (if you do not already have such a variable in the dataset). Define a new variable by first calling the name of the dataset (data), followed by $ and then the name of the new variable (strata). The value you give this new variable is defined by the paste() function, which pastes values from multiple columns together as needed, separated by a character as defined.

data$strata <- paste(data$nationality,
                     data$geographic_region,
                     data$camp_out_of_camp,
                     sep = "_"
                     )

Check out your dataset and double-check that the spelling of the newly defined variable and the strata names in your sampling frame match exactly. If they do not, you will run into problems in the next step.

1.4 Join sampling frame to dataset

To join the two dataframes together, use the left_join() function from the dplyr package.

data <- left_join(data, sampling_frame, by = c("strata" = "strata.names"))

Note that steps 1.3 and 1.4 could easily also be done in Excel. However, it is best practice to document every aggregation step and therefore do everything in R.

2. Analysis with survey

Everything above was just preparation. Only now we are actually getting into the survey package.

2.1 Define survey design

In a first step, define the survey design using the svydesign() function. Define a new object (“design_survey”) like so:

design_survey <- svydesign(data = data, id =~1, strata = ~strata, fpc = ~population)

The id argument defines the clusters IDs (from largest to smallest). In the case of stratified sampling only, leave the argument empty by specifying ~1. Define the strata argument as the column you specified above with the strata names (preceeded by ~). fpc stands for “finite population correct”, and specifies the column in which you have your population estimates. Using these estimates, survey will calculate weighted statistics.

2.2 Without disaggregation

Now that the design object has been defined, we can continue and run some calculations with our sample data.

For numerical and categorical variables you call the svymean() function. Call the indicator you wish to analyse (preceeded by ~), as well as the survey design object. Add na.rm = TRUE if you want to exclude all empty values (NA).

Example with numerical variable (household size):

svymean(~size_hh, design_survey, na.rm = TRUE)
##           mean  SE
## size_hh 4.8303 0.1

Example with categorical variable (gender):

svymean(~gender_respondent, design_survey, na.rm = TRUE)
##                            mean     SE
## gender_respondentfemale 0.13246 0.0131
## gender_respondentmale   0.86754 0.0131

2.3 With disaggregation

For disaggregated figures, use the svyby() function. From within, you call the indicators, survey design, and aggregation function (“svymean”).

Example with numerical variable:

svyby(~calc_total_income, ~nationality, design_survey, svymean)
##                                     nationality calc_total_income        se
## iran                                       iran          330358.7  8310.646
## palestinian_territories palestinian_territories          518851.2 12418.912
## syria                                     syria          423404.0 16221.628
## turkey                                   turkey          422513.9 20247.155

3. Analysis with srvyr

3.1 Define survey design

First, you have to define the survey design with the as_survey_design() formula.

design_srvyr <- data %>% as_survey_design(ids = 1, strata = strata, fpc = population)

As you can see, it looks very similar to the equivalent function from the survey package.

Check out the as_survey_design() documentation to see which other arguments can be put in the expression.

3.2 Numerical variable - without disaggregation

In the case of a numerical variable, you define a new object (res_1) and define it as our survey design object (design).

Next is where the dplyr syntax comes in: Add a summarise() statements and “pipe” the expressions together using the pipe operator (%>%). Within the summarise() statement, you define a new variable (“mean”). This new variable is defined using the survey_mean() function from srvyr. Inside the expression you call the indicator you want to calculate the mean of (e.g. “size_hh”).

Include na.rm = TRUE to exclude missing values.

result <- design_srvyr %>%
  summarise(mean = survey_mean(size_hh, na.rm = TRUE))

Call the object to have a look at the results.

result
##      mean   mean_se
## 1 4.83025 0.1000442

As you can see, the average household size (weighted across the strata based on population estimates) in our sample data is 4.8. By default, the output also includes the standard error.

3.3 Numerical variable - with disaggregation

Let’s say you wanted to disaggregate your findings by another variable. In this case, the syntax will include a group_by() statement, in which you specify the name of the variable by which you wish to disaggregate the results. In the following example, we are calculating mean incomes per nationality.

design_srvyr %>%
  group_by(nationality) %>%
  summarise(mean = survey_mean(calc_total_income, na.rm = TRUE, vartype = "ci"))
## # A tibble: 4 x 4
##   nationality                mean mean_low mean_upp
##   <chr>                     <dbl>    <dbl>    <dbl>
## 1 iran                    330359.  314059.  346658.
## 2 palestinian_territories 518851.  494494.  543208.
## 3 syria                   423404.  391589.  455219.
## 4 turkey                  422514.  382803.  462224.

Note that in the results output there are now 4 columns: 1) disaggregations, 2) mean, 3) lower bound of the confidence interval, and 4) upper bound of the confidence interval (0.95 confidence level by default).

The confidence intervals (CI) can be helpful to quickly determine whether differences between groups are statistically significant. If the CIs do not overlap, the difference is statistically significant. If the CIs do overlap, you might need to run a significance test to know.

3.4 Categorical variable - without disaggregation

In the case of categorical variables, the syntax is slightly different. Rather than defining the variable in the summarise() statement, you move it up and call in the group_by() expression.

design_srvyr %>%
  group_by(gender_respondent) %>%
  summarise(mean = survey_mean(vartype = "ci"))
## # A tibble: 2 x 4
##   gender_respondent  mean mean_low mean_upp
##   <chr>             <dbl>    <dbl>    <dbl>
## 1 female            0.132    0.107    0.158
## 2 male              0.868    0.842    0.893

3.5 Categorical variable - with disaggregation

If you want to add disaggregations to a case with a categorical variable, simply add the variable name to the group_by() statement. If you want to exclude empty cells (NA) from your variable of interest, or disaggregation variable, remove them using the filter() expression.

design_srvyr %>%
  filter(!is.na(camp_out_of_camp)) %>%
  group_by(camp_out_of_camp, sources_food) %>%
  summarise(mean = survey_mean(vartype = "ci"))
## # A tibble: 12 x 5
## # Groups:   camp_out_of_camp [2]
##    camp_out_of_camp sources_food        mean  mean_low mean_upp
##    <chr>            <chr>              <dbl>     <dbl>    <dbl>
##  1 camp             assistance_local 0.00594 -0.00568   0.0176 
##  2 camp             cash_assistance  0.0576   0.0286    0.0866 
##  3 camp             cash_own         0.461    0.396     0.527  
##  4 camp             credit           0.466    0.400     0.532  
##  5 camp             gift_family      0.00890 -0.00407   0.0219 
##  6 out_of_camp      assistance_un    0.00568 -0.00545   0.0168 
##  7 out_of_camp      cash_assistance  0.0275   0.00689   0.0481 
##  8 out_of_camp      cash_own         0.685    0.626     0.744  
##  9 out_of_camp      credit           0.253    0.198     0.307  
## 10 out_of_camp      gift_family      0.0165  -0.000148  0.0331 
## 11 out_of_camp      in_kind_labour   0.0114  -0.00431   0.0270 
## 12 out_of_camp      vouchers_PDS     0.00132  0.000234  0.00241

3.6 Adding counts

In addition to having your sample means and proportions, you might be interested in having an additional column in your output indicating the unweighted counts (i.e. number of respondents).

For categorical variables, specify a new variable (“count”) in the summarise() statement and define it as the number of entries using n(), and wrap a unweighted() function around it to make sure the results do not get weighted.

design_srvyr %>%
  group_by(gender_respondent) %>%
  summarise(mean  = survey_mean(vartype = "ci"),
            count = unweighted(n()))
## # A tibble: 2 x 5
##   gender_respondent  mean mean_low mean_upp count
##   <chr>             <dbl>    <dbl>    <dbl> <int>
## 1 female            0.132    0.107    0.158   325
## 2 male              0.868    0.842    0.893  1475

For numerical variables, the syntax is slightly different:

design_srvyr %>%
  summarise(mean  = survey_mean(calc_total_income, na.rm = TRUE, vartype = "ci"),
            count = unweighted(sum(!is.na(calc_total_income))))
##     mean mean_low mean_upp count
## 1 422302 394134.1 450469.9  1800

4. Additional functions

With the above examples, we are only scratching on the surface. Have a look at the survey and srvyr documentations to get a full list of functions that the packages provide.

back to top