R training: Analysis of survey data with survey and srvyr

1. Setup

Install package: Before you can use survey or srvyr, you need to need to install the packages (only once per computer). Run the following code (in R):
install.packages("survey")
install.packages("srvyr")

1.1 Load packages

Every time you run an R session using the survey or srvyr packages, you will have to call (“activate”) whichever you are using so you can use the respective formulas. If you use srvyr, you also need activate dplyr as srvyr uses its syntax.

library(survey)
library(srvyr)
library(dplyr)

1.2 Load data

Next, load in your data. Define a new object (here called “data”) and refer to the dataset using the base R function read.csv(). Define your NA strings (denoting empty cells) as needed, and specify which character the csv uses as separator.

data <- read.csv("data/example_data.csv", na.strings = c("NA", ""), sep = ",")

If your sample needs to be weighted, you also have to load in a sampling frame, which defines the population estimates of your strata/clusters. Make sure that the spelling of the strata names matches the strata variable, which we will define in the next step.

sampling_frame <- read.csv("data/sampling_frame.csv", sep = ",")

Here is what our example dataset and sampling frame look like:

Dataset

start	end	today	geographic_region	governorate	district	nationality	camp_out_of_camp	age_respondent	gender_respondent	size_hh	calc_total_income	calc_total_expenditure	sources_food	food_security_index	accomodation_type	primary_livelihood	primary_livelihood.employment	primary_livelihood.loans_debts	primary_livelihood.NGO_charity_assistance	primary_livelihood_other	X_uuid	strata	population
2020-09-04T02:10:03.782+03:00	2020-09-04T14:43:12.323+03:00	04/09/2020	NA	al_sulaymaniyah	al_sulaymaniyah	iran	NA	60	female	4	350000	375000	credit	3	House	employment loans_debts	1	1	0	NA	0009c45d-2940-45ff-aa1e-5fce6ac6bf57	iran_NA_NA	3915
2020-09-16T13:18:42.651+03:00	2020-09-16T13:59:56.989+03:00	16/09/2020	al_sulaymaniyah	al_sulaymaniyah	al_sulaymaniyah	syria	camp	35	male	5	510000	310000	credit	2	NA	employment loans_debts NGO_charity_assistance	1	1	1	NA	000fc018-03d2-4e57-9c0c-21b27fb929cd	syria_al_sulaymaniyah_camp	2163
2020-09-13T09:45:35.867+03	2020-09-13T10:24:14.564+03	13/09/2020	erbil	erbil	erbil	syria	camp	33	male	6	232000	257000	credit	2	NA	employment loans_debts NGO_charity_assistance	1	1	1	NA	0027aa1c-d305-4d6e-8134-6b654e817d4a	syria_erbil_camp	6901
2020-09-28T09:19:25.797+03:00	2020-09-28T10:04:22.460+03:00	28/09/2020	duhok	duhok	duhok	syria	out_of_camp	32	male	3	200000	610000	credit	1	Apartment	employment	1	0	0	NA	00a23f2f-7b8b-412a-8e98-e69eee0c7d2b	syria_duhok_out_of_camp	8819
2020-09-03T15:16:30.371+03:00	2020-09-03T15:57:11.240+03:00	03/09/2020	NA	erbil	erbil	turkey	NA	51	male	4	420000	395000	gift_family	2	House	employment loans_debts NGO_charity_assistance	1	1	1	NA	00a3ac66-e7c1-4576-8288-501e4f2d9d53	turkey_NA_NA	2470
2020-09-17T10:05:20.360+03:00	2020-09-17T10:49:10.425+03:00	17/09/2020	al_sulaymaniyah	al_sulaymaniyah	al_sulaymaniyah	syria	out_of_camp	24	male	1	300000	235000	cash_own	1	House	employment	1	0	0	NA	00fc88ac-864f-4e9e-9bb9-55a25e8f86ca	syria_al_sulaymaniyah_out_of_camp	7525

Sampling frame

strata.names	population
iran_NA_NA	3915
palestinian_territories_NA_NA	2916
turkey_NA_NA	2470
syria_al_sulaymaniyah_camp	2163
syria_al_sulaymaniyah_out_of_camp	7525
syria_centre_south_out_of_camp	1127

1.3 Define strata variable

Next, you need to define a variable specifying your strata (if you do not already have such a variable in the dataset). Define a new variable by first calling the name of the dataset (data), followed by $ and then the name of the new variable (strata). The value you give this new variable is defined by the paste() function, which pastes values from multiple columns together as needed, separated by a character as defined.

data$strata <- paste(data$nationality,
                     data$geographic_region,
                     data$camp_out_of_camp,
                     sep = "_"
                     )

Check out your dataset and double-check that the spelling of the newly defined variable and the strata names in your sampling frame match exactly. If they do not, you will run into problems in the next step.

1.4 Join sampling frame to dataset

To join the two dataframes together, use the left_join() function from the dplyr package.

data <- left_join(data, sampling_frame, by = c("strata" = "strata.names"))

Note that steps 1.3 and 1.4 could easily also be done in Excel. However, it is best practice to document every aggregation step and therefore do everything in R.

2. Analysis with survey

Everything above was just preparation. Only now we are actually getting into the survey package.

2.1 Define survey design

In a first step, define the survey design using the svydesign() function. Define a new object (“design_survey”) like so:

design_survey <- svydesign(data = data, id =~1, strata = ~strata, fpc = ~population)

The id argument defines the clusters IDs (from largest to smallest). In the case of stratified sampling only, leave the argument empty by specifying ~1. Define the strata argument as the column you specified above with the strata names (preceeded by ~). fpc stands for “finite population correct”, and specifies the column in which you have your population estimates. Using these estimates, survey will calculate weighted statistics.

2.2 Without disaggregation

Now that the design object has been defined, we can continue and run some calculations with our sample data.

For numerical and categorical variables you call the svymean() function. Call the indicator you wish to analyse (preceeded by ~), as well as the survey design object. Add na.rm = TRUE if you want to exclude all empty values (NA).

Example with numerical variable (household size):

svymean(~size_hh, design_survey, na.rm = TRUE)

##           mean  SE
## size_hh 4.8303 0.1

Example with categorical variable (gender):

svymean(~gender_respondent, design_survey, na.rm = TRUE)

##                            mean     SE
## gender_respondentfemale 0.13246 0.0131
## gender_respondentmale   0.86754 0.0131

2.3 With disaggregation

For disaggregated figures, use the svyby() function. From within, you call the indicators, survey design, and aggregation function (“svymean”).

Example with numerical variable:

svyby(~calc_total_income, ~nationality, design_survey, svymean)

##                                     nationality calc_total_income        se
## iran                                       iran          330358.7  8310.646
## palestinian_territories palestinian_territories          518851.2 12418.912
## syria                                     syria          423404.0 16221.628
## turkey                                   turkey          422513.9 20247.155

3. Analysis with srvyr

3.1 Define survey design

First, you have to define the survey design with the as_survey_design() formula.

design_srvyr <- data %>% as_survey_design(ids = 1, strata = strata, fpc = population)

As you can see, it looks very similar to the equivalent function from the survey package.

Check out the as_survey_design() documentation to see which other arguments can be put in the expression.

3.2 Numerical variable - without disaggregation

In the case of a numerical variable, you define a new object (res_1) and define it as our survey design object (design).

Next is where the dplyr syntax comes in: Add a summarise() statements and “pipe” the expressions together using the pipe operator (%>%). Within the summarise() statement, you define a new variable (“mean”). This new variable is defined using the survey_mean() function from srvyr. Inside the expression you call the indicator you want to calculate the mean of (e.g. “size_hh”).

Include na.rm = TRUE to exclude missing values.

result <- design_srvyr %>%
  summarise(mean = survey_mean(size_hh, na.rm = TRUE))

Call the object to have a look at the results.

result

##      mean   mean_se
## 1 4.83025 0.1000442

As you can see, the average household size (weighted across the strata based on population estimates) in our sample data is 4.8. By default, the output also includes the standard error.

3.3 Numerical variable - with disaggregation

Let’s say you wanted to disaggregate your findings by another variable. In this case, the syntax will include a group_by() statement, in which you specify the name of the variable by which you wish to disaggregate the results. In the following example, we are calculating mean incomes per nationality.

design_srvyr %>%
  group_by(nationality) %>%
  summarise(mean = survey_mean(calc_total_income, na.rm = TRUE, vartype = "ci"))

## # A tibble: 4 x 4
##   nationality                mean mean_low mean_upp
##   <chr>                     <dbl>    <dbl>    <dbl>
## 1 iran                    330359.  314059.  346658.
## 2 palestinian_territories 518851.  494494.  543208.
## 3 syria                   423404.  391589.  455219.
## 4 turkey                  422514.  382803.  462224.

Note that in the results output there are now 4 columns: 1) disaggregations, 2) mean, 3) lower bound of the confidence interval, and 4) upper bound of the confidence interval (0.95 confidence level by default).

The confidence intervals (CI) can be helpful to quickly determine whether differences between groups are statistically significant. If the CIs do not overlap, the difference is statistically significant. If the CIs do overlap, you might need to run a significance test to know.

3.4 Categorical variable - without disaggregation

In the case of categorical variables, the syntax is slightly different. Rather than defining the variable in the summarise() statement, you move it up and call in the group_by() expression.

design_srvyr %>%
  group_by(gender_respondent) %>%
  summarise(mean = survey_mean(vartype = "ci"))

## # A tibble: 2 x 4
##   gender_respondent  mean mean_low mean_upp
##   <chr>             <dbl>    <dbl>    <dbl>
## 1 female            0.132    0.107    0.158
## 2 male              0.868    0.842    0.893

3.5 Categorical variable - with disaggregation

If you want to add disaggregations to a case with a categorical variable, simply add the variable name to the group_by() statement. If you want to exclude empty cells (NA) from your variable of interest, or disaggregation variable, remove them using the filter() expression.

design_srvyr %>%
  filter(!is.na(camp_out_of_camp)) %>%
  group_by(camp_out_of_camp, sources_food) %>%
  summarise(mean = survey_mean(vartype = "ci"))

## # A tibble: 12 x 5
## # Groups:   camp_out_of_camp [2]
##    camp_out_of_camp sources_food        mean  mean_low mean_upp
##    <chr>            <chr>              <dbl>     <dbl>    <dbl>
##  1 camp             assistance_local 0.00594 -0.00568   0.0176 
##  2 camp             cash_assistance  0.0576   0.0286    0.0866 
##  3 camp             cash_own         0.461    0.396     0.527  
##  4 camp             credit           0.466    0.400     0.532  
##  5 camp             gift_family      0.00890 -0.00407   0.0219 
##  6 out_of_camp      assistance_un    0.00568 -0.00545   0.0168 
##  7 out_of_camp      cash_assistance  0.0275   0.00689   0.0481 
##  8 out_of_camp      cash_own         0.685    0.626     0.744  
##  9 out_of_camp      credit           0.253    0.198     0.307  
## 10 out_of_camp      gift_family      0.0165  -0.000148  0.0331 
## 11 out_of_camp      in_kind_labour   0.0114  -0.00431   0.0270 
## 12 out_of_camp      vouchers_PDS     0.00132  0.000234  0.00241

3.6 Adding counts

In addition to having your sample means and proportions, you might be interested in having an additional column in your output indicating the unweighted counts (i.e. number of respondents).

For categorical variables, specify a new variable (“count”) in the summarise() statement and define it as the number of entries using n(), and wrap a unweighted() function around it to make sure the results do not get weighted.

design_srvyr %>%
  group_by(gender_respondent) %>%
  summarise(mean  = survey_mean(vartype = "ci"),
            count = unweighted(n()))

## # A tibble: 2 x 5
##   gender_respondent  mean mean_low mean_upp count
##   <chr>             <dbl>    <dbl>    <dbl> <int>
## 1 female            0.132    0.107    0.158   325
## 2 male              0.868    0.842    0.893  1475

For numerical variables, the syntax is slightly different:

design_srvyr %>%
  summarise(mean  = survey_mean(calc_total_income, na.rm = TRUE, vartype = "ci"),
            count = unweighted(sum(!is.na(calc_total_income))))

##     mean mean_low mean_upp count
## 1 422302 394134.1 450469.9  1800