This training focuses on how to analyze survey data in R. It will teach you step by step how to get population means and proportions from your quantitative datasets, for both unweighted and weighted (clustered/stratified) samples.
This is not a training on HQ’s hypegrammaR package - instead, it uses the popular R packages survey and srvyr. srvyr is built on the well documented survey, but has the added benefit of using syntax that can easily be combined with the workhorse package dplyr and all its functions. In addition, srvyr returns outputs in a standardized format, which may be easier to work with than survey.
Install package: Before you can use survey or srvyr, you need to need to install the packages (only once per computer). Run the following code (in R):
Every time you run an R session using the survey or srvyr packages, you will have to call (“activate”) whichever you are using so you can use the respective formulas. If you use srvyr, you also need activate dplyr as srvyr uses its syntax.
Next, load in your data. Define a new object (here called “data”) and refer to the dataset using the base R function read.csv(). Define your NA strings (denoting empty cells) as needed, and specify which character the csv uses as separator.
If your sample needs to be weighted, you also have to load in a sampling frame, which defines the population estimates of your strata/clusters. Make sure that the spelling of the strata names matches the strata variable, which we will define in the next step.
Here is what our example dataset and sampling frame look like:
| start | end | today | geographic_region | governorate | district | nationality | camp_out_of_camp | age_respondent | gender_respondent | size_hh | calc_total_income | calc_total_expenditure | sources_food | food_security_index | accomodation_type | primary_livelihood | primary_livelihood.savings | primary_livelihood.employment | primary_livelihood.remittances | primary_livelihood.retirement_pension | primary_livelihood.selling_assets | primary_livelihood.selling_assistance_received | primary_livelihood.loans_debts | primary_livelihood.MODM_cash_assistance | primary_livelihood.support_from_community | primary_livelihood.NGO_charity_assistance | primary_livelihood.social_service | primary_livelihood.mosque_church | primary_livelihood.illegal_activity | primary_livelihood.other | primary_livelihood_other | X_uuid | strata | population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-09-04T02:10:03.782+03:00 | 2020-09-04T14:43:12.323+03:00 | 04/09/2020 | NA | al_sulaymaniyah | al_sulaymaniyah | iran | NA | 60 | female | 4 | 350000 | 375000 | credit | 3 | House | employment loans_debts | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 0009c45d-2940-45ff-aa1e-5fce6ac6bf57 | iran_NA_NA | 3915 |
| 2020-09-16T13:18:42.651+03:00 | 2020-09-16T13:59:56.989+03:00 | 16/09/2020 | al_sulaymaniyah | al_sulaymaniyah | al_sulaymaniyah | syria | camp | 35 | male | 5 | 510000 | 310000 | credit | 2 | NA | employment loans_debts NGO_charity_assistance | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NA | 000fc018-03d2-4e57-9c0c-21b27fb929cd | syria_al_sulaymaniyah_camp | 2163 |
| 2020-09-13T09:45:35.867+03 | 2020-09-13T10:24:14.564+03 | 13/09/2020 | erbil | erbil | erbil | syria | camp | 33 | male | 6 | 232000 | 257000 | credit | 2 | NA | employment loans_debts NGO_charity_assistance | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NA | 0027aa1c-d305-4d6e-8134-6b654e817d4a | syria_erbil_camp | 6901 |
| 2020-09-28T09:19:25.797+03:00 | 2020-09-28T10:04:22.460+03:00 | 28/09/2020 | duhok | duhok | duhok | syria | out_of_camp | 32 | male | 3 | 200000 | 610000 | credit | 1 | Apartment | employment | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 00a23f2f-7b8b-412a-8e98-e69eee0c7d2b | syria_duhok_out_of_camp | 8819 |
| 2020-09-03T15:16:30.371+03:00 | 2020-09-03T15:57:11.240+03:00 | 03/09/2020 | NA | erbil | erbil | turkey | NA | 51 | male | 4 | 420000 | 395000 | gift_family | 2 | House | employment loans_debts NGO_charity_assistance | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NA | 00a3ac66-e7c1-4576-8288-501e4f2d9d53 | turkey_NA_NA | 2470 |
| 2020-09-17T10:05:20.360+03:00 | 2020-09-17T10:49:10.425+03:00 | 17/09/2020 | al_sulaymaniyah | al_sulaymaniyah | al_sulaymaniyah | syria | out_of_camp | 24 | male | 1 | 300000 | 235000 | cash_own | 1 | House | employment | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 00fc88ac-864f-4e9e-9bb9-55a25e8f86ca | syria_al_sulaymaniyah_out_of_camp | 7525 |
| strata.names | population |
|---|---|
| iran_NA_NA | 3915 |
| palestinian_territories_NA_NA | 2916 |
| turkey_NA_NA | 2470 |
| syria_al_sulaymaniyah_camp | 2163 |
| syria_al_sulaymaniyah_out_of_camp | 7525 |
| syria_centre_south_out_of_camp | 1127 |
Next, you need to define a variable specifying your strata (if you do not already have such a variable in the dataset). Define a new variable by first calling the name of the dataset (data), followed by $ and then the name of the new variable (strata). The value you give this new variable is defined by the paste() function, which pastes values from multiple columns together as needed, separated by a character as defined.
Check out your dataset and double-check that the spelling of the newly defined variable and the strata names in your sampling frame match exactly. If they do not, you will run into problems in the next step.
To join the two dataframes together, use the left_join() function from the dplyr package.
Note that steps 1.3 and 1.4 could easily also be done in Excel. However, it is best practice to document every aggregation step and therefore do everything in R.
Everything above was just preparation. Only now we are actually getting into the survey package.
In a first step, define the survey design using the svydesign() function. Define a new object (“design_survey”) like so:
The id argument defines the clusters IDs (from largest to smallest). In the case of stratified sampling only, leave the argument empty by specifying ~1. Define the strata argument as the column you specified above with the strata names (preceeded by ~). fpc stands for “finite population correct”, and specifies the column in which you have your population estimates. Using these estimates, survey will calculate weighted statistics.
Now that the design object has been defined, we can continue and run some calculations with our sample data.
For numerical and categorical variables you call the svymean() function. Call the indicator you wish to analyse (preceeded by ~), as well as the survey design object. Add na.rm = TRUE if you want to exclude all empty values (NA).
Example with numerical variable (household size):
## mean SE
## size_hh 4.8303 0.1
Example with categorical variable (gender):
## mean SE
## gender_respondentfemale 0.13246 0.0131
## gender_respondentmale 0.86754 0.0131
For disaggregated figures, use the svyby() function. From within, you call the indicators, survey design, and aggregation function (“svymean”).
Example with numerical variable:
## nationality calc_total_income se
## iran iran 330358.7 8310.646
## palestinian_territories palestinian_territories 518851.2 12418.912
## syria syria 423404.0 16221.628
## turkey turkey 422513.9 20247.155
First, you have to define the survey design with the as_survey_design() formula.
As you can see, it looks very similar to the equivalent function from the survey package.
Check out the as_survey_design() documentation to see which other arguments can be put in the expression.
In the case of a numerical variable, you define a new object (res_1) and define it as our survey design object (design).
Next is where the dplyr syntax comes in: Add a summarise() statements and “pipe” the expressions together using the pipe operator (%>%). Within the summarise() statement, you define a new variable (“mean”). This new variable is defined using the survey_mean() function from srvyr. Inside the expression you call the indicator you want to calculate the mean of (e.g. “size_hh”).
Include na.rm = TRUE to exclude missing values.
Call the object to have a look at the results.
## mean mean_se
## 1 4.83025 0.1000442
As you can see, the average household size (weighted across the strata based on population estimates) in our sample data is 4.8. By default, the output also includes the standard error.
Let’s say you wanted to disaggregate your findings by another variable. In this case, the syntax will include a group_by() statement, in which you specify the name of the variable by which you wish to disaggregate the results. In the following example, we are calculating mean incomes per nationality.
design_srvyr %>%
group_by(nationality) %>%
summarise(mean = survey_mean(calc_total_income, na.rm = TRUE, vartype = "ci"))## # A tibble: 4 x 4
## nationality mean mean_low mean_upp
## <chr> <dbl> <dbl> <dbl>
## 1 iran 330359. 314059. 346658.
## 2 palestinian_territories 518851. 494494. 543208.
## 3 syria 423404. 391589. 455219.
## 4 turkey 422514. 382803. 462224.
Note that in the results output there are now 4 columns: 1) disaggregations, 2) mean, 3) lower bound of the confidence interval, and 4) upper bound of the confidence interval (0.95 confidence level by default).
The confidence intervals (CI) can be helpful to quickly determine whether differences between groups are statistically significant. If the CIs do not overlap, the difference is statistically significant. If the CIs do overlap, you might need to run a significance test to know.
In the case of categorical variables, the syntax is slightly different. Rather than defining the variable in the summarise() statement, you move it up and call in the group_by() expression.
## # A tibble: 2 x 4
## gender_respondent mean mean_low mean_upp
## <chr> <dbl> <dbl> <dbl>
## 1 female 0.132 0.107 0.158
## 2 male 0.868 0.842 0.893
If you want to add disaggregations to a case with a categorical variable, simply add the variable name to the group_by() statement. If you want to exclude empty cells (NA) from your variable of interest, or disaggregation variable, remove them using the filter() expression.
design_srvyr %>%
filter(!is.na(camp_out_of_camp)) %>%
group_by(camp_out_of_camp, sources_food) %>%
summarise(mean = survey_mean(vartype = "ci"))## # A tibble: 12 x 5
## # Groups: camp_out_of_camp [2]
## camp_out_of_camp sources_food mean mean_low mean_upp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 camp assistance_local 0.00594 -0.00568 0.0176
## 2 camp cash_assistance 0.0576 0.0286 0.0866
## 3 camp cash_own 0.461 0.396 0.527
## 4 camp credit 0.466 0.400 0.532
## 5 camp gift_family 0.00890 -0.00407 0.0219
## 6 out_of_camp assistance_un 0.00568 -0.00545 0.0168
## 7 out_of_camp cash_assistance 0.0275 0.00689 0.0481
## 8 out_of_camp cash_own 0.685 0.626 0.744
## 9 out_of_camp credit 0.253 0.198 0.307
## 10 out_of_camp gift_family 0.0165 -0.000148 0.0331
## 11 out_of_camp in_kind_labour 0.0114 -0.00431 0.0270
## 12 out_of_camp vouchers_PDS 0.00132 0.000234 0.00241
In addition to having your sample means and proportions, you might be interested in having an additional column in your output indicating the unweighted counts (i.e. number of respondents).
For categorical variables, specify a new variable (“count”) in the summarise() statement and define it as the number of entries using n(), and wrap a unweighted() function around it to make sure the results do not get weighted.
design_srvyr %>%
group_by(gender_respondent) %>%
summarise(mean = survey_mean(vartype = "ci"),
count = unweighted(n()))## # A tibble: 2 x 5
## gender_respondent mean mean_low mean_upp count
## <chr> <dbl> <dbl> <dbl> <int>
## 1 female 0.132 0.107 0.158 325
## 2 male 0.868 0.842 0.893 1475
For numerical variables, the syntax is slightly different:
design_srvyr %>%
summarise(mean = survey_mean(calc_total_income, na.rm = TRUE, vartype = "ci"),
count = unweighted(sum(!is.na(calc_total_income))))## mean mean_low mean_upp count
## 1 422302 394134.1 450469.9 1800