R training: Data wrangling with dplyr

IMPACT Initiatives - Iraq (Mar 2021)

0. Pretext

The dplyr package is part of the very popular tidyverse collection, which includes a number of very useful packages (such as dplyr, tidyr and ggplot2). dplyr allows you to quickly add/drop columns, filter data, mutate existing data, as well as summarize data. Given its speed, and relatively simple syntax, it is preferred by many R developers as the workhorse for data wrangling. Some of the most popular functions from the package are filter(), group_by() and summarise().

This tutorial goes over the most important functions and illustrates them using an example dataset from the 2020 MSNA in Iraq.

Data wrangling: Data wrangling is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.

1. Overview of data wrangling functions

The package includes many different functions. The most common ones are the following:

Other useful functions are:

Have a look at the dplyr cheat sheet for more functions.

2. The Pipe %>%

The pipe operator (%>%) is unique to the dplyr package and allows you to chain together different functions. The pipe operator can be read as “then” - it allows us to go from one step to the next.

For example, we can:

3. Key functions

All the functions below will be illustrated on an example dataset, which looks like the following:

start end today geographic_region governorate district nationality camp_out_of_camp strata age_respondent gender_respondent size_hh calc_total_income calc_total_expenditure sources_food food_security_index accomodation_type primary_livelihood primary_livelihood.savings primary_livelihood.employment primary_livelihood.remittances primary_livelihood.retirement_pension primary_livelihood.selling_assets primary_livelihood.selling_assistance_received primary_livelihood.loans_debts primary_livelihood.MODM_cash_assistance primary_livelihood.support_from_community primary_livelihood.NGO_charity_assistance primary_livelihood.social_service primary_livelihood.mosque_church primary_livelihood.illegal_activity primary_livelihood.other primary_livelihood_other X_uuid
2020-09-04T02:10:03.782+03:00 2020-09-04T14:43:12.323+03:00 04/09/2020 NA al_sulaymaniyah al_sulaymaniyah iran NA iran_NA_NA 60 female 4 350000 375000 credit 3 House employment loans_debts 0 1 0 0 0 0 1 0 0 0 0 0 0 0 NA 0009c45d-2940-45ff-aa1e-5fce6ac6bf57
2020-09-16T13:18:42.651+03:00 2020-09-16T13:59:56.989+03:00 16/09/2020 al_sulaymaniyah al_sulaymaniyah al_sulaymaniyah syria camp syria_al_sulaymaniyah_camp 35 male 5 510000 310000 credit 2 NA employment loans_debts NGO_charity_assistance 0 1 0 0 0 0 1 0 0 1 0 0 0 0 NA 000fc018-03d2-4e57-9c0c-21b27fb929cd
2020-09-13T09:45:35.867+03 2020-09-13T10:24:14.564+03 13/09/2020 erbil erbil erbil syria camp syria_erbil_camp 33 male 6 232000 257000 credit 2 NA employment loans_debts NGO_charity_assistance 0 1 0 0 0 0 1 0 0 1 0 0 0 0 NA 0027aa1c-d305-4d6e-8134-6b654e817d4a
2020-09-28T09:19:25.797+03:00 2020-09-28T10:04:22.460+03:00 28/09/2020 duhok duhok duhok syria out_of_camp syria_duhok_out_of_camp 32 male 3 200000 610000 credit 1 Apartment employment 0 1 0 0 0 0 0 0 0 0 0 0 0 0 NA 00a23f2f-7b8b-412a-8e98-e69eee0c7d2b
2020-09-03T15:16:30.371+03:00 2020-09-03T15:57:11.240+03:00 03/09/2020 NA erbil erbil turkey NA turkey_NA_NA 51 male 4 420000 395000 gift_family 2 House employment loans_debts NGO_charity_assistance 0 1 0 0 0 0 1 0 0 1 0 0 0 0 NA 00a3ac66-e7c1-4576-8288-501e4f2d9d53
2020-09-17T10:05:20.360+03:00 2020-09-17T10:49:10.425+03:00 17/09/2020 al_sulaymaniyah al_sulaymaniyah al_sulaymaniyah syria out_of_camp syria_al_sulaymaniyah_out_of_camp 24 male 1 300000 235000 cash_own 1 House employment 0 1 0 0 0 0 0 0 0 0 0 0 0 0 NA 00fc88ac-864f-4e9e-9bb9-55a25e8f86ca

3.1 select()

Consider the above example dataset (“data”) downloaded from KoBo, and let’s assume we only want to keep certain columns and drop all unnecessary ones. There are several ways to go about this.

3.1.1 Include variables by name

You can call columns you wish to keep by name. Simply define a new dataframe (or overwrite the existing one) as the old one, and then “pipe” the select() function to the expression. In the following example, we only want to keep the age and gender variables (age_respondent and gender_respondent).

data_select <- data %>%
  select(age_respondent, gender_respondent)

If you want to look at the first few rows of the object we just defined above (“data_select”), simply call the object and wrap it in a head() function:

head(data_select)
##   age_respondent gender_respondent
## 1             60            female
## 2             35              male
## 3             33              male
## 4             32              male
## 5             51              male
## 6             24              male

If you wanted to include a range of columns, you could also do it like this (using “:” to define the range):

head(data) %>%
  select(governorate:camp_out_of_camp)
##       governorate        district nationality camp_out_of_camp
## 1 al_sulaymaniyah al_sulaymaniyah        iran             <NA>
## 2 al_sulaymaniyah al_sulaymaniyah       syria             camp
## 3           erbil           erbil       syria             camp
## 4           duhok           duhok       syria      out_of_camp
## 5           erbil           erbil      turkey             <NA>
## 6 al_sulaymaniyah al_sulaymaniyah       syria      out_of_camp

3.1.2 Exclude variables by name

Alternatively, you could exclude columns by name. Simply add a minus (“-”) sign in front of the respective variables. Let’s drop all the KoBo meta data columns like this:

data_select <- data %>%
  select(-start, -end, -X_uuid)

You can also add functions to the select() statememt and select columns dynamically. For instance, if you wanted to exclude the multiple choice indicator “primary_livelihood” and all its binary variables (e.g. “primary_livelihood.savings”), you could do so by using starts_with().

data_select <- data %>%
  select(-starts_with("primary_livelihood"))

3.1.3 Include/exclude variables by type

If you wanted to subset all numeric columns (i.e. columns including numbers as opposed to characters), you could do so by calling is.numeric function from base R, and wrapping it in a where() function:

data_select <- data %>%
  select(where(is.numeric))

3.1.4 Order columns with everything

If you wanted to reorder the columns in your dataframe, you could do so using the everything() function inside select(), which can be read as “everything else”.

For instance, let’s say we wanted to move the “X_uuid” column to the beginning of the dataframe. We would achieve that with the following:

data_select <- data %>%
  select(X_uuid, everything())

Check out the select() documentation for more options and examples.

back to top

3.2 filter()

Suppose you have a dataset and you wish to exclude certain entries from your analysis. Let’s assume for instance, we select only the two columns “age_respondent” and “gender_respondent” (as in 3.1.1) but on top of that only keep records of respondents age 30 and below. We achieve this by simply piping a filter() function to the expression, like so:

data %>%
  select(age_respondent, gender_respondent) %>%
  filter(age_respondent <= 30) %>%
  head()
##   age_respondent gender_respondent
## 1             24              male
## 2             26              male
## 3             22              male
## 4             28              male
## 5             20              male
## 6             23              male

All the entries with respondents that are older than 30 years are now removed. Note how we combined the select() and filter() formulas.

Other examples of filter operators and functions you can use are:

You can also combine multiple conditions in one statement. The following expression filter the data by out-of camp non-Syrians with a food security index of at least 3.

data_filter <- data %>%
  filter(camp_out_of_camp == "out_of_camp" & nationality != "syria" & food_security_index >= 3)

More examples and documentation on filter() are found here.

back to top

3.3 mutate()

Now that you know how to subset your dataset in two dimensions, let’s have a look at how you can modify values column-wise or create new variables, using the mutate() function.

Let’s say we wanted to convert the income (“calc_total_income”) and expenditure (“calc_total_expenditure”) variables from Iraqi dinar to USD using the exchange rate of 1,400 USD/IQD. Besides that we also create a new variable (“ratio”) showing the ratio of expenditure over income, which we round to one decimal using the round() function.

data %>%
  select(calc_total_income, calc_total_expenditure) %>%
  mutate(calc_total_income      = calc_total_income / 1400,
         calc_total_expenditure = calc_total_expenditure / 1400,
         ratio                  = round(calc_total_expenditure / calc_total_income, 1)) %>%
  head()
##   calc_total_income calc_total_expenditure ratio
## 1          250.0000               267.8571   1.1
## 2          364.2857               221.4286   0.6
## 3          165.7143               183.5714   1.1
## 4          142.8571               435.7143   3.0
## 5          300.0000               282.1429   0.9
## 6          214.2857               167.8571   0.8

In order to mutate multiple columns at once, refer to the following variations:

More on mutate() can be found here.

back to top

3.4 case_when()

The ‘case_when()’ function allows you to modify data based on various possible conditions, and is particularly useful if you want to create new variables or recode existing ones. You can put multiple conditions (or if-else statements) in one expression, which are evaluated in order.

Say we wanted to recode our income variable, and define categories (i.e. “low”, “normal” and “high”) depending on total income (“calc_total_income”). We could specify a case_when() statement within mutate() like this:

data %>%
  select(calc_total_income) %>%
  mutate(income_category = case_when(calc_total_income > 500000 ~ "high",
                                     calc_total_income > 250000 ~ "normal",
                                     TRUE ~ "low")) %>%
  head()
##   calc_total_income income_category
## 1            350000          normal
## 2            510000            high
## 3            232000             low
## 4            200000             low
## 5            420000          normal
## 6            300000          normal

The function evaluates arguments one after another. This means that you must proceed from the most specific to the most general. In the above example, the last argument (TRUE ~) ensures that if none of the above cases match, “low” is returned. If none of the cases match, and TRUE is not specified as above, NA is returned.

More on case_when() can be found here.

back to top

3.5 summarise()

summarise() reduces multiple values down to a single value, and thus allows you to summarise your dataset. Let’s assume we wanted to know the mean of total income (“calc_total_income”) across all rows, as well as the number of entries in our dataset. We can do so by using the summarise() function as follows:

data %>%
  summarise(mean = mean(calc_total_income),
            n    = n())
##     mean    n
## 1 416134 1800

As you can see in this case, the summarise() statement reduces the dataset to one row and two columns (because we specified two variables).

In order to summarise multiple columns at once, refer to the following variations:

Summarise() allows you to pass any function. Besides some basic ones, such as mean(), median() and max(), you can add any function inside the statement, including non-standard aggregation functions that you defined yourself. This make summarise() very versatile and a key component of your R repertoire.

back to top

3.6 group_by()

The above discussed functions get even more useful when combined with group_by(), which allows you to do group-wise operations. group_by() alone does not do too much by itself, but is incredibly useful when used together with other functions.

Let’s assume we want to once again calculate the mean of “calc_total_income”, but this time we want to get a separate result for each nationality (“nationality”). To do this, we simply pipe a group_by() statement to the expression:

data %>%
  group_by(nationality) %>%
  summarise(mean = mean(calc_total_income))
## # A tibble: 4 x 2
##   nationality                mean
##   <chr>                     <dbl>
## 1 iran                    330359.
## 2 palestinian_territories 518851.
## 3 syria                   405992.
## 4 turkey                  422514.

Using group_by() in tandem with summarise() might be its most straight-forward application. However, you can also do more complex operations with it using other functions. For instance, if we wanted to calculate the deviation of each households’ income from the respective nationality mean, we could do so by running the following code:

data %>%
  select(nationality, calc_total_income) %>%
  group_by(nationality) %>%
  mutate(deviation = calc_total_income - mean(calc_total_income)) %>%
  head()
## # A tibble: 6 x 3
## # Groups:   nationality [3]
##   nationality calc_total_income deviation
##   <chr>                   <int>     <dbl>
## 1 iran                   350000    19641.
## 2 syria                  510000   104008.
## 3 syria                  232000  -173992.
## 4 syria                  200000  -205992.
## 5 turkey                 420000    -2514.
## 6 syria                  300000  -105992.

Because the rows are grouped by nationality, the mean() statement only evaluates elements within each group.

If you wanted to pipe additional functions to the expression without the grouping dimension (or use another grouping dimension), add a line specifying ungroup() to clear previous grouping specifications. In the following, we are first calculating the mean income per governorate (“governorate”), and then use these means to calculate the country mean:

data %>%
  group_by(governorate) %>%
  summarise(mean_gov = mean(calc_total_income)) %>%
  ungroup() %>%
  summarise(mean_country = mean(mean_gov))
## # A tibble: 1 x 1
##   mean_country
##          <dbl>
## 1      484520.

More examples and documentation on group_by() can be found here.

back to top

3.7 join()

The ‘join()’ statement is the go-to function to merge two dataframes (x and y). Let’s consider the following example, where we have a sampling frame (y) with population estimates, which we want to join to our dataset (x).

sampling_frame %>%
  head()
##                        strata.names population
## 1                        iran_NA_NA       3915
## 2     palestinian_territories_NA_NA       2916
## 3                      turkey_NA_NA       2470
## 4        syria_al_sulaymaniyah_camp       2163
## 5 syria_al_sulaymaniyah_out_of_camp       7525
## 6    syria_centre_south_out_of_camp       1127

In this case, we aim to add the column “population” from the sampling frame to the dataset by calling a left_join():

data <- left_join(data, sampling_frame, by = c("strata" = "strata.names"))

The different join options are the following:

back to top

4. Additional functions

4.1 rename()

As the name indicates, this function can be used to rename one or multiple columns in the dataset. If we wanted to clean up the name of the income variable (from “calc_total_income” to “income”), we could do so by calling the following:

data %>%
  rename(income = calc_total_income)

More documentation and examples are found here.

4.2 arrange()

Say we wanted to order the rows in our dataframe by values of selected columns. In the following example we first sort by governorate and then by age of the respondent:

data %>%
  arrange(governorate, age_respondent)

More documentation is found here.

4.3 distinct()

Use distinct() if you want to get rid of duplicate rows in your dataset:

data <- distinct(data)

4.4 bind_rows() / bind_cols()

If you want to combine multiple datasets, use bind_rows() / bind_cols().

Let’s assume you have two different datasets (“data_team_1” and “data_team_2”), which were collected and cleaned by different field teams. Before running analysis, you want to join them together like this;

data <- bind_rows(data_team_1, data_team_2)

A full list of additional dplyr functions can be found here.

back to top