R training: Data wrangling with dplyr

3. Key functions

All the functions below will be illustrated on an example dataset, which looks like the following:

start	end	today	geographic_region	governorate	district	nationality	camp_out_of_camp	strata	age_respondent	gender_respondent	size_hh	calc_total_income	calc_total_expenditure	sources_food	food_security_index	accomodation_type	primary_livelihood	primary_livelihood.employment	primary_livelihood.loans_debts	primary_livelihood.NGO_charity_assistance	primary_livelihood_other	X_uuid
2020-09-04T02:10:03.782+03:00	2020-09-04T14:43:12.323+03:00	04/09/2020	NA	al_sulaymaniyah	al_sulaymaniyah	iran	NA	iran_NA_NA	60	female	4	350000	375000	credit	3	House	employment loans_debts	1	1	0	NA	0009c45d-2940-45ff-aa1e-5fce6ac6bf57
2020-09-16T13:18:42.651+03:00	2020-09-16T13:59:56.989+03:00	16/09/2020	al_sulaymaniyah	al_sulaymaniyah	al_sulaymaniyah	syria	camp	syria_al_sulaymaniyah_camp	35	male	5	510000	310000	credit	2	NA	employment loans_debts NGO_charity_assistance	1	1	1	NA	000fc018-03d2-4e57-9c0c-21b27fb929cd
2020-09-13T09:45:35.867+03	2020-09-13T10:24:14.564+03	13/09/2020	erbil	erbil	erbil	syria	camp	syria_erbil_camp	33	male	6	232000	257000	credit	2	NA	employment loans_debts NGO_charity_assistance	1	1	1	NA	0027aa1c-d305-4d6e-8134-6b654e817d4a
2020-09-28T09:19:25.797+03:00	2020-09-28T10:04:22.460+03:00	28/09/2020	duhok	duhok	duhok	syria	out_of_camp	syria_duhok_out_of_camp	32	male	3	200000	610000	credit	1	Apartment	employment	1	0	0	NA	00a23f2f-7b8b-412a-8e98-e69eee0c7d2b
2020-09-03T15:16:30.371+03:00	2020-09-03T15:57:11.240+03:00	03/09/2020	NA	erbil	erbil	turkey	NA	turkey_NA_NA	51	male	4	420000	395000	gift_family	2	House	employment loans_debts NGO_charity_assistance	1	1	1	NA	00a3ac66-e7c1-4576-8288-501e4f2d9d53
2020-09-17T10:05:20.360+03:00	2020-09-17T10:49:10.425+03:00	17/09/2020	al_sulaymaniyah	al_sulaymaniyah	al_sulaymaniyah	syria	out_of_camp	syria_al_sulaymaniyah_out_of_camp	24	male	1	300000	235000	cash_own	1	House	employment	1	0	0	NA	00fc88ac-864f-4e9e-9bb9-55a25e8f86ca

3.1 `select()`

Consider the above example dataset (“data”) downloaded from KoBo, and let’s assume we only want to keep certain columns and drop all unnecessary ones. There are several ways to go about this.

3.1.1 Include variables by name

You can call columns you wish to keep by name. Simply define a new dataframe (or overwrite the existing one) as the old one, and then “pipe” the select() function to the expression. In the following example, we only want to keep the age and gender variables (age_respondent and gender_respondent).

data_select <- data %>%
  select(age_respondent, gender_respondent)

If you want to look at the first few rows of the object we just defined above (“data_select”), simply call the object and wrap it in a head() function:

head(data_select)

##   age_respondent gender_respondent
## 1             60            female
## 2             35              male
## 3             33              male
## 4             32              male
## 5             51              male
## 6             24              male

If you wanted to include a range of columns, you could also do it like this (using “:” to define the range):

head(data) %>%
  select(governorate:camp_out_of_camp)

##       governorate        district nationality camp_out_of_camp
## 1 al_sulaymaniyah al_sulaymaniyah        iran             <NA>
## 2 al_sulaymaniyah al_sulaymaniyah       syria             camp
## 3           erbil           erbil       syria             camp
## 4           duhok           duhok       syria      out_of_camp
## 5           erbil           erbil      turkey             <NA>
## 6 al_sulaymaniyah al_sulaymaniyah       syria      out_of_camp

3.1.2 Exclude variables by name

Alternatively, you could exclude columns by name. Simply add a minus (“-”) sign in front of the respective variables. Let’s drop all the KoBo meta data columns like this:

data_select <- data %>%
  select(-start, -end, -X_uuid)

You can also add functions to the select() statememt and select columns dynamically. For instance, if you wanted to exclude the multiple choice indicator “primary_livelihood” and all its binary variables (e.g. “primary_livelihood.savings”), you could do so by using starts_with().

data_select <- data %>%
  select(-starts_with("primary_livelihood"))

3.1.3 Include/exclude variables by type

If you wanted to subset all numeric columns (i.e. columns including numbers as opposed to characters), you could do so by calling is.numeric function from base R, and wrapping it in a where() function:

data_select <- data %>%
  select(where(is.numeric))

3.1.4 Order columns with everything

If you wanted to reorder the columns in your dataframe, you could do so using the everything() function inside select(), which can be read as “everything else”.

For instance, let’s say we wanted to move the “X_uuid” column to the beginning of the dataframe. We would achieve that with the following:

data_select <- data %>%
  select(X_uuid, everything())

Check out the select() documentation for more options and examples.

3.2 `filter()`

Suppose you have a dataset and you wish to exclude certain entries from your analysis. Let’s assume for instance, we select only the two columns “age_respondent” and “gender_respondent” (as in 3.1.1) but on top of that only keep records of respondents age 30 and below. We achieve this by simply piping a filter() function to the expression, like so:

data %>%
  select(age_respondent, gender_respondent) %>%
  filter(age_respondent <= 30) %>%
  head()

##   age_respondent gender_respondent
## 1             24              male
## 2             26              male
## 3             22              male
## 4             28              male
## 5             20              male
## 6             23              male

All the entries with respondents that are older than 30 years are now removed. Note how we combined the select() and filter() formulas.

Other examples of filter operators and functions you can use are:

==, >, >= etc.
& (and), | (or), ! (not)
is.na() (element is missing, or NA)
between()

You can also combine multiple conditions in one statement. The following expression filter the data by out-of camp non-Syrians with a food security index of at least 3.

data_filter <- data %>%
  filter(camp_out_of_camp == "out_of_camp" & nationality != "syria" & food_security_index >= 3)

More examples and documentation on filter() are found here.

3.3 `mutate()`

Now that you know how to subset your dataset in two dimensions, let’s have a look at how you can modify values column-wise or create new variables, using the mutate() function.

Let’s say we wanted to convert the income (“calc_total_income”) and expenditure (“calc_total_expenditure”) variables from Iraqi dinar to USD using the exchange rate of 1,400 USD/IQD. Besides that we also create a new variable (“ratio”) showing the ratio of expenditure over income, which we round to one decimal using the round() function.

data %>%
  select(calc_total_income, calc_total_expenditure) %>%
  mutate(calc_total_income      = calc_total_income / 1400,
         calc_total_expenditure = calc_total_expenditure / 1400,
         ratio                  = round(calc_total_expenditure / calc_total_income, 1)) %>%
  head()

##   calc_total_income calc_total_expenditure ratio
## 1          250.0000               267.8571   1.1
## 2          364.2857               221.4286   0.6
## 3          165.7143               183.5714   1.1
## 4          142.8571               435.7143   3.0
## 5          300.0000               282.1429   0.9
## 6          214.2857               167.8571   0.8

In order to mutate multiple columns at once, refer to the following variations:

mutate_all(): to mutate all columns in data frame
mutate_if(): to mutate column meeting specified criteria
mutate_at(): to mutate multiple columns specified in a vars() statement

3.4 `case_when()`

The ‘case_when()’ function allows you to modify data based on various possible conditions, and is particularly useful if you want to create new variables or recode existing ones. You can put multiple conditions (or if-else statements) in one expression, which are evaluated in order.

Say we wanted to recode our income variable, and define categories (i.e. “low”, “normal” and “high”) depending on total income (“calc_total_income”). We could specify a case_when() statement within mutate() like this:

data %>%
  select(calc_total_income) %>%
  mutate(income_category = case_when(calc_total_income > 500000 ~ "high",
                                     calc_total_income > 250000 ~ "normal",
                                     TRUE ~ "low")) %>%
  head()

##   calc_total_income income_category
## 1            350000          normal
## 2            510000            high
## 3            232000             low
## 4            200000             low
## 5            420000          normal
## 6            300000          normal

The function evaluates arguments one after another. This means that you must proceed from the most specific to the most general. In the above example, the last argument (TRUE ~) ensures that if none of the above cases match, “low” is returned. If none of the cases match, and TRUE is not specified as above, NA is returned.

More on case_when() can be found here.

3.5 `summarise()`

summarise() reduces multiple values down to a single value, and thus allows you to summarise your dataset. Let’s assume we wanted to know the mean of total income (“calc_total_income”) across all rows, as well as the number of entries in our dataset. We can do so by using the summarise() function as follows:

data %>%
  summarise(mean = mean(calc_total_income),
            n    = n())

##     mean    n
## 1 416134 1800

As you can see in this case, the summarise() statement reduces the dataset to one row and two columns (because we specified two variables).

In order to summarise multiple columns at once, refer to the following variations:

summarise_all(): to summarise all columns in a data frame
summarise_if(): to summarise columns meeting specified criteria
summarise_at(): to summarise multiple columns specified in a vars() statement

Summarise() allows you to pass any function. Besides some basic ones, such as mean(), median() and max(), you can add any function inside the statement, including non-standard aggregation functions that you defined yourself. This make summarise() very versatile and a key component of your R repertoire.

3.6 `group_by()`

The above discussed functions get even more useful when combined with group_by(), which allows you to do group-wise operations. group_by() alone does not do too much by itself, but is incredibly useful when used together with other functions.

Let’s assume we want to once again calculate the mean of “calc_total_income”, but this time we want to get a separate result for each nationality (“nationality”). To do this, we simply pipe a group_by() statement to the expression:

data %>%
  group_by(nationality) %>%
  summarise(mean = mean(calc_total_income))

## # A tibble: 4 x 2
##   nationality                mean
##   <chr>                     <dbl>
## 1 iran                    330359.
## 2 palestinian_territories 518851.
## 3 syria                   405992.
## 4 turkey                  422514.

Using group_by() in tandem with summarise() might be its most straight-forward application. However, you can also do more complex operations with it using other functions. For instance, if we wanted to calculate the deviation of each households’ income from the respective nationality mean, we could do so by running the following code:

data %>%
  select(nationality, calc_total_income) %>%
  group_by(nationality) %>%
  mutate(deviation = calc_total_income - mean(calc_total_income)) %>%
  head()

## # A tibble: 6 x 3
## # Groups:   nationality [3]
##   nationality calc_total_income deviation
##   <chr>                   <int>     <dbl>
## 1 iran                   350000    19641.
## 2 syria                  510000   104008.
## 3 syria                  232000  -173992.
## 4 syria                  200000  -205992.
## 5 turkey                 420000    -2514.
## 6 syria                  300000  -105992.

Because the rows are grouped by nationality, the mean() statement only evaluates elements within each group.

If you wanted to pipe additional functions to the expression without the grouping dimension (or use another grouping dimension), add a line specifying ungroup() to clear previous grouping specifications. In the following, we are first calculating the mean income per governorate (“governorate”), and then use these means to calculate the country mean:

data %>%
  group_by(governorate) %>%
  summarise(mean_gov = mean(calc_total_income)) %>%
  ungroup() %>%
  summarise(mean_country = mean(mean_gov))

## # A tibble: 1 x 1
##   mean_country
##          <dbl>
## 1      484520.

More examples and documentation on group_by() can be found here.

3.7 `join()`

The ‘join()’ statement is the go-to function to merge two dataframes (x and y). Let’s consider the following example, where we have a sampling frame (y) with population estimates, which we want to join to our dataset (x).

sampling_frame %>%
  head()

##                        strata.names population
## 1                        iran_NA_NA       3915
## 2     palestinian_territories_NA_NA       2916
## 3                      turkey_NA_NA       2470
## 4        syria_al_sulaymaniyah_camp       2163
## 5 syria_al_sulaymaniyah_out_of_camp       7525
## 6    syria_centre_south_out_of_camp       1127

In this case, we aim to add the column “population” from the sampling frame to the dataset by calling a left_join():

data <- left_join(data, sampling_frame, by = c("strata" = "strata.names"))

The different join options are the following:

left_join(): includes all rows in x.
inner_join(): includes all rows in x and y.
right_join(): includes all rows in y.
full_join(): includes all rows in x or y.

R training: Data wrangling with dplyr

IMPACT Initiatives - Iraq (Mar 2021)

0. Pretext

1. Overview of data wrangling functions

2. The Pipe `%>%`

3. Key functions

3.1 `select()`

3.1.1 Include variables by name

3.1.2 Exclude variables by name

3.1.3 Include/exclude variables by type

3.1.4 Order columns with everything

3.2 `filter()`

3.3 `mutate()`

3.4 `case_when()`

3.5 `summarise()`

3.6 `group_by()`

3.7 `join()`

4. Additional functions

4.1 `rename()`

4.2 `arrange()`

4.3 `distinct()`

4.4 `bind_rows()` / `bind_cols()`

R training: Data wrangling with dplyr

IMPACT Initiatives - Iraq (Mar 2021)

0. Pretext

1. Overview of data wrangling functions

2. The Pipe %>%

3. Key functions

3.1 select()

3.1.1 Include variables by name

3.1.2 Exclude variables by name

3.1.3 Include/exclude variables by type

3.1.4 Order columns with everything

3.2 filter()

3.3 mutate()

3.4 case_when()

3.5 summarise()

3.6 group_by()

3.7 join()

4. Additional functions

4.1 rename()

4.2 arrange()

4.3 distinct()

4.4 bind_rows() / bind_cols()

2. The Pipe `%>%`

3.1 `select()`

3.2 `filter()`

3.3 `mutate()`

3.4 `case_when()`

3.5 `summarise()`

3.6 `group_by()`

3.7 `join()`

4.1 `rename()`

4.2 `arrange()`

4.3 `distinct()`

4.4 `bind_rows()` / `bind_cols()`