The dplyr package is part of the very popular tidyverse collection, which includes a number of very useful packages (such as dplyr, tidyr and ggplot2). dplyr allows you to quickly add/drop columns, filter data, mutate existing data, as well as summarize data. Given its speed, and relatively simple syntax, it is preferred by many R developers as the workhorse for data wrangling. Some of the most popular functions from the package are filter(), group_by() and summarise().
This tutorial goes over the most important functions and illustrates them using an example dataset from the 2020 MSNA in Iraq.
Data wrangling: Data wrangling is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.
The package includes many different functions. The most common ones are the following:
select(): Pick columns based on their names and/or typesfilter(): Pick rows based on conditions about their valuesmutate(): Create a new variable by mutating existing onescase_when(): To create statements with if/else logicgroup_by(): Group rows of observations togethersummarise(): Compute summary statistics of variablesjoin(): Merge two dataframes togetherOther useful functions are:
rename(): Rename columns in your datasetarrange(): Order rows by values of a columndistinct(): Remove rows with duplicate valuesbind_rows() / bind_cols(): Paste tables below/besides each otherHave a look at the dplyr cheat sheet for more functions.
%>%The pipe operator (%>%) is unique to the dplyr package and allows you to chain together different functions. The pipe operator can be read as “then” - it allows us to go from one step to the next.
For example, we can:
filter our data frame to only focus on a few rows (e.g. only keep refugee households) thengroup_by another variable to create groups (e.g. group by governorate) thensummarise this grouped data to calculate the mean for each level of the group. (e.g. derive average income of refugee households by governorate)All the functions below will be illustrated on an example dataset, which looks like the following:
| start | end | today | geographic_region | governorate | district | nationality | camp_out_of_camp | strata | age_respondent | gender_respondent | size_hh | calc_total_income | calc_total_expenditure | sources_food | food_security_index | accomodation_type | primary_livelihood | primary_livelihood.savings | primary_livelihood.employment | primary_livelihood.remittances | primary_livelihood.retirement_pension | primary_livelihood.selling_assets | primary_livelihood.selling_assistance_received | primary_livelihood.loans_debts | primary_livelihood.MODM_cash_assistance | primary_livelihood.support_from_community | primary_livelihood.NGO_charity_assistance | primary_livelihood.social_service | primary_livelihood.mosque_church | primary_livelihood.illegal_activity | primary_livelihood.other | primary_livelihood_other | X_uuid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-09-04T02:10:03.782+03:00 | 2020-09-04T14:43:12.323+03:00 | 04/09/2020 | NA | al_sulaymaniyah | al_sulaymaniyah | iran | NA | iran_NA_NA | 60 | female | 4 | 350000 | 375000 | credit | 3 | House | employment loans_debts | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 0009c45d-2940-45ff-aa1e-5fce6ac6bf57 |
| 2020-09-16T13:18:42.651+03:00 | 2020-09-16T13:59:56.989+03:00 | 16/09/2020 | al_sulaymaniyah | al_sulaymaniyah | al_sulaymaniyah | syria | camp | syria_al_sulaymaniyah_camp | 35 | male | 5 | 510000 | 310000 | credit | 2 | NA | employment loans_debts NGO_charity_assistance | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NA | 000fc018-03d2-4e57-9c0c-21b27fb929cd |
| 2020-09-13T09:45:35.867+03 | 2020-09-13T10:24:14.564+03 | 13/09/2020 | erbil | erbil | erbil | syria | camp | syria_erbil_camp | 33 | male | 6 | 232000 | 257000 | credit | 2 | NA | employment loans_debts NGO_charity_assistance | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NA | 0027aa1c-d305-4d6e-8134-6b654e817d4a |
| 2020-09-28T09:19:25.797+03:00 | 2020-09-28T10:04:22.460+03:00 | 28/09/2020 | duhok | duhok | duhok | syria | out_of_camp | syria_duhok_out_of_camp | 32 | male | 3 | 200000 | 610000 | credit | 1 | Apartment | employment | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 00a23f2f-7b8b-412a-8e98-e69eee0c7d2b |
| 2020-09-03T15:16:30.371+03:00 | 2020-09-03T15:57:11.240+03:00 | 03/09/2020 | NA | erbil | erbil | turkey | NA | turkey_NA_NA | 51 | male | 4 | 420000 | 395000 | gift_family | 2 | House | employment loans_debts NGO_charity_assistance | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NA | 00a3ac66-e7c1-4576-8288-501e4f2d9d53 |
| 2020-09-17T10:05:20.360+03:00 | 2020-09-17T10:49:10.425+03:00 | 17/09/2020 | al_sulaymaniyah | al_sulaymaniyah | al_sulaymaniyah | syria | out_of_camp | syria_al_sulaymaniyah_out_of_camp | 24 | male | 1 | 300000 | 235000 | cash_own | 1 | House | employment | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | 00fc88ac-864f-4e9e-9bb9-55a25e8f86ca |
select()Consider the above example dataset (“data”) downloaded from KoBo, and let’s assume we only want to keep certain columns and drop all unnecessary ones. There are several ways to go about this.
You can call columns you wish to keep by name. Simply define a new dataframe (or overwrite the existing one) as the old one, and then “pipe” the select() function to the expression. In the following example, we only want to keep the age and gender variables (age_respondent and gender_respondent).
If you want to look at the first few rows of the object we just defined above (“data_select”), simply call the object and wrap it in a head() function:
## age_respondent gender_respondent
## 1 60 female
## 2 35 male
## 3 33 male
## 4 32 male
## 5 51 male
## 6 24 male
If you wanted to include a range of columns, you could also do it like this (using “:” to define the range):
## governorate district nationality camp_out_of_camp
## 1 al_sulaymaniyah al_sulaymaniyah iran <NA>
## 2 al_sulaymaniyah al_sulaymaniyah syria camp
## 3 erbil erbil syria camp
## 4 duhok duhok syria out_of_camp
## 5 erbil erbil turkey <NA>
## 6 al_sulaymaniyah al_sulaymaniyah syria out_of_camp
Alternatively, you could exclude columns by name. Simply add a minus (“-”) sign in front of the respective variables. Let’s drop all the KoBo meta data columns like this:
You can also add functions to the select() statememt and select columns dynamically. For instance, if you wanted to exclude the multiple choice indicator “primary_livelihood” and all its binary variables (e.g. “primary_livelihood.savings”), you could do so by using starts_with().
If you wanted to subset all numeric columns (i.e. columns including numbers as opposed to characters), you could do so by calling is.numeric function from base R, and wrapping it in a where() function:
If you wanted to reorder the columns in your dataframe, you could do so using the everything() function inside select(), which can be read as “everything else”.
For instance, let’s say we wanted to move the “X_uuid” column to the beginning of the dataframe. We would achieve that with the following:
Check out the select() documentation for more options and examples.
filter()Suppose you have a dataset and you wish to exclude certain entries from your analysis. Let’s assume for instance, we select only the two columns “age_respondent” and “gender_respondent” (as in 3.1.1) but on top of that only keep records of respondents age 30 and below. We achieve this by simply piping a filter() function to the expression, like so:
## age_respondent gender_respondent
## 1 24 male
## 2 26 male
## 3 22 male
## 4 28 male
## 5 20 male
## 6 23 male
All the entries with respondents that are older than 30 years are now removed. Note how we combined the select() and filter() formulas.
Other examples of filter operators and functions you can use are:
==, >, >= etc.& (and), | (or), ! (not)is.na() (element is missing, or NA)between()You can also combine multiple conditions in one statement. The following expression filter the data by out-of camp non-Syrians with a food security index of at least 3.
data_filter <- data %>%
filter(camp_out_of_camp == "out_of_camp" & nationality != "syria" & food_security_index >= 3)More examples and documentation on filter() are found here.
mutate()Now that you know how to subset your dataset in two dimensions, let’s have a look at how you can modify values column-wise or create new variables, using the mutate() function.
Let’s say we wanted to convert the income (“calc_total_income”) and expenditure (“calc_total_expenditure”) variables from Iraqi dinar to USD using the exchange rate of 1,400 USD/IQD. Besides that we also create a new variable (“ratio”) showing the ratio of expenditure over income, which we round to one decimal using the round() function.
data %>%
select(calc_total_income, calc_total_expenditure) %>%
mutate(calc_total_income = calc_total_income / 1400,
calc_total_expenditure = calc_total_expenditure / 1400,
ratio = round(calc_total_expenditure / calc_total_income, 1)) %>%
head()## calc_total_income calc_total_expenditure ratio
## 1 250.0000 267.8571 1.1
## 2 364.2857 221.4286 0.6
## 3 165.7143 183.5714 1.1
## 4 142.8571 435.7143 3.0
## 5 300.0000 282.1429 0.9
## 6 214.2857 167.8571 0.8
In order to mutate multiple columns at once, refer to the following variations:
mutate_all(): to mutate all columns in data framemutate_if(): to mutate column meeting specified criteriamutate_at(): to mutate multiple columns specified in a vars() statementMore on mutate() can be found here.
case_when()The ‘case_when()’ function allows you to modify data based on various possible conditions, and is particularly useful if you want to create new variables or recode existing ones. You can put multiple conditions (or if-else statements) in one expression, which are evaluated in order.
Say we wanted to recode our income variable, and define categories (i.e. “low”, “normal” and “high”) depending on total income (“calc_total_income”). We could specify a case_when() statement within mutate() like this:
data %>%
select(calc_total_income) %>%
mutate(income_category = case_when(calc_total_income > 500000 ~ "high",
calc_total_income > 250000 ~ "normal",
TRUE ~ "low")) %>%
head()## calc_total_income income_category
## 1 350000 normal
## 2 510000 high
## 3 232000 low
## 4 200000 low
## 5 420000 normal
## 6 300000 normal
The function evaluates arguments one after another. This means that you must proceed from the most specific to the most general. In the above example, the last argument (TRUE ~) ensures that if none of the above cases match, “low” is returned. If none of the cases match, and TRUE is not specified as above, NA is returned.
More on case_when() can be found here.
summarise()summarise() reduces multiple values down to a single value, and thus allows you to summarise your dataset. Let’s assume we wanted to know the mean of total income (“calc_total_income”) across all rows, as well as the number of entries in our dataset. We can do so by using the summarise() function as follows:
## mean n
## 1 416134 1800
As you can see in this case, the summarise() statement reduces the dataset to one row and two columns (because we specified two variables).
In order to summarise multiple columns at once, refer to the following variations:
summarise_all(): to summarise all columns in a data framesummarise_if(): to summarise columns meeting specified criteriasummarise_at(): to summarise multiple columns specified in a vars() statementSummarise() allows you to pass any function. Besides some basic ones, such as mean(), median() and max(), you can add any function inside the statement, including non-standard aggregation functions that you defined yourself. This make summarise() very versatile and a key component of your R repertoire.
group_by()The above discussed functions get even more useful when combined with group_by(), which allows you to do group-wise operations. group_by() alone does not do too much by itself, but is incredibly useful when used together with other functions.
Let’s assume we want to once again calculate the mean of “calc_total_income”, but this time we want to get a separate result for each nationality (“nationality”). To do this, we simply pipe a group_by() statement to the expression:
## # A tibble: 4 x 2
## nationality mean
## <chr> <dbl>
## 1 iran 330359.
## 2 palestinian_territories 518851.
## 3 syria 405992.
## 4 turkey 422514.
Using group_by() in tandem with summarise() might be its most straight-forward application. However, you can also do more complex operations with it using other functions. For instance, if we wanted to calculate the deviation of each households’ income from the respective nationality mean, we could do so by running the following code:
data %>%
select(nationality, calc_total_income) %>%
group_by(nationality) %>%
mutate(deviation = calc_total_income - mean(calc_total_income)) %>%
head()## # A tibble: 6 x 3
## # Groups: nationality [3]
## nationality calc_total_income deviation
## <chr> <int> <dbl>
## 1 iran 350000 19641.
## 2 syria 510000 104008.
## 3 syria 232000 -173992.
## 4 syria 200000 -205992.
## 5 turkey 420000 -2514.
## 6 syria 300000 -105992.
Because the rows are grouped by nationality, the mean() statement only evaluates elements within each group.
If you wanted to pipe additional functions to the expression without the grouping dimension (or use another grouping dimension), add a line specifying ungroup() to clear previous grouping specifications. In the following, we are first calculating the mean income per governorate (“governorate”), and then use these means to calculate the country mean:
data %>%
group_by(governorate) %>%
summarise(mean_gov = mean(calc_total_income)) %>%
ungroup() %>%
summarise(mean_country = mean(mean_gov))## # A tibble: 1 x 1
## mean_country
## <dbl>
## 1 484520.
More examples and documentation on group_by() can be found here.
join()The ‘join()’ statement is the go-to function to merge two dataframes (x and y). Let’s consider the following example, where we have a sampling frame (y) with population estimates, which we want to join to our dataset (x).
## strata.names population
## 1 iran_NA_NA 3915
## 2 palestinian_territories_NA_NA 2916
## 3 turkey_NA_NA 2470
## 4 syria_al_sulaymaniyah_camp 2163
## 5 syria_al_sulaymaniyah_out_of_camp 7525
## 6 syria_centre_south_out_of_camp 1127
In this case, we aim to add the column “population” from the sampling frame to the dataset by calling a left_join():
The different join options are the following:
left_join(): includes all rows in x.inner_join(): includes all rows in x and y.right_join(): includes all rows in y.full_join(): includes all rows in x or y.rename()As the name indicates, this function can be used to rename one or multiple columns in the dataset. If we wanted to clean up the name of the income variable (from “calc_total_income” to “income”), we could do so by calling the following:
More documentation and examples are found here.
arrange()Say we wanted to order the rows in our dataframe by values of selected columns. In the following example we first sort by governorate and then by age of the respondent:
More documentation is found here.
distinct()Use distinct() if you want to get rid of duplicate rows in your dataset:
bind_rows() / bind_cols()If you want to combine multiple datasets, use bind_rows() / bind_cols().
Let’s assume you have two different datasets (“data_team_1” and “data_team_2”), which were collected and cleaned by different field teams. Before running analysis, you want to join them together like this;
A full list of additional dplyr functions can be found here.