Gordon Goodwin
This tutorial will cover the essential basic dplyr functions that create the foundation for efficient data wrangling.
Let’s first create a sample dataset consisting of some basic employee data.
set.seed(999)
dat1 <- tibble(emp_id = sample(100000:200000,
size = 30),
department = c(rep("HR",10),
rep("IT",10),
rep("Sales",10)),
salary_k = round(c(rnorm(10,60,5),
rnorm(10,100,7),
rnorm(10,40,20)),2),
age = sample(20:66,30,T),
gender = sample(c("Male","Female","Non-binary"),30,T,
prob = c(0.45,0.45,0.1)),
business_unit_id = sample(c("B01","AH2"),30,T),
emp_status = sample(c("Active","Termed"),30,T,
prob = c(0.8,0.2)))
head(dat1)## # A tibble: 6 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 138211 HR 60.0 62 Female B01 Active
## 2 155878 HR 53.6 63 Female AH2 Active
## 3 107821 HR 54.4 60 Female AH2 Active
## 4 105305 HR 61.5 20 Female B01 Active
## 5 136118 HR 61.4 47 Female B01 Termed
## 6 151590 HR 49.8 62 Male AH2 Active
We can use filter() to select rows based on conditions. We can use a single condition, multiple conditions, or either/or conditions.
1. Filter for employees over the age of 60:
dat1 %>%
filter(age > 60)## # A tibble: 5 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 138211 HR 60.0 62 Female B01 Active
## 2 155878 HR 53.6 63 Female AH2 Active
## 3 151590 HR 49.8 62 Male AH2 Active
## 4 187244 Sales 39.6 63 Male AH2 Active
## 5 140579 Sales 37.9 62 Female B01 Active
2. Filter for employees who identify as Male:
dat1 %>%
filter(gender=="Male")## # A tibble: 9 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 151590 HR 49.8 62 Male AH2 Active
## 2 104586 HR 60.1 57 Male B01 Active
## 3 114776 IT 95.5 39 Male AH2 Active
## 4 150894 IT 112. 35 Male B01 Active
## 5 130977 IT 99.5 44 Male AH2 Active
## 6 139110 IT 103. 40 Male AH2 Active
## 7 150997 IT 93.5 24 Male AH2 Active
## 8 161782 Sales 63.3 25 Male AH2 Active
## 9 187244 Sales 39.6 63 Male AH2 Active
1. Filter for employees who are either belong to the IT department OR have a salary of over $80k:
dat1 %>%
filter(department=="IT" | salary_k > 80)## # A tibble: 10 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 114776 IT 95.5 39 Male AH2 Active
## 2 150894 IT 112. 35 Male B01 Active
## 3 128534 IT 103. 54 Female B01 Active
## 4 130977 IT 99.5 44 Male AH2 Active
## 5 118784 IT 102. 21 Female AH2 Active
## 6 116113 IT 104. 20 Female AH2 Active
## 7 169159 IT 91.0 59 Non-binary AH2 Active
## 8 139110 IT 103. 40 Male AH2 Active
## 9 154657 IT 96.0 57 Female AH2 Active
## 10 150997 IT 93.5 24 Male AH2 Active
2. Filter for employees who are Female AND are in the Sales department:
dat1 %>%
filter(gender=="Female" & department=="Sales")## # A tibble: 7 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 132449 Sales 60.8 28 Female B01 Active
## 2 186174 Sales 17.1 34 Female AH2 Active
## 3 195322 Sales 11.8 29 Female B01 Active
## 4 112459 Sales 34.4 39 Female AH2 Termed
## 5 131186 Sales 31.6 24 Female B01 Active
## 6 181702 Sales 59.9 45 Female AH2 Active
## 7 140579 Sales 37.9 62 Female B01 Active
We can use slice() to “filter” (bad choice of words), or index, rows based on their position. There are also several helper functions that speed up the process and/or allow for more complex manipulation goals.
1. Pull data for the employees in the first 5 rows of our dataset:
dat1 %>%
slice(1:5)## # A tibble: 5 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 138211 HR 60.0 62 Female B01 Active
## 2 155878 HR 53.6 63 Female AH2 Active
## 3 107821 HR 54.4 60 Female AH2 Active
## 4 105305 HR 61.5 20 Female B01 Active
## 5 136118 HR 61.4 47 Female B01 Termed
2. Pull data for the 3rd and 8th employee:
dat1 %>%
slice(c(3,8))## # A tibble: 2 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 107821 HR 54.4 60 Female AH2 Active
## 2 122418 HR 62.9 28 Female AH2 Active
dat1 %>%
slice_head(n = 3)## # A tibble: 3 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 138211 HR 60.0 62 Female B01 Active
## 2 155878 HR 53.6 63 Female AH2 Active
## 3 107821 HR 54.4 60 Female AH2 Active
dat1 %>%
slice_tail(n = 4)## # A tibble: 4 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 112459 Sales 34.4 39 Female AH2 Termed
## 2 131186 Sales 31.6 24 Female B01 Active
## 3 181702 Sales 59.9 45 Female AH2 Active
## 4 140579 Sales 37.9 62 Female B01 Active
dat1 %>%
slice_max(order_by = salary_k, n = 3)## # A tibble: 3 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 150894 IT 112. 35 Male B01 Active
## 2 116113 IT 104. 20 Female AH2 Active
## 3 139110 IT 103. 40 Male AH2 Active
dat1 %>%
slice_min(order_by = age, n = 1)## # A tibble: 2 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 105305 HR 61.5 20 Female B01 Active
## 2 116113 IT 104. 20 Female AH2 Active
dat1 %>%
slice_sample(n = 5)## # A tibble: 5 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 195322 Sales 11.8 29 Female B01 Active
## 2 161782 Sales 63.3 25 Male AH2 Active
## 3 136118 HR 61.4 47 Female B01 Termed
## 4 181702 Sales 59.9 45 Female AH2 Active
## 5 169159 IT 91.0 59 Non-binary AH2 Active
dat1 %>%
slice_sample(prop = .2)## # A tibble: 6 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 130070 HR 59.4 42 Female AH2 Active
## 2 139110 IT 103. 40 Male AH2 Active
## 3 169159 IT 91.0 59 Non-binary AH2 Active
## 4 187244 Sales 39.6 63 Male AH2 Active
## 5 132449 Sales 60.8 28 Female B01 Active
## 6 181702 Sales 59.9 45 Female AH2 Active
Note: With slice_sample(), the replace = TRUE argument can be specified to perform a bootstrapped sample.
We can use arrange() to sort rows by a single variable or by multiple variables in either ascending or descending order. This can also be used for character/categorical variables, not just numerics.
1. Sort employees by age in ascending order:
dat1 %>%
arrange(age)## # A tibble: 30 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 105305 HR 61.5 20 Female B01 Active
## 2 116113 IT 104. 20 Female AH2 Active
## 3 118784 IT 102. 21 Female AH2 Active
## 4 156989 Sales 62.1 23 Non-binary B01 Active
## 5 150997 IT 93.5 24 Male AH2 Active
## 6 131186 Sales 31.6 24 Female B01 Active
## 7 161782 Sales 63.3 25 Male AH2 Active
## 8 122418 HR 62.9 28 Female AH2 Active
## 9 132449 Sales 60.8 28 Female B01 Active
## 10 195322 Sales 11.8 29 Female B01 Active
## # … with 20 more rows
2. Sort employees by department in descending (reverse-alphabetical) order:
dat1 %>%
arrange(desc(department))## # A tibble: 30 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 161782 Sales 63.3 25 Male AH2 Active
## 2 132449 Sales 60.8 28 Female B01 Active
## 3 156989 Sales 62.1 23 Non-binary B01 Active
## 4 187244 Sales 39.6 63 Male AH2 Active
## 5 186174 Sales 17.1 34 Female AH2 Active
## 6 195322 Sales 11.8 29 Female B01 Active
## 7 112459 Sales 34.4 39 Female AH2 Termed
## 8 131186 Sales 31.6 24 Female B01 Active
## 9 181702 Sales 59.9 45 Female AH2 Active
## 10 140579 Sales 37.9 62 Female B01 Active
## # … with 20 more rows
1. Sort by both Salary (descending) and Age (ascending):
dat1 %>%
arrange(desc(salary_k),age)## # A tibble: 30 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 150894 IT 112. 35 Male B01 Active
## 2 116113 IT 104. 20 Female AH2 Active
## 3 139110 IT 103. 40 Male AH2 Active
## 4 128534 IT 103. 54 Female B01 Active
## 5 118784 IT 102. 21 Female AH2 Active
## 6 130977 IT 99.5 44 Male AH2 Active
## 7 154657 IT 96.0 57 Female AH2 Active
## 8 114776 IT 95.5 39 Male AH2 Active
## 9 150997 IT 93.5 24 Male AH2 Active
## 10 169159 IT 91.0 59 Non-binary AH2 Active
## # … with 20 more rows
2. Sort by Department (ascending) and Salary (descending):
dat1 %>%
arrange(department,desc(salary_k))## # A tibble: 30 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 122418 HR 62.9 28 Female AH2 Active
## 2 105305 HR 61.5 20 Female B01 Active
## 3 136118 HR 61.4 47 Female B01 Termed
## 4 104586 HR 60.1 57 Male B01 Active
## 5 138211 HR 60.0 62 Female B01 Active
## 6 140606 HR 59.8 44 Female AH2 Active
## 7 130070 HR 59.4 42 Female AH2 Active
## 8 107821 HR 54.4 60 Female AH2 Active
## 9 155878 HR 53.6 63 Female AH2 Active
## 10 151590 HR 49.8 62 Male AH2 Active
## # … with 20 more rows
We can use select() to select a subset of columns from our dataset. It can also be used to re-order columns based on the order in which you specify them to be selected and/or to rename them. Also, as with slice(), there are a number of helper functions.
1. Select only the gender and salary variables, in that order:
dat1 %>%
select(gender, salary_k)## # A tibble: 30 × 2
## gender salary_k
## <chr> <dbl>
## 1 Female 60.0
## 2 Female 53.6
## 3 Female 54.4
## 4 Female 61.5
## 5 Female 61.4
## 6 Male 49.8
## 7 Male 60.1
## 8 Female 62.9
## 9 Female 59.8
## 10 Female 59.4
## # … with 20 more rows
2. Select everything except gender:
dat1 %>%
select(!gender)## # A tibble: 30 × 6
## emp_id department salary_k age business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr>
## 1 138211 HR 60.0 62 B01 Active
## 2 155878 HR 53.6 63 AH2 Active
## 3 107821 HR 54.4 60 AH2 Active
## 4 105305 HR 61.5 20 B01 Active
## 5 136118 HR 61.4 47 B01 Termed
## 6 151590 HR 49.8 62 AH2 Active
## 7 104586 HR 60.1 57 B01 Active
## 8 122418 HR 62.9 28 AH2 Active
## 9 140606 HR 59.8 44 AH2 Active
## 10 130070 HR 59.4 42 AH2 Active
## # … with 20 more rows
3. Select the columns of department through age:
dat1 %>%
select(gender:age)## # A tibble: 30 × 2
## gender age
## <chr> <int>
## 1 Female 62
## 2 Female 63
## 3 Female 60
## 4 Female 20
## 5 Female 47
## 6 Male 62
## 7 Male 57
## 8 Female 28
## 9 Female 44
## 10 Female 42
## # … with 20 more rows
4. Select employee_id & emp_status renamed as just “status”:
dat1 %>%
select(emp_id, status = emp_status)## # A tibble: 30 × 2
## emp_id status
## <int> <chr>
## 1 138211 Active
## 2 155878 Active
## 3 107821 Active
## 4 105305 Active
## 5 136118 Termed
## 6 151590 Active
## 7 104586 Active
## 8 122418 Active
## 9 140606 Active
## 10 130070 Active
## # … with 20 more rows
dat1 %>%
select(ends_with("id"))## # A tibble: 30 × 2
## emp_id business_unit_id
## <int> <chr>
## 1 138211 B01
## 2 155878 AH2
## 3 107821 AH2
## 4 105305 B01
## 5 136118 B01
## 6 151590 AH2
## 7 104586 B01
## 8 122418 AH2
## 9 140606 AH2
## 10 130070 AH2
## # … with 20 more rows
dat1 %>%
select(contains("id"))## # A tibble: 30 × 2
## emp_id business_unit_id
## <int> <chr>
## 1 138211 B01
## 2 155878 AH2
## 3 107821 AH2
## 4 105305 B01
## 5 136118 B01
## 6 151590 AH2
## 7 104586 B01
## 8 122418 AH2
## 9 140606 AH2
## 10 130070 AH2
## # … with 20 more rows
dat1 %>%
select(starts_with("emp"))## # A tibble: 30 × 2
## emp_id emp_status
## <int> <chr>
## 1 138211 Active
## 2 155878 Active
## 3 107821 Active
## 4 105305 Active
## 5 136118 Termed
## 6 151590 Active
## 7 104586 Active
## 8 122418 Active
## 9 140606 Active
## 10 130070 Active
## # … with 20 more rows
dat1 %>%
select(any_of(c("emp_id","salary_k","org_category","manager")))## # A tibble: 30 × 2
## emp_id salary_k
## <int> <dbl>
## 1 138211 60.0
## 2 155878 53.6
## 3 107821 54.4
## 4 105305 61.5
## 5 136118 61.4
## 6 151590 49.8
## 7 104586 60.1
## 8 122418 62.9
## 9 140606 59.8
## 10 130070 59.4
## # … with 20 more rows
We can use mutate() to create a new column or re-write an existing one. We can either add the new or re-written columns to the dataset, or we can select only the new/re-written columns using transmute(). Lastly, relocate() can be used to re-position columns. I’ve included it in this section because I often use it in conjunction with creating/over-writing variables, as by default mutated variables are placed at the end of the dataset.
1. Create a new log-transformed salary variable:
dat1 %>%
mutate(log_salary = log(salary_k))## # A tibble: 30 × 8
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 138211 HR 60.0 62 Female B01 Active
## 2 155878 HR 53.6 63 Female AH2 Active
## 3 107821 HR 54.4 60 Female AH2 Active
## 4 105305 HR 61.5 20 Female B01 Active
## 5 136118 HR 61.4 47 Female B01 Termed
## 6 151590 HR 49.8 62 Male AH2 Active
## 7 104586 HR 60.1 57 Male B01 Active
## 8 122418 HR 62.9 28 Female AH2 Active
## 9 140606 HR 59.8 44 Female AH2 Active
## 10 130070 HR 59.4 42 Female AH2 Active
## # … with 20 more rows, and 1 more variable: log_salary <dbl>
2. Over-write the gender variable as a lower-case version:
dat1 %>%
mutate(gender = tolower(gender)) ## # A tibble: 30 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 138211 HR 60.0 62 female B01 Active
## 2 155878 HR 53.6 63 female AH2 Active
## 3 107821 HR 54.4 60 female AH2 Active
## 4 105305 HR 61.5 20 female B01 Active
## 5 136118 HR 61.4 47 female B01 Termed
## 6 151590 HR 49.8 62 male AH2 Active
## 7 104586 HR 60.1 57 male B01 Active
## 8 122418 HR 62.9 28 female AH2 Active
## 9 140606 HR 59.8 44 female AH2 Active
## 10 130070 HR 59.4 42 female AH2 Active
## # … with 20 more rows
3. Create a new scaled version of age and gender return only those new variables:
dat1 %>%
transmute(age_scaled = scale(age),
salary_scaled = scale(salary_k))## # A tibble: 30 × 2
## age_scaled[,1] salary_scaled[,1]
## <dbl> <dbl>
## 1 1.33 -0.243
## 2 1.40 -0.479
## 3 1.20 -0.448
## 4 -1.42 -0.190
## 5 0.350 -0.194
## 6 1.33 -0.620
## 7 1.01 -0.242
## 8 -0.897 -0.138
## 9 0.153 -0.251
## 10 0.0219 -0.266
## # … with 20 more rows
4. Create a scaled version of age and relocate the new variable before the existing age variable:
dat1 %>%
mutate(age_scaled = scale(age)) %>%
relocate(age_scaled, .before = age)## # A tibble: 30 × 8
## emp_id department salary_k age_scaled[,1] age gender business_unit_id
## <int> <chr> <dbl> <dbl> <int> <chr> <chr>
## 1 138211 HR 60.0 1.33 62 Female B01
## 2 155878 HR 53.6 1.40 63 Female AH2
## 3 107821 HR 54.4 1.20 60 Female AH2
## 4 105305 HR 61.5 -1.42 20 Female B01
## 5 136118 HR 61.4 0.350 47 Female B01
## 6 151590 HR 49.8 1.33 62 Male AH2
## 7 104586 HR 60.1 1.01 57 Male B01
## 8 122418 HR 62.9 -0.897 28 Female AH2
## 9 140606 HR 59.8 0.153 44 Female AH2
## 10 130070 HR 59.4 0.0219 42 Female AH2
## # … with 20 more rows, and 1 more variable: emp_status <chr>
We can use distinct() and n_distinct() to return the distinct records/values of a dataframe/variable, or to count the number of distinct values/records.
Note: The distinct() function takes in a dataframe as input, while n_distinct() takes in vectors. This impacts how each can be used, as shown below.
1. Return the distinct department values:
dat1 %>%
distinct(department)## # A tibble: 3 × 1
## department
## <chr>
## 1 HR
## 2 IT
## 3 Sales
2. Return the distinct combinations of department & employee status:
dat1 %>%
distinct(department, emp_status)## # A tibble: 5 × 2
## department emp_status
## <chr> <chr>
## 1 HR Active
## 2 HR Termed
## 3 IT Active
## 4 Sales Active
## 5 Sales Termed
3. Return the records that are distinct:
dat1 %>%
distinct()## # A tibble: 30 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 138211 HR 60.0 62 Female B01 Active
## 2 155878 HR 53.6 63 Female AH2 Active
## 3 107821 HR 54.4 60 Female AH2 Active
## 4 105305 HR 61.5 20 Female B01 Active
## 5 136118 HR 61.4 47 Female B01 Termed
## 6 151590 HR 49.8 62 Male AH2 Active
## 7 104586 HR 60.1 57 Male B01 Active
## 8 122418 HR 62.9 28 Female AH2 Active
## 9 140606 HR 59.8 44 Female AH2 Active
## 10 130070 HR 59.4 42 Female AH2 Active
## # … with 20 more rows
To illustrate how n_distinct() takes in vectors instead of dataframes, let’s see what happens when we try and use our usual piping:
1. Count the number of distinct gender identities:
dat1 %>%
n_distinct(gender)## Error in list2(...): object 'gender' not found
Instead, we have two main options.
dat1 %>%
summarize(n_genders = n_distinct(gender))## # A tibble: 1 × 1
## n_genders
## <int>
## 1 3
dat1 %>%
mutate(n_genders = n_distinct(gender))## # A tibble: 30 × 8
## emp_id department salary_k age gender business_unit_id emp_status n_genders
## <int> <chr> <dbl> <int> <chr> <chr> <chr> <int>
## 1 138211 HR 60.0 62 Female B01 Active 3
## 2 155878 HR 53.6 63 Female AH2 Active 3
## 3 107821 HR 54.4 60 Female AH2 Active 3
## 4 105305 HR 61.5 20 Female B01 Active 3
## 5 136118 HR 61.4 47 Female B01 Termed 3
## 6 151590 HR 49.8 62 Male AH2 Active 3
## 7 104586 HR 60.1 57 Male B01 Active 3
## 8 122418 HR 62.9 28 Female AH2 Active 3
## 9 140606 HR 59.8 44 Female AH2 Active 3
## 10 130070 HR 59.4 42 Female AH2 Active 3
## # … with 20 more rows
n_distinct(dat1$gender)## [1] 3
We can use rename() to rename columns in our data. We can also use rename_with() to rename the columns with another function, like one that converts the names to upper or lower case for example.
1. Rename salary_k as ‘salary_thousands’, and emp_id as ‘id’:
dat1 %>%
rename(salary_thousands = salary_k,
id = emp_id)## # A tibble: 30 × 7
## id department salary_thousands age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 138211 HR 60.0 62 Female B01 Active
## 2 155878 HR 53.6 63 Female AH2 Active
## 3 107821 HR 54.4 60 Female AH2 Active
## 4 105305 HR 61.5 20 Female B01 Active
## 5 136118 HR 61.4 47 Female B01 Termed
## 6 151590 HR 49.8 62 Male AH2 Active
## 7 104586 HR 60.1 57 Male B01 Active
## 8 122418 HR 62.9 28 Female AH2 Active
## 9 140606 HR 59.8 44 Female AH2 Active
## 10 130070 HR 59.4 42 Female AH2 Active
## # … with 20 more rows
1: Convert column names to upper case:
dat1 %>%
rename_with(toupper)## # A tibble: 30 × 7
## EMP_ID DEPARTMENT SALARY_K AGE GENDER BUSINESS_UNIT_ID EMP_STATUS
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 138211 HR 60.0 62 Female B01 Active
## 2 155878 HR 53.6 63 Female AH2 Active
## 3 107821 HR 54.4 60 Female AH2 Active
## 4 105305 HR 61.5 20 Female B01 Active
## 5 136118 HR 61.4 47 Female B01 Termed
## 6 151590 HR 49.8 62 Male AH2 Active
## 7 104586 HR 60.1 57 Male B01 Active
## 8 122418 HR 62.9 28 Female AH2 Active
## 9 140606 HR 59.8 44 Female AH2 Active
## 10 130070 HR 59.4 42 Female AH2 Active
## # … with 20 more rows
2: Convert column names ending in “id” to upper case:
dat1 %>%
rename_with(toupper, ends_with("id"))## # A tibble: 30 × 7
## EMP_ID department salary_k age gender BUSINESS_UNIT_ID emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 138211 HR 60.0 62 Female B01 Active
## 2 155878 HR 53.6 63 Female AH2 Active
## 3 107821 HR 54.4 60 Female AH2 Active
## 4 105305 HR 61.5 20 Female B01 Active
## 5 136118 HR 61.4 47 Female B01 Termed
## 6 151590 HR 49.8 62 Male AH2 Active
## 7 104586 HR 60.1 57 Male B01 Active
## 8 122418 HR 62.9 28 Female AH2 Active
## 9 140606 HR 59.8 44 Female AH2 Active
## 10 130070 HR 59.4 42 Female AH2 Active
## # … with 20 more rows
**3: Rename column names by replacing “_” with “.”:**
dat1 %>%
rename_with(~gsub(pattern = "_",replacement = ".", x = .x))## # A tibble: 30 × 7
## emp.id department salary.k age gender business.unit.id emp.status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 138211 HR 60.0 62 Female B01 Active
## 2 155878 HR 53.6 63 Female AH2 Active
## 3 107821 HR 54.4 60 Female AH2 Active
## 4 105305 HR 61.5 20 Female B01 Active
## 5 136118 HR 61.4 47 Female B01 Termed
## 6 151590 HR 49.8 62 Male AH2 Active
## 7 104586 HR 60.1 57 Male B01 Active
## 8 122418 HR 62.9 28 Female AH2 Active
## 9 140606 HR 59.8 44 Female AH2 Active
## 10 130070 HR 59.4 42 Female AH2 Active
## # … with 20 more rows
We can also chain manipulation steps/functions together using multiple pipe operators. We saw an example of this above when we created a new age variable and then re-positioned the new column. The possibilities for using chained data manipulation pipelines is endless, but here are a few simpler examples:
1. Filter for Male employees in the IT department and sort by salary from highest to lowest:
dat1 %>%
filter(department=="IT" & gender=="Male") %>%
arrange(desc(salary_k))## # A tibble: 5 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 150894 IT 112. 35 Male B01 Active
## 2 139110 IT 103. 40 Male AH2 Active
## 3 130977 IT 99.5 44 Male AH2 Active
## 4 114776 IT 95.5 39 Male AH2 Active
## 5 150997 IT 93.5 24 Male AH2 Active
2. Sort employees by salary from highest to lowest and then select the employees with the top 3 salaries:
dat1 %>%
arrange(desc(salary_k)) %>%
slice(1:3)## # A tibble: 3 × 7
## emp_id department salary_k age gender business_unit_id emp_status
## <int> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 150894 IT 112. 35 Male B01 Active
## 2 116113 IT 104. 20 Female AH2 Active
## 3 139110 IT 103. 40 Male AH2 Active
3. Create a proposed salary increase of 15%, pull the rows associated with the top 3 proposed salaries, and then select only the employee id, current salary, and proposed salary variables for those employees:
dat1 %>%
mutate(salary_proposed = round(salary_k*1.15, 2)) %>%
slice_max(order_by = salary_proposed, n = 3) %>%
select(emp_id, salary_k, salary_proposed)## # A tibble: 3 × 3
## emp_id salary_k salary_proposed
## <int> <dbl> <dbl>
## 1 150894 112. 129.
## 2 116113 104. 120.
## 3 139110 103. 119.
4. Filter for only those employees who make less than 80% of the overall average salary, count the gender frequencies for those employees, and then create a new column converting the gender count into a percentage:
dat1 %>%
filter(salary_k < mean(salary_k)*.8) %>%
count(gender, name = "employee_count") %>%
mutate(pct_emp_count = round(employee_count/sum(employee_count),2))## # A tibble: 2 × 3
## gender employee_count pct_emp_count
## <chr> <int> <dbl>
## 1 Female 5 0.71
## 2 Male 2 0.29
We can use group_by() to group our data by one or more variables, and then perform data manipulation steps at the group-level. When doing so, we can either use summarize() to compute a group-level summary statistic as our output, or we can use mutate() to add the group-level variable(s) alongside our pre-existing non-grouped data. Lastly, we can use ungroup() to un-group our data if we wish to be able perform subsequent operations that are not at the group level.
1. Group employees by gender and calculate the avg salary:
dat1 %>%
group_by(gender) %>%
summarize(avg_salary = mean(salary_k))## # A tibble: 3 × 2
## gender avg_salary
## <chr> <dbl>
## 1 Female 59.5
## 2 Male 79.6
## 3 Non-binary 76.6
2. Group employees by department and calculate the the proportion of employees in each department that identify as Males:
dat1 %>%
group_by(department) %>%
summarize(prop_male = mean(gender=="Male"))## # A tibble: 3 × 2
## department prop_male
## <chr> <dbl>
## 1 HR 0.2
## 2 IT 0.5
## 3 Sales 0.2
1. Group by gender and calculate the average age, but capture this as a new variable added to the existing non-grouped dataset:
dat1 %>%
group_by(gender) %>%
mutate(avg_age = round(mean(age),2)) %>%
relocate(avg_age,.before = age)## # A tibble: 30 × 8
## # Groups: gender [3]
## emp_id department salary_k avg_age age gender business_unit_id emp_status
## <int> <chr> <dbl> <dbl> <int> <chr> <chr> <chr>
## 1 138211 HR 60.0 41 62 Female B01 Active
## 2 155878 HR 53.6 41 63 Female AH2 Active
## 3 107821 HR 54.4 41 60 Female AH2 Active
## 4 105305 HR 61.5 41 20 Female B01 Active
## 5 136118 HR 61.4 41 47 Female B01 Termed
## 6 151590 HR 49.8 43.2 62 Male AH2 Active
## 7 104586 HR 60.1 43.2 57 Male B01 Active
## 8 122418 HR 62.9 41 28 Female AH2 Active
## 9 140606 HR 59.8 41 44 Female AH2 Active
## 10 130070 HR 59.4 41 42 Female AH2 Active
## # … with 20 more rows
1. Calculate the avg salary by gender and department:
dat1 %>%
group_by(gender, department) %>%
summarize(avg_salary = round(mean(salary_k),2))## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
## # A tibble: 8 × 3
## # Groups: gender [3]
## gender department avg_salary
## <chr> <chr> <dbl>
## 1 Female HR 59.1
## 2 Female IT 101.
## 3 Female Sales 36.2
## 4 Male HR 54.9
## 5 Male IT 101.
## 6 Male Sales 51.5
## 7 Non-binary IT 91.0
## 8 Non-binary Sales 62.1
1. Calculate the avg age by gender as a new variable added to the non-grouped data, and then create another variable representing the overall avg age:
dat1 %>%
group_by(gender) %>%
mutate(avg_age_grped = round(mean(age),2)) %>%
ungroup() %>%
mutate(avg_age_overall = round(mean(age),2)) %>%
relocate(c(avg_age_overall, avg_age_grped),.before = age) ## # A tibble: 30 × 9
## emp_id department salary_k avg_age_overall avg_age_grped age gender
## <int> <chr> <dbl> <dbl> <dbl> <int> <chr>
## 1 138211 HR 60.0 41.7 41 62 Female
## 2 155878 HR 53.6 41.7 41 63 Female
## 3 107821 HR 54.4 41.7 41 60 Female
## 4 105305 HR 61.5 41.7 41 20 Female
## 5 136118 HR 61.4 41.7 41 47 Female
## 6 151590 HR 49.8 41.7 43.2 62 Male
## 7 104586 HR 60.1 41.7 43.2 57 Male
## 8 122418 HR 62.9 41.7 41 28 Female
## 9 140606 HR 59.8 41.7 41 44 Female
## 10 130070 HR 59.4 41.7 41 42 Female
## # … with 20 more rows, and 2 more variables: business_unit_id <chr>,
## # emp_status <chr>
1. Calculate the avg salary by a newly-created age bucket that is a function of the existing age variable:
dat1 %>%
group_by(age_bucket = cut(age,2)) %>%
summarize(avg_salary = round(mean(salary_k),2))## # A tibble: 2 × 2
## age_bucket avg_salary
## <fct> <dbl>
## 1 (20,41.5] 67.7
## 2 (41.5,63] 65.7
We can use count() to quickly create crosstabs/frequencies of data values by single or multiple variables.
1. Tabulate the number of employees/records in each business unit:
dat1 %>%
count(business_unit_id)## # A tibble: 2 × 2
## business_unit_id n
## <chr> <int>
## 1 AH2 19
## 2 B01 11
2. Create a crosstab counting the number of employees/records in unique combination of business unit and department:
dat1 %>%
count(business_unit_id,department)## # A tibble: 6 × 3
## business_unit_id department n
## <chr> <chr> <int>
## 1 AH2 HR 6
## 2 AH2 IT 8
## 3 AH2 Sales 5
## 4 B01 HR 4
## 5 B01 IT 2
## 6 B01 Sales 5
3. Crosstab the number of employees by salary bracket and sort results by largest number of employees:
dat1 %>%
count(cut(salary_k, breaks = 3), sort = T) %>%
rename(salary_bucket = `cut(salary_k, breaks = 3)`)## # A tibble: 3 × 2
## salary_bucket n
## <fct> <int>
## 1 (45.3,78.8] 14
## 2 (78.8,112] 10
## 3 (11.7,45.3] 6
Note: When used on factors, count() has a .drop = FALSE argument default that will cause empty factor levels to be dropped. This argument can be switched to .drop = T,
Let’s take a look at two quick examples. First, let’s convert our gender variable to a factor and add in a new level that no employees actually belong to:
dat1$gender <- factor(dat1$gender,
levels = c("Female","Male",
"Non-binary","Declined"))
levels(dat1$gender)## [1] "Female" "Male" "Non-binary" "Declined"
1. Count the number of employees by gender grouping:
dat1 %>%
count(gender)## # A tibble: 3 × 2
## gender n
## <fct> <int>
## 1 Female 19
## 2 Male 9
## 3 Non-binary 2
2. Count the number of employees by gender, INCLUDING the gender groups that are empty in our dataset:
dat1 %>%
count(gender, .drop = FALSE)## # A tibble: 4 × 2
## gender n
## <fct> <int>
## 1 Female 19
## 2 Male 9
## 3 Non-binary 2
## 4 Declined 0
Thanks for reading this tutorial!