Fundamentals of Data Manipulation with Dplyr

Gordon Goodwin

Foundations of Data Wrangling with dplyr

This tutorial will cover the essential basic dplyr functions that create the foundation for efficient data wrangling.

Sample Data

Let’s first create a sample dataset consisting of some basic employee data.

set.seed(999)

dat1 <- tibble(emp_id = sample(100000:200000,
                               size = 30),
               department = c(rep("HR",10),
                              rep("IT",10),
                              rep("Sales",10)),
               salary_k = round(c(rnorm(10,60,5),
                                  rnorm(10,100,7),
                                  rnorm(10,40,20)),2),
               age = sample(20:66,30,T),
               gender = sample(c("Male","Female","Non-binary"),30,T,
                               prob = c(0.45,0.45,0.1)),
               business_unit_id = sample(c("B01","AH2"),30,T),
               emp_status = sample(c("Active","Termed"),30,T,
                                   prob = c(0.8,0.2)))

head(dat1)
## # A tibble: 6 × 7
##   emp_id department salary_k   age gender business_unit_id emp_status
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
## 1 138211 HR             60.0    62 Female B01              Active    
## 2 155878 HR             53.6    63 Female AH2              Active    
## 3 107821 HR             54.4    60 Female AH2              Active    
## 4 105305 HR             61.5    20 Female B01              Active    
## 5 136118 HR             61.4    47 Female B01              Termed    
## 6 151590 HR             49.8    62 Male   AH2              Active

Filter

We can use filter() to select rows based on conditions. We can use a single condition, multiple conditions, or either/or conditions.

Single Condition Filters

1. Filter for employees over the age of 60:

dat1 %>% 
  filter(age > 60)
## # A tibble: 5 × 7
##   emp_id department salary_k   age gender business_unit_id emp_status
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
## 1 138211 HR             60.0    62 Female B01              Active    
## 2 155878 HR             53.6    63 Female AH2              Active    
## 3 151590 HR             49.8    62 Male   AH2              Active    
## 4 187244 Sales          39.6    63 Male   AH2              Active    
## 5 140579 Sales          37.9    62 Female B01              Active

2. Filter for employees who identify as Male:

dat1 %>% 
  filter(gender=="Male")
## # A tibble: 9 × 7
##   emp_id department salary_k   age gender business_unit_id emp_status
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
## 1 151590 HR             49.8    62 Male   AH2              Active    
## 2 104586 HR             60.1    57 Male   B01              Active    
## 3 114776 IT             95.5    39 Male   AH2              Active    
## 4 150894 IT            112.     35 Male   B01              Active    
## 5 130977 IT             99.5    44 Male   AH2              Active    
## 6 139110 IT            103.     40 Male   AH2              Active    
## 7 150997 IT             93.5    24 Male   AH2              Active    
## 8 161782 Sales          63.3    25 Male   AH2              Active    
## 9 187244 Sales          39.6    63 Male   AH2              Active

Multiple Condition Filters

1. Filter for employees who are either belong to the IT department OR have a salary of over $80k:

dat1 %>% 
  filter(department=="IT" | salary_k > 80)
## # A tibble: 10 × 7
##    emp_id department salary_k   age gender     business_unit_id emp_status
##     <int> <chr>         <dbl> <int> <chr>      <chr>            <chr>     
##  1 114776 IT             95.5    39 Male       AH2              Active    
##  2 150894 IT            112.     35 Male       B01              Active    
##  3 128534 IT            103.     54 Female     B01              Active    
##  4 130977 IT             99.5    44 Male       AH2              Active    
##  5 118784 IT            102.     21 Female     AH2              Active    
##  6 116113 IT            104.     20 Female     AH2              Active    
##  7 169159 IT             91.0    59 Non-binary AH2              Active    
##  8 139110 IT            103.     40 Male       AH2              Active    
##  9 154657 IT             96.0    57 Female     AH2              Active    
## 10 150997 IT             93.5    24 Male       AH2              Active

2. Filter for employees who are Female AND are in the Sales department:

dat1 %>% 
  filter(gender=="Female" & department=="Sales")
## # A tibble: 7 × 7
##   emp_id department salary_k   age gender business_unit_id emp_status
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
## 1 132449 Sales          60.8    28 Female B01              Active    
## 2 186174 Sales          17.1    34 Female AH2              Active    
## 3 195322 Sales          11.8    29 Female B01              Active    
## 4 112459 Sales          34.4    39 Female AH2              Termed    
## 5 131186 Sales          31.6    24 Female B01              Active    
## 6 181702 Sales          59.9    45 Female AH2              Active    
## 7 140579 Sales          37.9    62 Female B01              Active

Slice

We can use slice() to “filter” (bad choice of words), or index, rows based on their position. There are also several helper functions that speed up the process and/or allow for more complex manipulation goals.

Regular Slice

1. Pull data for the employees in the first 5 rows of our dataset:

dat1 %>% 
  slice(1:5)
## # A tibble: 5 × 7
##   emp_id department salary_k   age gender business_unit_id emp_status
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
## 1 138211 HR             60.0    62 Female B01              Active    
## 2 155878 HR             53.6    63 Female AH2              Active    
## 3 107821 HR             54.4    60 Female AH2              Active    
## 4 105305 HR             61.5    20 Female B01              Active    
## 5 136118 HR             61.4    47 Female B01              Termed

2. Pull data for the 3rd and 8th employee:

dat1 %>% 
  slice(c(3,8))
## # A tibble: 2 × 7
##   emp_id department salary_k   age gender business_unit_id emp_status
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
## 1 107821 HR             54.4    60 Female AH2              Active    
## 2 122418 HR             62.9    28 Female AH2              Active

Slice Helpers

  1. slice_head() - Select the top 7 rows:
dat1 %>% 
  slice_head(n = 3)
## # A tibble: 3 × 7
##   emp_id department salary_k   age gender business_unit_id emp_status
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
## 1 138211 HR             60.0    62 Female B01              Active    
## 2 155878 HR             53.6    63 Female AH2              Active    
## 3 107821 HR             54.4    60 Female AH2              Active
  1. slice_tail() - Select the bottom 4 rows:
dat1 %>% 
  slice_tail(n = 4)
## # A tibble: 4 × 7
##   emp_id department salary_k   age gender business_unit_id emp_status
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
## 1 112459 Sales          34.4    39 Female AH2              Termed    
## 2 131186 Sales          31.6    24 Female B01              Active    
## 3 181702 Sales          59.9    45 Female AH2              Active    
## 4 140579 Sales          37.9    62 Female B01              Active
  1. slice_max() - Select the rows with the 3 highest salaries:
dat1 %>% 
  slice_max(order_by = salary_k, n = 3)
## # A tibble: 3 × 7
##   emp_id department salary_k   age gender business_unit_id emp_status
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
## 1 150894 IT             112.    35 Male   B01              Active    
## 2 116113 IT             104.    20 Female AH2              Active    
## 3 139110 IT             103.    40 Male   AH2              Active
  1. slice_min() - Select the rows with the lowest age:
dat1 %>% 
  slice_min(order_by = age, n = 1)
## # A tibble: 2 × 7
##   emp_id department salary_k   age gender business_unit_id emp_status
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
## 1 105305 HR             61.5    20 Female B01              Active    
## 2 116113 IT            104.     20 Female AH2              Active
  1. slice_sample() - Randomly select 5 rows:
dat1 %>% 
  slice_sample(n = 5)
## # A tibble: 5 × 7
##   emp_id department salary_k   age gender     business_unit_id emp_status
##    <int> <chr>         <dbl> <int> <chr>      <chr>            <chr>     
## 1 195322 Sales          11.8    29 Female     B01              Active    
## 2 161782 Sales          63.3    25 Male       AH2              Active    
## 3 136118 HR             61.4    47 Female     B01              Termed    
## 4 181702 Sales          59.9    45 Female     AH2              Active    
## 5 169159 IT             91.0    59 Non-binary AH2              Active
  1. slice_sample() - Randomly select 20% of the rows in the dataset:
dat1 %>% 
  slice_sample(prop = .2)
## # A tibble: 6 × 7
##   emp_id department salary_k   age gender     business_unit_id emp_status
##    <int> <chr>         <dbl> <int> <chr>      <chr>            <chr>     
## 1 130070 HR             59.4    42 Female     AH2              Active    
## 2 139110 IT            103.     40 Male       AH2              Active    
## 3 169159 IT             91.0    59 Non-binary AH2              Active    
## 4 187244 Sales          39.6    63 Male       AH2              Active    
## 5 132449 Sales          60.8    28 Female     B01              Active    
## 6 181702 Sales          59.9    45 Female     AH2              Active

Note: With slice_sample(), the replace = TRUE argument can be specified to perform a bootstrapped sample.


Arrange

We can use arrange() to sort rows by a single variable or by multiple variables in either ascending or descending order. This can also be used for character/categorical variables, not just numerics.

Single-Variable Sorting

1. Sort employees by age in ascending order:

dat1 %>% 
  arrange(age)
## # A tibble: 30 × 7
##    emp_id department salary_k   age gender     business_unit_id emp_status
##     <int> <chr>         <dbl> <int> <chr>      <chr>            <chr>     
##  1 105305 HR             61.5    20 Female     B01              Active    
##  2 116113 IT            104.     20 Female     AH2              Active    
##  3 118784 IT            102.     21 Female     AH2              Active    
##  4 156989 Sales          62.1    23 Non-binary B01              Active    
##  5 150997 IT             93.5    24 Male       AH2              Active    
##  6 131186 Sales          31.6    24 Female     B01              Active    
##  7 161782 Sales          63.3    25 Male       AH2              Active    
##  8 122418 HR             62.9    28 Female     AH2              Active    
##  9 132449 Sales          60.8    28 Female     B01              Active    
## 10 195322 Sales          11.8    29 Female     B01              Active    
## # … with 20 more rows

2. Sort employees by department in descending (reverse-alphabetical) order:

dat1 %>% 
  arrange(desc(department))
## # A tibble: 30 × 7
##    emp_id department salary_k   age gender     business_unit_id emp_status
##     <int> <chr>         <dbl> <int> <chr>      <chr>            <chr>     
##  1 161782 Sales          63.3    25 Male       AH2              Active    
##  2 132449 Sales          60.8    28 Female     B01              Active    
##  3 156989 Sales          62.1    23 Non-binary B01              Active    
##  4 187244 Sales          39.6    63 Male       AH2              Active    
##  5 186174 Sales          17.1    34 Female     AH2              Active    
##  6 195322 Sales          11.8    29 Female     B01              Active    
##  7 112459 Sales          34.4    39 Female     AH2              Termed    
##  8 131186 Sales          31.6    24 Female     B01              Active    
##  9 181702 Sales          59.9    45 Female     AH2              Active    
## 10 140579 Sales          37.9    62 Female     B01              Active    
## # … with 20 more rows

Multi-variable Sorting

1. Sort by both Salary (descending) and Age (ascending):

dat1 %>% 
  arrange(desc(salary_k),age)
## # A tibble: 30 × 7
##    emp_id department salary_k   age gender     business_unit_id emp_status
##     <int> <chr>         <dbl> <int> <chr>      <chr>            <chr>     
##  1 150894 IT            112.     35 Male       B01              Active    
##  2 116113 IT            104.     20 Female     AH2              Active    
##  3 139110 IT            103.     40 Male       AH2              Active    
##  4 128534 IT            103.     54 Female     B01              Active    
##  5 118784 IT            102.     21 Female     AH2              Active    
##  6 130977 IT             99.5    44 Male       AH2              Active    
##  7 154657 IT             96.0    57 Female     AH2              Active    
##  8 114776 IT             95.5    39 Male       AH2              Active    
##  9 150997 IT             93.5    24 Male       AH2              Active    
## 10 169159 IT             91.0    59 Non-binary AH2              Active    
## # … with 20 more rows

2. Sort by Department (ascending) and Salary (descending):

dat1 %>% 
  arrange(department,desc(salary_k))
## # A tibble: 30 × 7
##    emp_id department salary_k   age gender business_unit_id emp_status
##     <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
##  1 122418 HR             62.9    28 Female AH2              Active    
##  2 105305 HR             61.5    20 Female B01              Active    
##  3 136118 HR             61.4    47 Female B01              Termed    
##  4 104586 HR             60.1    57 Male   B01              Active    
##  5 138211 HR             60.0    62 Female B01              Active    
##  6 140606 HR             59.8    44 Female AH2              Active    
##  7 130070 HR             59.4    42 Female AH2              Active    
##  8 107821 HR             54.4    60 Female AH2              Active    
##  9 155878 HR             53.6    63 Female AH2              Active    
## 10 151590 HR             49.8    62 Male   AH2              Active    
## # … with 20 more rows

Select

We can use select() to select a subset of columns from our dataset. It can also be used to re-order columns based on the order in which you specify them to be selected and/or to rename them. Also, as with slice(), there are a number of helper functions.

Simple Selects (No Helpers)

1. Select only the gender and salary variables, in that order:

dat1 %>% 
  select(gender, salary_k)
## # A tibble: 30 × 2
##    gender salary_k
##    <chr>     <dbl>
##  1 Female     60.0
##  2 Female     53.6
##  3 Female     54.4
##  4 Female     61.5
##  5 Female     61.4
##  6 Male       49.8
##  7 Male       60.1
##  8 Female     62.9
##  9 Female     59.8
## 10 Female     59.4
## # … with 20 more rows

2. Select everything except gender:

dat1 %>% 
  select(!gender)
## # A tibble: 30 × 6
##    emp_id department salary_k   age business_unit_id emp_status
##     <int> <chr>         <dbl> <int> <chr>            <chr>     
##  1 138211 HR             60.0    62 B01              Active    
##  2 155878 HR             53.6    63 AH2              Active    
##  3 107821 HR             54.4    60 AH2              Active    
##  4 105305 HR             61.5    20 B01              Active    
##  5 136118 HR             61.4    47 B01              Termed    
##  6 151590 HR             49.8    62 AH2              Active    
##  7 104586 HR             60.1    57 B01              Active    
##  8 122418 HR             62.9    28 AH2              Active    
##  9 140606 HR             59.8    44 AH2              Active    
## 10 130070 HR             59.4    42 AH2              Active    
## # … with 20 more rows

3. Select the columns of department through age:

dat1 %>% 
  select(gender:age)
## # A tibble: 30 × 2
##    gender   age
##    <chr>  <int>
##  1 Female    62
##  2 Female    63
##  3 Female    60
##  4 Female    20
##  5 Female    47
##  6 Male      62
##  7 Male      57
##  8 Female    28
##  9 Female    44
## 10 Female    42
## # … with 20 more rows

4. Select employee_id & emp_status renamed as just “status”:

dat1 %>% 
  select(emp_id, status = emp_status)
## # A tibble: 30 × 2
##    emp_id status
##     <int> <chr> 
##  1 138211 Active
##  2 155878 Active
##  3 107821 Active
##  4 105305 Active
##  5 136118 Termed
##  6 151590 Active
##  7 104586 Active
##  8 122418 Active
##  9 140606 Active
## 10 130070 Active
## # … with 20 more rows

Select Helpers

  1. ends_with(): Select columns ending with “id”:
dat1 %>% 
  select(ends_with("id"))
## # A tibble: 30 × 2
##    emp_id business_unit_id
##     <int> <chr>           
##  1 138211 B01             
##  2 155878 AH2             
##  3 107821 AH2             
##  4 105305 B01             
##  5 136118 B01             
##  6 151590 AH2             
##  7 104586 B01             
##  8 122418 AH2             
##  9 140606 AH2             
## 10 130070 AH2             
## # … with 20 more rows
  1. contains(): Select columns containing “id”:
dat1 %>% 
  select(contains("id"))
## # A tibble: 30 × 2
##    emp_id business_unit_id
##     <int> <chr>           
##  1 138211 B01             
##  2 155878 AH2             
##  3 107821 AH2             
##  4 105305 B01             
##  5 136118 B01             
##  6 151590 AH2             
##  7 104586 B01             
##  8 122418 AH2             
##  9 140606 AH2             
## 10 130070 AH2             
## # … with 20 more rows
  1. start_with(): Select columns starting with “emp”:
dat1 %>% 
  select(starts_with("emp"))
## # A tibble: 30 × 2
##    emp_id emp_status
##     <int> <chr>     
##  1 138211 Active    
##  2 155878 Active    
##  3 107821 Active    
##  4 105305 Active    
##  5 136118 Termed    
##  6 151590 Active    
##  7 104586 Active    
##  8 122418 Active    
##  9 140606 Active    
## 10 130070 Active    
## # … with 20 more rows
  1. any_of(): Select columns matching any of a specified set:
dat1 %>% 
  select(any_of(c("emp_id","salary_k","org_category","manager")))
## # A tibble: 30 × 2
##    emp_id salary_k
##     <int>    <dbl>
##  1 138211     60.0
##  2 155878     53.6
##  3 107821     54.4
##  4 105305     61.5
##  5 136118     61.4
##  6 151590     49.8
##  7 104586     60.1
##  8 122418     62.9
##  9 140606     59.8
## 10 130070     59.4
## # … with 20 more rows

Mutate, Transmute, & Relocate

We can use mutate() to create a new column or re-write an existing one. We can either add the new or re-written columns to the dataset, or we can select only the new/re-written columns using transmute(). Lastly, relocate() can be used to re-position columns. I’ve included it in this section because I often use it in conjunction with creating/over-writing variables, as by default mutated variables are placed at the end of the dataset.

1. Create a new log-transformed salary variable:

dat1 %>% 
  mutate(log_salary = log(salary_k))
## # A tibble: 30 × 8
##    emp_id department salary_k   age gender business_unit_id emp_status
##     <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
##  1 138211 HR             60.0    62 Female B01              Active    
##  2 155878 HR             53.6    63 Female AH2              Active    
##  3 107821 HR             54.4    60 Female AH2              Active    
##  4 105305 HR             61.5    20 Female B01              Active    
##  5 136118 HR             61.4    47 Female B01              Termed    
##  6 151590 HR             49.8    62 Male   AH2              Active    
##  7 104586 HR             60.1    57 Male   B01              Active    
##  8 122418 HR             62.9    28 Female AH2              Active    
##  9 140606 HR             59.8    44 Female AH2              Active    
## 10 130070 HR             59.4    42 Female AH2              Active    
## # … with 20 more rows, and 1 more variable: log_salary <dbl>

2. Over-write the gender variable as a lower-case version:

dat1 %>% 
  mutate(gender = tolower(gender)) 
## # A tibble: 30 × 7
##    emp_id department salary_k   age gender business_unit_id emp_status
##     <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
##  1 138211 HR             60.0    62 female B01              Active    
##  2 155878 HR             53.6    63 female AH2              Active    
##  3 107821 HR             54.4    60 female AH2              Active    
##  4 105305 HR             61.5    20 female B01              Active    
##  5 136118 HR             61.4    47 female B01              Termed    
##  6 151590 HR             49.8    62 male   AH2              Active    
##  7 104586 HR             60.1    57 male   B01              Active    
##  8 122418 HR             62.9    28 female AH2              Active    
##  9 140606 HR             59.8    44 female AH2              Active    
## 10 130070 HR             59.4    42 female AH2              Active    
## # … with 20 more rows

3. Create a new scaled version of age and gender return only those new variables:

dat1 %>% 
  transmute(age_scaled = scale(age),
            salary_scaled = scale(salary_k))
## # A tibble: 30 × 2
##    age_scaled[,1] salary_scaled[,1]
##             <dbl>             <dbl>
##  1         1.33              -0.243
##  2         1.40              -0.479
##  3         1.20              -0.448
##  4        -1.42              -0.190
##  5         0.350             -0.194
##  6         1.33              -0.620
##  7         1.01              -0.242
##  8        -0.897             -0.138
##  9         0.153             -0.251
## 10         0.0219            -0.266
## # … with 20 more rows

4. Create a scaled version of age and relocate the new variable before the existing age variable:

dat1 %>% 
  mutate(age_scaled = scale(age)) %>% 
  relocate(age_scaled, .before = age)
## # A tibble: 30 × 8
##    emp_id department salary_k age_scaled[,1]   age gender business_unit_id
##     <int> <chr>         <dbl>          <dbl> <int> <chr>  <chr>           
##  1 138211 HR             60.0         1.33      62 Female B01             
##  2 155878 HR             53.6         1.40      63 Female AH2             
##  3 107821 HR             54.4         1.20      60 Female AH2             
##  4 105305 HR             61.5        -1.42      20 Female B01             
##  5 136118 HR             61.4         0.350     47 Female B01             
##  6 151590 HR             49.8         1.33      62 Male   AH2             
##  7 104586 HR             60.1         1.01      57 Male   B01             
##  8 122418 HR             62.9        -0.897     28 Female AH2             
##  9 140606 HR             59.8         0.153     44 Female AH2             
## 10 130070 HR             59.4         0.0219    42 Female AH2             
## # … with 20 more rows, and 1 more variable: emp_status <chr>

Distinct & N_Distinct

We can use distinct() and n_distinct() to return the distinct records/values of a dataframe/variable, or to count the number of distinct values/records.

Note: The distinct() function takes in a dataframe as input, while n_distinct() takes in vectors. This impacts how each can be used, as shown below.

Distinct

1. Return the distinct department values:

dat1 %>% 
  distinct(department)
## # A tibble: 3 × 1
##   department
##   <chr>     
## 1 HR        
## 2 IT        
## 3 Sales

2. Return the distinct combinations of department & employee status:

dat1 %>% 
  distinct(department, emp_status)
## # A tibble: 5 × 2
##   department emp_status
##   <chr>      <chr>     
## 1 HR         Active    
## 2 HR         Termed    
## 3 IT         Active    
## 4 Sales      Active    
## 5 Sales      Termed

3. Return the records that are distinct:

dat1 %>% 
  distinct()
## # A tibble: 30 × 7
##    emp_id department salary_k   age gender business_unit_id emp_status
##     <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
##  1 138211 HR             60.0    62 Female B01              Active    
##  2 155878 HR             53.6    63 Female AH2              Active    
##  3 107821 HR             54.4    60 Female AH2              Active    
##  4 105305 HR             61.5    20 Female B01              Active    
##  5 136118 HR             61.4    47 Female B01              Termed    
##  6 151590 HR             49.8    62 Male   AH2              Active    
##  7 104586 HR             60.1    57 Male   B01              Active    
##  8 122418 HR             62.9    28 Female AH2              Active    
##  9 140606 HR             59.8    44 Female AH2              Active    
## 10 130070 HR             59.4    42 Female AH2              Active    
## # … with 20 more rows

N_Distinct

To illustrate how n_distinct() takes in vectors instead of dataframes, let’s see what happens when we try and use our usual piping:

1. Count the number of distinct gender identities:

dat1 %>% 
  n_distinct(gender)
## Error in list2(...): object 'gender' not found

Instead, we have two main options.

  1. We can use summarize() or mutate():
dat1 %>% 
  summarize(n_genders = n_distinct(gender))
## # A tibble: 1 × 1
##   n_genders
##       <int>
## 1         3
dat1 %>% 
  mutate(n_genders = n_distinct(gender))
## # A tibble: 30 × 8
##    emp_id department salary_k   age gender business_unit_id emp_status n_genders
##     <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>          <int>
##  1 138211 HR             60.0    62 Female B01              Active             3
##  2 155878 HR             53.6    63 Female AH2              Active             3
##  3 107821 HR             54.4    60 Female AH2              Active             3
##  4 105305 HR             61.5    20 Female B01              Active             3
##  5 136118 HR             61.4    47 Female B01              Termed             3
##  6 151590 HR             49.8    62 Male   AH2              Active             3
##  7 104586 HR             60.1    57 Male   B01              Active             3
##  8 122418 HR             62.9    28 Female AH2              Active             3
##  9 140606 HR             59.8    44 Female AH2              Active             3
## 10 130070 HR             59.4    42 Female AH2              Active             3
## # … with 20 more rows
  1. We can use a non-piping approach with the $ operator:
n_distinct(dat1$gender)
## [1] 3

Rename & Rename_With

We can use rename() to rename columns in our data. We can also use rename_with() to rename the columns with another function, like one that converts the names to upper or lower case for example.

Rename

1. Rename salary_k as ‘salary_thousands’, and emp_id as ‘id’:

dat1 %>% 
  rename(salary_thousands = salary_k,
         id = emp_id)
## # A tibble: 30 × 7
##        id department salary_thousands   age gender business_unit_id emp_status
##     <int> <chr>                 <dbl> <int> <chr>  <chr>            <chr>     
##  1 138211 HR                     60.0    62 Female B01              Active    
##  2 155878 HR                     53.6    63 Female AH2              Active    
##  3 107821 HR                     54.4    60 Female AH2              Active    
##  4 105305 HR                     61.5    20 Female B01              Active    
##  5 136118 HR                     61.4    47 Female B01              Termed    
##  6 151590 HR                     49.8    62 Male   AH2              Active    
##  7 104586 HR                     60.1    57 Male   B01              Active    
##  8 122418 HR                     62.9    28 Female AH2              Active    
##  9 140606 HR                     59.8    44 Female AH2              Active    
## 10 130070 HR                     59.4    42 Female AH2              Active    
## # … with 20 more rows

Rename_With

1: Convert column names to upper case:

dat1 %>% 
  rename_with(toupper)
## # A tibble: 30 × 7
##    EMP_ID DEPARTMENT SALARY_K   AGE GENDER BUSINESS_UNIT_ID EMP_STATUS
##     <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
##  1 138211 HR             60.0    62 Female B01              Active    
##  2 155878 HR             53.6    63 Female AH2              Active    
##  3 107821 HR             54.4    60 Female AH2              Active    
##  4 105305 HR             61.5    20 Female B01              Active    
##  5 136118 HR             61.4    47 Female B01              Termed    
##  6 151590 HR             49.8    62 Male   AH2              Active    
##  7 104586 HR             60.1    57 Male   B01              Active    
##  8 122418 HR             62.9    28 Female AH2              Active    
##  9 140606 HR             59.8    44 Female AH2              Active    
## 10 130070 HR             59.4    42 Female AH2              Active    
## # … with 20 more rows

2: Convert column names ending in “id” to upper case:

dat1 %>% 
  rename_with(toupper, ends_with("id"))
## # A tibble: 30 × 7
##    EMP_ID department salary_k   age gender BUSINESS_UNIT_ID emp_status
##     <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
##  1 138211 HR             60.0    62 Female B01              Active    
##  2 155878 HR             53.6    63 Female AH2              Active    
##  3 107821 HR             54.4    60 Female AH2              Active    
##  4 105305 HR             61.5    20 Female B01              Active    
##  5 136118 HR             61.4    47 Female B01              Termed    
##  6 151590 HR             49.8    62 Male   AH2              Active    
##  7 104586 HR             60.1    57 Male   B01              Active    
##  8 122418 HR             62.9    28 Female AH2              Active    
##  9 140606 HR             59.8    44 Female AH2              Active    
## 10 130070 HR             59.4    42 Female AH2              Active    
## # … with 20 more rows

**3: Rename column names by replacing “_” with “.”:**

dat1 %>% 
  rename_with(~gsub(pattern = "_",replacement = ".", x = .x))
## # A tibble: 30 × 7
##    emp.id department salary.k   age gender business.unit.id emp.status
##     <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
##  1 138211 HR             60.0    62 Female B01              Active    
##  2 155878 HR             53.6    63 Female AH2              Active    
##  3 107821 HR             54.4    60 Female AH2              Active    
##  4 105305 HR             61.5    20 Female B01              Active    
##  5 136118 HR             61.4    47 Female B01              Termed    
##  6 151590 HR             49.8    62 Male   AH2              Active    
##  7 104586 HR             60.1    57 Male   B01              Active    
##  8 122418 HR             62.9    28 Female AH2              Active    
##  9 140606 HR             59.8    44 Female AH2              Active    
## 10 130070 HR             59.4    42 Female AH2              Active    
## # … with 20 more rows

Chaining Data Manipulation Steps

We can also chain manipulation steps/functions together using multiple pipe operators. We saw an example of this above when we created a new age variable and then re-positioned the new column. The possibilities for using chained data manipulation pipelines is endless, but here are a few simpler examples:

1. Filter for Male employees in the IT department and sort by salary from highest to lowest:

dat1 %>% 
  filter(department=="IT" & gender=="Male") %>% 
  arrange(desc(salary_k))
## # A tibble: 5 × 7
##   emp_id department salary_k   age gender business_unit_id emp_status
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
## 1 150894 IT            112.     35 Male   B01              Active    
## 2 139110 IT            103.     40 Male   AH2              Active    
## 3 130977 IT             99.5    44 Male   AH2              Active    
## 4 114776 IT             95.5    39 Male   AH2              Active    
## 5 150997 IT             93.5    24 Male   AH2              Active

2. Sort employees by salary from highest to lowest and then select the employees with the top 3 salaries:

dat1 %>% 
  arrange(desc(salary_k)) %>% 
  slice(1:3)
## # A tibble: 3 × 7
##   emp_id department salary_k   age gender business_unit_id emp_status
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>     
## 1 150894 IT             112.    35 Male   B01              Active    
## 2 116113 IT             104.    20 Female AH2              Active    
## 3 139110 IT             103.    40 Male   AH2              Active

3. Create a proposed salary increase of 15%, pull the rows associated with the top 3 proposed salaries, and then select only the employee id, current salary, and proposed salary variables for those employees:

dat1 %>% 
  mutate(salary_proposed = round(salary_k*1.15, 2)) %>% 
  slice_max(order_by = salary_proposed, n = 3) %>% 
  select(emp_id, salary_k, salary_proposed)
## # A tibble: 3 × 3
##   emp_id salary_k salary_proposed
##    <int>    <dbl>           <dbl>
## 1 150894     112.            129.
## 2 116113     104.            120.
## 3 139110     103.            119.

4. Filter for only those employees who make less than 80% of the overall average salary, count the gender frequencies for those employees, and then create a new column converting the gender count into a percentage:

dat1 %>% 
  filter(salary_k < mean(salary_k)*.8) %>% 
  count(gender, name = "employee_count") %>% 
  mutate(pct_emp_count = round(employee_count/sum(employee_count),2))
## # A tibble: 2 × 3
##   gender employee_count pct_emp_count
##   <chr>           <int>         <dbl>
## 1 Female              5          0.71
## 2 Male                2          0.29

Group_by & Ungroup

We can use group_by() to group our data by one or more variables, and then perform data manipulation steps at the group-level. When doing so, we can either use summarize() to compute a group-level summary statistic as our output, or we can use mutate() to add the group-level variable(s) alongside our pre-existing non-grouped data. Lastly, we can use ungroup() to un-group our data if we wish to be able perform subsequent operations that are not at the group level.

Group By & Summarize

1. Group employees by gender and calculate the avg salary:

dat1 %>% 
  group_by(gender) %>% 
  summarize(avg_salary = mean(salary_k))
## # A tibble: 3 × 2
##   gender     avg_salary
##   <chr>           <dbl>
## 1 Female           59.5
## 2 Male             79.6
## 3 Non-binary       76.6

2. Group employees by department and calculate the the proportion of employees in each department that identify as Males:

dat1 %>% 
  group_by(department) %>% 
  summarize(prop_male = mean(gender=="Male"))
## # A tibble: 3 × 2
##   department prop_male
##   <chr>          <dbl>
## 1 HR               0.2
## 2 IT               0.5
## 3 Sales            0.2

Group By & Mutate

1. Group by gender and calculate the average age, but capture this as a new variable added to the existing non-grouped dataset:

dat1 %>% 
  group_by(gender) %>% 
  mutate(avg_age = round(mean(age),2)) %>% 
  relocate(avg_age,.before = age)
## # A tibble: 30 × 8
## # Groups:   gender [3]
##    emp_id department salary_k avg_age   age gender business_unit_id emp_status
##     <int> <chr>         <dbl>   <dbl> <int> <chr>  <chr>            <chr>     
##  1 138211 HR             60.0    41      62 Female B01              Active    
##  2 155878 HR             53.6    41      63 Female AH2              Active    
##  3 107821 HR             54.4    41      60 Female AH2              Active    
##  4 105305 HR             61.5    41      20 Female B01              Active    
##  5 136118 HR             61.4    41      47 Female B01              Termed    
##  6 151590 HR             49.8    43.2    62 Male   AH2              Active    
##  7 104586 HR             60.1    43.2    57 Male   B01              Active    
##  8 122418 HR             62.9    41      28 Female AH2              Active    
##  9 140606 HR             59.8    41      44 Female AH2              Active    
## 10 130070 HR             59.4    41      42 Female AH2              Active    
## # … with 20 more rows

Multiple Grouping Variables

1. Calculate the avg salary by gender and department:

dat1 %>% 
  group_by(gender, department) %>% 
  summarize(avg_salary = round(mean(salary_k),2))
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.

## # A tibble: 8 × 3
## # Groups:   gender [3]
##   gender     department avg_salary
##   <chr>      <chr>           <dbl>
## 1 Female     HR               59.1
## 2 Female     IT              101. 
## 3 Female     Sales            36.2
## 4 Male       HR               54.9
## 5 Male       IT              101. 
## 6 Male       Sales            51.5
## 7 Non-binary IT               91.0
## 8 Non-binary Sales            62.1

Group By & UnGroup

1. Calculate the avg age by gender as a new variable added to the non-grouped data, and then create another variable representing the overall avg age:

dat1 %>% 
  group_by(gender) %>% 
  mutate(avg_age_grped = round(mean(age),2)) %>% 
  ungroup() %>% 
  mutate(avg_age_overall = round(mean(age),2)) %>% 
  relocate(c(avg_age_overall, avg_age_grped),.before = age) 
## # A tibble: 30 × 9
##    emp_id department salary_k avg_age_overall avg_age_grped   age gender
##     <int> <chr>         <dbl>           <dbl>         <dbl> <int> <chr> 
##  1 138211 HR             60.0            41.7          41      62 Female
##  2 155878 HR             53.6            41.7          41      63 Female
##  3 107821 HR             54.4            41.7          41      60 Female
##  4 105305 HR             61.5            41.7          41      20 Female
##  5 136118 HR             61.4            41.7          41      47 Female
##  6 151590 HR             49.8            41.7          43.2    62 Male  
##  7 104586 HR             60.1            41.7          43.2    57 Male  
##  8 122418 HR             62.9            41.7          41      28 Female
##  9 140606 HR             59.8            41.7          41      44 Female
## 10 130070 HR             59.4            41.7          41      42 Female
## # … with 20 more rows, and 2 more variables: business_unit_id <chr>,
## #   emp_status <chr>

Group By a Modified/Calculated Column

1. Calculate the avg salary by a newly-created age bucket that is a function of the existing age variable:

dat1 %>% 
  group_by(age_bucket = cut(age,2)) %>% 
  summarize(avg_salary = round(mean(salary_k),2))
## # A tibble: 2 × 2
##   age_bucket avg_salary
##   <fct>           <dbl>
## 1 (20,41.5]        67.7
## 2 (41.5,63]        65.7

Count

We can use count() to quickly create crosstabs/frequencies of data values by single or multiple variables.

1. Tabulate the number of employees/records in each business unit:

dat1 %>% 
  count(business_unit_id)
## # A tibble: 2 × 2
##   business_unit_id     n
##   <chr>            <int>
## 1 AH2                 19
## 2 B01                 11

2. Create a crosstab counting the number of employees/records in unique combination of business unit and department:

dat1 %>% 
  count(business_unit_id,department)
## # A tibble: 6 × 3
##   business_unit_id department     n
##   <chr>            <chr>      <int>
## 1 AH2              HR             6
## 2 AH2              IT             8
## 3 AH2              Sales          5
## 4 B01              HR             4
## 5 B01              IT             2
## 6 B01              Sales          5

3. Crosstab the number of employees by salary bracket and sort results by largest number of employees:

dat1 %>% 
  count(cut(salary_k, breaks = 3), sort = T) %>% 
  rename(salary_bucket = `cut(salary_k, breaks = 3)`)
## # A tibble: 3 × 2
##   salary_bucket     n
##   <fct>         <int>
## 1 (45.3,78.8]      14
## 2 (78.8,112]       10
## 3 (11.7,45.3]       6

Note: When used on factors, count() has a .drop = FALSE argument default that will cause empty factor levels to be dropped. This argument can be switched to .drop = T,


Let’s take a look at two quick examples. First, let’s convert our gender variable to a factor and add in a new level that no employees actually belong to:

dat1$gender <- factor(dat1$gender, 
                      levels = c("Female","Male",
                                 "Non-binary","Declined"))

levels(dat1$gender)
## [1] "Female"     "Male"       "Non-binary" "Declined"

1. Count the number of employees by gender grouping:

dat1 %>% 
  count(gender)
## # A tibble: 3 × 2
##   gender         n
##   <fct>      <int>
## 1 Female        19
## 2 Male           9
## 3 Non-binary     2

2. Count the number of employees by gender, INCLUDING the gender groups that are empty in our dataset:

dat1 %>% 
  count(gender, .drop = FALSE)
## # A tibble: 4 × 2
##   gender         n
##   <fct>      <int>
## 1 Female        19
## 2 Male           9
## 3 Non-binary     2
## 4 Declined       0

Thanks for reading this tutorial!