Gordon Goodwin
This tutorial is the second in a multi-part series covering how to manipulate data using dplyr. This second part builds on the fundamentals covered in Part I and will focus on column-wise & row-wise operations.
Let’s first create a sample dataset consisting of some basic employee data:
set.seed(999)
dat1 <- tibble(emp_id = sample(100000:200000,
size = 30),
department = c(rep("HR",10),
rep("IT",10),
rep("Sales",10)),
salary_k = round(c(rnorm(10,60,5),
rnorm(10,100,7),
rnorm(10,40,20)),2),
age = sample(20:66,30,T),
gender = sample(c("Male","Female","Non-binary"),30,T,
prob = c(0.45,0.45,0.1)),
business_unit_id = sample(c("B01","AH2"),30,T),
emp_status = sample(c("Active","Termed"),30,T,
prob = c(0.8,0.2)),
tenure_yrs = sample(1:15,30,TRUE),
jobsat = sample(1:5,30,T,prob = c(0.1,0.25,0.3,0.25,0.1)),
interv_score = sample(1:5,30,T),
test_score = sample(1:5,30,T),
proj_score = sample(1:5,30,T))
head(dat1,3)## # A tibble: 3 × 12
## emp_id department salary_k age gender business_unit_id emp_status tenure_yrs
## <int> <chr> <dbl> <int> <chr> <chr> <chr> <int>
## 1 138211 HR 60.0 62 Female B01 Active 10
## 2 155878 HR 53.6 63 Female AH2 Active 5
## 3 107821 HR 54.4 60 Female AH2 Active 5
## # … with 4 more variables: jobsat <int>, interv_score <int>, test_score <int>,
## # proj_score <int>
Column-wise transformations are when we perform a transformation to multiple columns simultaneously, instead of having to manually apply the fn(s) to each column separately. Column-wise operations can be performed in dplyr using across() and its helper functions.
Let’s first take a look at a manual example where we calculate a group average for multiple column variables:
Calculate the avg age, salary, and tenure by department:
dat1 %>%
group_by(department) %>%
summarize(avg_age = round(mean(age), 2),
avg_salary = round(mean(salary_k), 2),
avg_tenure = round(mean(tenure_yrs), 2))## # A tibble: 3 × 4
## department avg_age avg_salary avg_tenure
## <chr> <dbl> <dbl> <dbl>
## 1 HR 48.5 58.3 5.9
## 2 IT 39.3 99.9 7.5
## 3 Sales 37.2 41.8 9.1
That might not seem that bad, but imagine trying to do this when you have dozens of variables!
We can automate the previous example with across().
Calculate the avg age, salary, and tenure by department:
dat1 %>%
group_by(department) %>%
summarize(across(.cols = c(age, salary_k, tenure_yrs),
.fns = mean,
.names = "{col}_avg")) ## # A tibble: 3 × 4
## department age_avg salary_k_avg tenure_yrs_avg
## <chr> <dbl> <dbl> <dbl>
## 1 HR 48.5 58.3 5.9
## 2 IT 39.3 99.9 7.5
## 3 Sales 37.2 41.9 9.1
Across() has 2 primary arguments:
There’s also an optional secondary argument:
We can specify fns explicitly, as above, or we can use lamdas/formulas:
Calculate avg age, salary, and tenure by department, with the average rounded to 2 decimal places:
dat1 %>%
group_by(department) %>%
summarize(across(.cols = c(age, salary_k, tenure_yrs),
.fns = ~round(mean(.x),2),
.names = "{.col}_rnd_avg"))## # A tibble: 3 × 4
## department age_rnd_avg salary_k_rnd_avg tenure_yrs_rnd_avg
## <chr> <dbl> <dbl> <dbl>
## 1 HR 48.5 58.3 5.9
## 2 IT 39.3 99.9 7.5
## 3 Sales 37.2 41.8 9.1
We can also use lists to call multiple functions:
Calculate the rounded avg and max values for age, salary, and tenure by department:
dat1 %>%
group_by(department) %>%
summarize(across(.cols = c(age, salary_k, tenure_yrs),
.fns = list(rnd_avg = ~round(mean(.x),2),
max = ~max(.x)),
.names = "{col}_{.fn}"))## # A tibble: 3 × 7
## department age_rnd_avg age_max salary_k_rnd_avg salary_k_max tenure_yrs_rnd_a…
## <chr> <dbl> <int> <dbl> <dbl> <dbl>
## 1 HR 48.5 63 58.3 62.9 5.9
## 2 IT 39.3 59 99.9 112. 7.5
## 3 Sales 37.2 63 41.8 63.3 9.1
## # … with 1 more variable: tenure_yrs_max <int>
We can either create the list of fns in the across() call, or we can define it before calling to across():
Compute your favorite descriptive stats for age, tenure, and salary by department:
favstats <- list(mean = ~round(mean(.x),2),
sd = ~round(sd(.x),2),
median = ~round(median(.x),2),
min = ~min(.x),
max = ~max(.x))
dat1 %>%
group_by(department) %>%
summarize(across(.cols = c(age, tenure_yrs, salary_k),
.fns = favstats,
.names = "{.fn}_{.col}"))## # A tibble: 3 × 16
## department mean_age sd_age median_age min_age max_age mean_tenure_yrs
## <chr> <dbl> <dbl> <dbl> <int> <int> <dbl>
## 1 HR 48.5 15.2 52 20 63 5.9
## 2 IT 39.3 14.5 39.5 20 59 7.5
## 3 Sales 37.2 15.0 31.5 23 63 9.1
## # … with 9 more variables: sd_tenure_yrs <dbl>, median_tenure_yrs <dbl>,
## # min_tenure_yrs <int>, max_tenure_yrs <int>, mean_salary_k <dbl>,
## # sd_salary_k <dbl>, median_salary_k <dbl>, min_salary_k <dbl>,
## # max_salary_k <dbl>
We can use tidy-select helper functions and/or logical operators in conjunction with across() to specify columns more efficiently:
where(): Calculate the rounded average for all numeric columns by department except employee id:
dat1 %>%
group_by(department) %>%
summarize(across(where(is.numeric) & !emp_id,
.fns = list(mean = ~round(mean(.x, na.rm = T), 2)),
.names = "{.col}_{.fn}")) %>%
head(3)## # A tibble: 3 × 8
## department salary_k_mean age_mean tenure_yrs_mean jobsat_mean interv_score_me…
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 HR 58.3 48.5 5.9 2.4 3.3
## 2 IT 99.9 39.3 7.5 2.7 3.8
## 3 Sales 41.8 37.2 9.1 3.3 2.9
## # … with 2 more variables: test_score_mean <dbl>, proj_score_mean <dbl>
contains(): Find the number of distinct values by gender for any variable containing “id”
dat1 %>%
group_by(gender) %>%
summarize(across(.cols = contains("id"),
.fns = n_distinct))## # A tibble: 3 × 3
## gender emp_id business_unit_id
## <chr> <int> <int>
## 1 Female 19 2
## 2 Male 9 2
## 3 Non-binary 2 2
ends_with(): Rank each department’s employees by their interview, project, and test scores:
dat1 %>%
group_by(department) %>%
mutate(across(.cols = ends_with("score"),
.fns = ~dense_rank(.x),
.names = "{.col}_rank")) %>%
select(emp_id,department,contains("rank"))## # A tibble: 30 × 5
## # Groups: department [3]
## emp_id department interv_score_rank test_score_rank proj_score_rank
## <int> <chr> <int> <int> <int>
## 1 138211 HR 4 3 3
## 2 155878 HR 3 4 1
## 3 107821 HR 4 4 3
## 4 105305 HR 3 1 2
## 5 136118 HR 2 1 3
## 6 151590 HR 4 2 2
## 7 104586 HR 2 1 3
## 8 122418 HR 2 3 1
## 9 140606 HR 1 4 4
## 10 130070 HR 4 3 1
## # … with 20 more rows
head(3)## [1] 3
We can also employ context-dependent transformations by accessing the name of the “current” column via cur_column().
Apply a multiplier to each employee’s tenure and salary, where the multiplier is is different for tenure (1.2x) and salary (1.5x):
mult <- list(tenure_yrs = 1.2, salary_k = 1.5)
dat1 %>%
mutate(across(.cols = all_of(names(mult)),
.fns = list(raise = ~ .x * mult[[cur_column()]]),
.names = "{.col}_{.fn}")) %>%
select(emp_id,salary_k, salary_k_raise,
tenure_yrs, tenure_yrs_raise) %>%
head()## # A tibble: 6 × 5
## emp_id salary_k salary_k_raise tenure_yrs tenure_yrs_raise
## <int> <dbl> <dbl> <int> <dbl>
## 1 138211 60.0 90.1 10 12
## 2 155878 53.6 80.4 5 6
## 3 107821 54.4 81.7 5 6
## 4 105305 61.5 92.2 5 6
## 5 136118 61.4 92.1 1 1.2
## 6 151590 49.8 74.6 10 12
Across() can be used with any data-masking fn, not just summarize() and mutate().
distinct(): Return the distinct values/levels for each character/factor variables EXCEPT business unit code:
dat1 %>%
distinct(across(where(is.character) & !business_unit_id)) %>%
head(3)## # A tibble: 3 × 3
## department gender emp_status
## <chr> <chr> <chr>
## 1 HR Female Active
## 2 HR Female Termed
## 3 HR Male Active
count(): Count the number of employees in each character col level EXCEPT business unit code (e.g. # of emps by department, gender, etc):
dat1 %>%
count(across(.cols = where(is.character) & !business_unit_id)) %>%
head(3)## # A tibble: 3 × 4
## department gender emp_status n
## <chr> <chr> <chr> <int>
## 1 HR Female Active 7
## 2 HR Female Termed 1
## 3 HR Male Active 2
Mutate to Over-write: Replace all numeric cols with a version rounded to 0 decimal places:
dat1 %>%
mutate(across(where(is.numeric),
round)) %>%
select(where(is.numeric)) %>%
head(3)## # A tibble: 3 × 8
## emp_id salary_k age tenure_yrs jobsat interv_score test_score proj_score
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 138211 60 62 10 1 5 3 4
## 2 155878 54 63 5 5 3 4 2
## 3 107821 54 60 5 1 5 4 4
We can’t use across() and filter() directly together, but there are 2 helper functions:
We can use if_any() to filter for employees who meet a predicate condition in any of the specified columns.
Filter for employees who have EITHER an above-average salary OR job satisfaction value:
dat1 %>%
filter(if_any(.cols = c(salary_k,jobsat),
.fns = ~.x > mean(.x))) %>%
select(emp_id,salary_k,jobsat) %>%
arrange(desc(salary_k), desc(jobsat)) %>%
head(3)## # A tibble: 3 × 3
## emp_id salary_k jobsat
## <int> <dbl> <int>
## 1 150894 112. 4
## 2 116113 104. 1
## 3 139110 103. 3
Let’s verify the results using a more manual way:
dat1 %>%
mutate(avg_salary = mean(salary_k),
avg_jobsat = mean(jobsat)) %>%
select(emp_id,avg_salary,salary_k,avg_jobsat,jobsat) %>%
filter(salary_k > avg_salary | jobsat > avg_jobsat) %>%
arrange(desc(salary_k),desc(jobsat)) %>%
head(3)## # A tibble: 3 × 5
## emp_id avg_salary salary_k avg_jobsat jobsat
## <int> <dbl> <dbl> <dbl> <int>
## 1 150894 66.7 112. 2.8 4
## 2 116113 66.7 104. 2.8 1
## 3 139110 66.7 103. 2.8 3
We can use if_all() to filter for employees who meet a predicate condition in all of the specified columns.
Keep only employees who have BOTH an above-average salary AND job satisfaction value:
dat1 %>%
filter((if_all(.cols = c(salary_k,
jobsat),
.fns = ~.x > mean(.x)))) %>%
select(salary_k,jobsat) %>%
arrange(desc(salary_k),desc(jobsat)) %>%
head(3)## # A tibble: 3 × 2
## salary_k jobsat
## <dbl> <int>
## 1 112. 4
## 2 103. 3
## 3 102. 3
As before, we can verify the results manually:
dat1 %>%
mutate(avg_salary = round(mean(salary_k),2),
avg_jobsat = round(mean(jobsat),2)) %>%
filter(salary_k > avg_salary,
jobsat > avg_jobsat) %>%
select(emp_id,avg_salary,salary_k,avg_jobsat,jobsat) %>%
head(3)## # A tibble: 3 × 5
## emp_id avg_salary salary_k avg_jobsat jobsat
## <int> <dbl> <dbl> <dbl> <int>
## 1 114776 66.7 95.5 2.8 3
## 2 150894 66.7 112. 2.8 4
## 3 118784 66.7 102. 2.8 3
Here a few final column-wise examples:
Calculate your favorite descriptive stats across any variables that in your list of favorite variables:
favstats <- list(mean = ~round(mean(.x),2),
sd = ~round(sd(.x),2),
median = ~round(median(.x),2),
min = ~min(.x),
max = ~max(.x))
favcols <- c("salary_k","age","tenure_yrs","perf_review")
dat1 %>%
group_by(department) %>%
summarize(across(.cols = any_of(favcols),
.fns = favstats))## # A tibble: 3 × 16
## department salary_k_mean salary_k_sd salary_k_median salary_k_min salary_k_max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 HR 58.3 4.23 59.9 49.8 62.9
## 2 IT 99.9 6.17 101. 91.0 112.
## 3 Sales 41.8 19.0 38.8 11.8 63.3
## # … with 10 more variables: age_mean <dbl>, age_sd <dbl>, age_median <dbl>,
## # age_min <int>, age_max <int>, tenure_yrs_mean <dbl>, tenure_yrs_sd <dbl>,
## # tenure_yrs_median <dbl>, tenure_yrs_min <int>, tenure_yrs_max <int>
Filter for only employees who have complete records for ALL variables:
dat1 %>%
filter(if_all(.fns = ~!is.na(.x))) %>%
head(3)## # A tibble: 3 × 12
## emp_id department salary_k age gender business_unit_id emp_status tenure_yrs
## <int> <chr> <dbl> <int> <chr> <chr> <chr> <int>
## 1 138211 HR 60.0 62 Female B01 Active 10
## 2 155878 HR 53.6 63 Female AH2 Active 5
## 3 107821 HR 54.4 60 Female AH2 Active 5
## # … with 4 more variables: jobsat <int>, interv_score <int>, test_score <int>,
## # proj_score <int>
Here’s another way to perform the same operation:
Filter for only employees who have complete records for ALL variables:
dat1 %>%
filter(complete.cases(across(.cols = everything()))) %>%
head(3)## # A tibble: 3 × 12
## emp_id department salary_k age gender business_unit_id emp_status tenure_yrs
## <int> <chr> <dbl> <int> <chr> <chr> <chr> <int>
## 1 138211 HR 60.0 62 Female B01 Active 10
## 2 155878 HR 53.6 63 Female AH2 Active 5
## 3 107821 HR 54.4 60 Female AH2 Active 5
## # … with 4 more variables: jobsat <int>, interv_score <int>, test_score <int>,
## # proj_score <int>
Filter for employees who are missing ANY data and select the cols with missing data:
dat1 %>%
add_row(emp_id = 144333,department = "HR", salary_k = 65,
age = NA, gender = NA, business_unit_id = NA,
emp_status = "Active",tenure_yrs = 14,jobsat = 1) %>%
filter(if_any(.fns = ~is.na(.x))) %>%
select(emp_id, where(is.na)) ## # A tibble: 1 × 7
## emp_id age gender business_unit_id interv_score test_score proj_score
## <dbl> <int> <chr> <chr> <int> <int> <int>
## 1 144333 NA <NA> <NA> NA NA NA
We can use rowwise() to perform data transformations across each row/record.
Example: we want to create a total pre-hire score that is the sum of the interview, project, and test score values for each candidate.
Let’s see what happens when we try and use mutate() to create a total score column:
dat1 %>%
mutate(total_score = sum(interv_score,proj_score,test_score)) %>%
select(emp_id,contains("score")) %>%
head(3)## # A tibble: 3 × 5
## emp_id interv_score test_score proj_score total_score
## <int> <int> <int> <int> <int>
## 1 138211 5 3 4 270
## 2 155878 3 4 2 270
## 3 107821 5 4 4 270
Unfortunately, our total score column shown above only contains the grand total summed across all rows.
Let’s use rowwise() to get the summation for each row:
dat1 %>%
rowwise() %>%
mutate(total_score = sum(test_score,interv_score,proj_score)) %>%
select(emp_id, contains("score")) %>%
head(3)## # A tibble: 3 × 5
## # Rowwise:
## emp_id interv_score test_score proj_score total_score
## <int> <int> <int> <int> <int>
## 1 138211 5 3 4 12
## 2 155878 3 4 2 9
## 3 107821 5 4 4 13
Similarly to how we use across() in columnwise operations, we can use c_across() for row-wise contexts (as well as the same tidy-select helper functions):
dat1 %>%
rowwise() %>%
mutate(total_score = sum(c_across(cols = contains("score")))) %>%
head(3)## # A tibble: 3 × 13
## # Rowwise:
## emp_id department salary_k age gender business_unit_id emp_status tenure_yrs
## <int> <chr> <dbl> <int> <chr> <chr> <chr> <int>
## 1 138211 HR 60.0 62 Female B01 Active 10
## 2 155878 HR 53.6 63 Female AH2 Active 5
## 3 107821 HR 54.4 60 Female AH2 Active 5
## # … with 5 more variables: jobsat <int>, interv_score <int>, test_score <int>,
## # proj_score <int>, total_score <int>
We can also specify an optional identifier variable for each row-grouping when calling rowwise():
dat1 %>%
rowwise(emp_id) %>%
mutate(total_score = sum(c_across(contains("score")))) %>%
select(contains("score")) %>%
head(3)## Adding missing grouping variables: `emp_id`
## # A tibble: 3 × 5
## # Rowwise: emp_id
## emp_id interv_score test_score proj_score total_score
## <int> <int> <int> <int> <int>
## 1 138211 5 3 4 12
## 2 155878 3 4 2 9
## 3 107821 5 4 4 13
Used alone or with group_by(), summarize() will return a summary stat calculated across all rows, or a grouping of rows, under a single column.
Used with rowwise(), summarize() can provide a summary stat across columns within each row.
Summarize() Only: Calculate the overall avg salary across all rows in the dataset:
dat1 %>%
summarize(avg_salary = round(mean(salary_k),2))## # A tibble: 1 × 1
## avg_salary
## <dbl>
## 1 66.7
Group_by() and Summarize(): Calculate the avg salary across the rows associated w/each department:
dat1 %>%
group_by(department) %>%
summarize(avg_salary = round(mean(salary_k),2))## # A tibble: 3 × 2
## department avg_salary
## <chr> <dbl>
## 1 HR 58.3
## 2 IT 99.9
## 3 Sales 41.8
Rowwise() and Summarize(): Calculate the total score across all 3 candidate score columns for each row:
dat1 %>%
rowwise(emp_id) %>%
summarize(total_score = sum(c_across(contains("score")))) %>%
head(3)## `summarise()` has grouped output by 'emp_id'. You can override using the
## `.groups` argument.
## # A tibble: 3 × 2
## # Groups: emp_id [3]
## emp_id total_score
## <int> <int>
## 1 138211 12
## 2 155878 9
## 3 107821 13
We can also use rowwise() to repeatedly call a function with varying inputs and store the outputs next to their respective inputs.
Example: We want to simulate response (y) values for a linear model based on varying sets of parameters.
We can store our varying parameters in a dataframe, and then use rowwise() and mutate() to create a list-col containing sets of simulated outcomes:
set.seed(999)
x <- sample(1:20, size = 20, replace = T)
df <- tibble(a = c(-5,0,5),
b = c(1,2,5),
sigma = c(1,10,20))
head(df)## # A tibble: 3 × 3
## a b sigma
## <dbl> <dbl> <dbl>
## 1 -5 1 1
## 2 0 2 10
## 3 5 5 20
df %>%
rowwise() %>%
mutate(y = list(a + b*x + rnorm(n = length(x),
mean = 0,
sd = sigma))) -> df
df$y## [[1]]
## [1] -1.7538234 2.2118573 5.1217466 9.7607318 -4.1626509 5.8600398
## [7] -2.0685279 9.4367587 -0.6875366 0.4044542 1.3891443 15.1364714
## [13] 6.2450001 13.6936596 -3.0298866 0.7660021 15.5780852 6.9878264
## [19] 5.1244647 -0.4364653
##
## [[2]]
## [1] 3.180417 12.857102 9.225466 27.394907 -4.849864 23.033745 4.343193
## [8] 22.484959 28.344732 13.891738 4.249469 43.134170 12.280859 36.030004
## [15] 14.325024 27.265283 52.351406 23.380690 7.261221 11.752822
##
## [[3]]
## [1] 9.6252570 28.6199738 68.1448235 30.6929666 38.7382438 73.3709874
## [7] -7.6870670 114.6055186 46.7053505 37.8111396 10.9411508 96.1767258
## [13] 50.8625235 62.1513326 -0.5466383 27.6029307 151.5204460 74.3736682
## [19] 35.4388964 6.3031827
We could then plot the data generated from each parameter set:
for(i in seq_along(df$y)){
plot(x,df$y[[i]])
abline(a = df$a[[i]],
b = df$b[[i]])
}We can also perform both columnwise and rowwise operations in a single transformation pipeline.
Example: calculate the total candidate score across all 3 score columns for each candidate/row, and then compute the proportionate contribution to that total score that each sub-score column represents:
dat1 %>%
rowwise() %>%
mutate(total_score = sum(c_across(contains("score")))) %>%
ungroup() %>%
mutate(across(.cols = contains("score") & !total_score,
.fns = ~round(./total_score,2),
.names = "{.col}_pct_tot")) %>%
select(emp_id,
contains("score")) %>%
head(3)## # A tibble: 3 × 8
## emp_id interv_score test_score proj_score total_score interv_score_pct_tot
## <int> <int> <int> <int> <int> <dbl>
## 1 138211 5 3 4 12 0.42
## 2 155878 3 4 2 9 0.33
## 3 107821 5 4 4 13 0.38
## # … with 2 more variables: test_score_pct_tot <dbl>, proj_score_pct_tot <dbl>
Thanks for following along!