Data Manipulation Across Columns & Rows

Gordon Goodwin

Column-Wise & Row-Wise Operations with dplyr

This tutorial is the second in a multi-part series covering how to manipulate data using dplyr. This second part builds on the fundamentals covered in Part I and will focus on column-wise & row-wise operations.

Sample Data

Let’s first create a sample dataset consisting of some basic employee data:

set.seed(999)

dat1 <- tibble(emp_id = sample(100000:200000,
                               size = 30),
               department = c(rep("HR",10),
                              rep("IT",10),
                              rep("Sales",10)),
               salary_k = round(c(rnorm(10,60,5),
                                  rnorm(10,100,7),
                                  rnorm(10,40,20)),2),
               age = sample(20:66,30,T),
               gender = sample(c("Male","Female","Non-binary"),30,T,
                               prob = c(0.45,0.45,0.1)),
               business_unit_id = sample(c("B01","AH2"),30,T),
               emp_status = sample(c("Active","Termed"),30,T,
                                   prob = c(0.8,0.2)),
               tenure_yrs = sample(1:15,30,TRUE),
               jobsat = sample(1:5,30,T,prob = c(0.1,0.25,0.3,0.25,0.1)),
               interv_score = sample(1:5,30,T),
               test_score = sample(1:5,30,T),
               proj_score = sample(1:5,30,T))

head(dat1,3)

## # A tibble: 3 × 12
##   emp_id department salary_k   age gender business_unit_id emp_status tenure_yrs
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>           <int>
## 1 138211 HR             60.0    62 Female B01              Active             10
## 2 155878 HR             53.6    63 Female AH2              Active              5
## 3 107821 HR             54.4    60 Female AH2              Active              5
## # … with 4 more variables: jobsat <int>, interv_score <int>, test_score <int>,
## #   proj_score <int>

Column-wise Operations

Column-wise transformations are when we perform a transformation to multiple columns simultaneously, instead of having to manually apply the fn(s) to each column separately. Column-wise operations can be performed in dplyr using across() and its helper functions.

Manual Example

Let’s first take a look at a manual example where we calculate a group average for multiple column variables:

Calculate the avg age, salary, and tenure by department:

dat1 %>% 
  group_by(department) %>% 
  summarize(avg_age = round(mean(age), 2),
            avg_salary = round(mean(salary_k), 2),
            avg_tenure = round(mean(tenure_yrs), 2))

## # A tibble: 3 × 4
##   department avg_age avg_salary avg_tenure
##   <chr>        <dbl>      <dbl>      <dbl>
## 1 HR            48.5       58.3        5.9
## 2 IT            39.3       99.9        7.5
## 3 Sales         37.2       41.8        9.1

That might not seem that bad, but imagine trying to do this when you have dozens of variables!

Across()

We can automate the previous example with across().

Calculate the avg age, salary, and tenure by department:

dat1 %>% 
  group_by(department) %>% 
  summarize(across(.cols = c(age, salary_k, tenure_yrs),
                .fns = mean,
                .names = "{col}_avg"))

## # A tibble: 3 × 4
##   department age_avg salary_k_avg tenure_yrs_avg
##   <chr>        <dbl>        <dbl>          <dbl>
## 1 HR            48.5         58.3            5.9
## 2 IT            39.3         99.9            7.5
## 3 Sales         37.2         41.9            9.1

Across() has 2 primary arguments:

.cols: selects the columns you want to operate on
- Cols can be specified by position, name, or type
.fns: function or list of fns to apply to each col
- Can also be a purrr-style formula/list of formulas

There’s also an optional secondary argument:

.names: specifies the names of the output columns
- easiest way is often to concatenate the columns & fns

We can specify fns explicitly, as above, or we can use lamdas/formulas:

Calculate avg age, salary, and tenure by department, with the average rounded to 2 decimal places:

dat1 %>% 
  group_by(department) %>% 
  summarize(across(.cols = c(age, salary_k, tenure_yrs),
                   .fns = ~round(mean(.x),2),
                   .names = "{.col}_rnd_avg"))

## # A tibble: 3 × 4
##   department age_rnd_avg salary_k_rnd_avg tenure_yrs_rnd_avg
##   <chr>            <dbl>            <dbl>              <dbl>
## 1 HR                48.5             58.3                5.9
## 2 IT                39.3             99.9                7.5
## 3 Sales             37.2             41.8                9.1

We can also use lists to call multiple functions:

Calculate the rounded avg and max values for age, salary, and tenure by department:

dat1 %>% 
  group_by(department) %>% 
  summarize(across(.cols = c(age, salary_k, tenure_yrs), 
                   .fns = list(rnd_avg = ~round(mean(.x),2),
                        max = ~max(.x)),
                   .names = "{col}_{.fn}"))

## # A tibble: 3 × 7
##   department age_rnd_avg age_max salary_k_rnd_avg salary_k_max tenure_yrs_rnd_a…
##   <chr>            <dbl>   <int>            <dbl>        <dbl>             <dbl>
## 1 HR                48.5      63             58.3         62.9               5.9
## 2 IT                39.3      59             99.9        112.                7.5
## 3 Sales             37.2      63             41.8         63.3               9.1
## # … with 1 more variable: tenure_yrs_max <int>

We can either create the list of fns in the across() call, or we can define it before calling to across():

Compute your favorite descriptive stats for age, tenure, and salary by department:

favstats <- list(mean = ~round(mean(.x),2),
                 sd = ~round(sd(.x),2),
                 median = ~round(median(.x),2),
                 min = ~min(.x),
                 max = ~max(.x))

dat1 %>% 
  group_by(department) %>% 
  summarize(across(.cols = c(age, tenure_yrs, salary_k),
                .fns = favstats,
                .names = "{.fn}_{.col}"))

## # A tibble: 3 × 16
##   department mean_age sd_age median_age min_age max_age mean_tenure_yrs
##   <chr>         <dbl>  <dbl>      <dbl>   <int>   <int>           <dbl>
## 1 HR             48.5   15.2       52        20      63             5.9
## 2 IT             39.3   14.5       39.5      20      59             7.5
## 3 Sales          37.2   15.0       31.5      23      63             9.1
## # … with 9 more variables: sd_tenure_yrs <dbl>, median_tenure_yrs <dbl>,
## #   min_tenure_yrs <int>, max_tenure_yrs <int>, mean_salary_k <dbl>,
## #   sd_salary_k <dbl>, median_salary_k <dbl>, min_salary_k <dbl>,
## #   max_salary_k <dbl>

Across Helper Fns

We can use tidy-select helper functions and/or logical operators in conjunction with across() to specify columns more efficiently:

where(): Calculate the rounded average for all numeric columns by department except employee id:

dat1 %>%
  group_by(department) %>% 
  summarize(across(where(is.numeric) & !emp_id,
                   .fns = list(mean = ~round(mean(.x, na.rm = T), 2)),
                   .names = "{.col}_{.fn}")) %>% 
  head(3)

## # A tibble: 3 × 8
##   department salary_k_mean age_mean tenure_yrs_mean jobsat_mean interv_score_me…
##   <chr>              <dbl>    <dbl>           <dbl>       <dbl>            <dbl>
## 1 HR                  58.3     48.5             5.9         2.4              3.3
## 2 IT                  99.9     39.3             7.5         2.7              3.8
## 3 Sales               41.8     37.2             9.1         3.3              2.9
## # … with 2 more variables: test_score_mean <dbl>, proj_score_mean <dbl>

contains(): Find the number of distinct values by gender for any variable containing “id”

dat1 %>% 
  group_by(gender) %>% 
  summarize(across(.cols = contains("id"),
                   .fns = n_distinct))

## # A tibble: 3 × 3
##   gender     emp_id business_unit_id
##   <chr>       <int>            <int>
## 1 Female         19                2
## 2 Male            9                2
## 3 Non-binary      2                2

ends_with(): Rank each department’s employees by their interview, project, and test scores:

dat1 %>% 
  group_by(department) %>% 
  mutate(across(.cols = ends_with("score"),
                .fns = ~dense_rank(.x),
                .names = "{.col}_rank")) %>% 
  select(emp_id,department,contains("rank"))

## # A tibble: 30 × 5
## # Groups:   department [3]
##    emp_id department interv_score_rank test_score_rank proj_score_rank
##     <int> <chr>                  <int>           <int>           <int>
##  1 138211 HR                         4               3               3
##  2 155878 HR                         3               4               1
##  3 107821 HR                         4               4               3
##  4 105305 HR                         3               1               2
##  5 136118 HR                         2               1               3
##  6 151590 HR                         4               2               2
##  7 104586 HR                         2               1               3
##  8 122418 HR                         2               3               1
##  9 140606 HR                         1               4               4
## 10 130070 HR                         4               3               1
## # … with 20 more rows

  head(3)

## [1] 3

Across & Context-Dependent Transform

We can also employ context-dependent transformations by accessing the name of the “current” column via cur_column().

Apply a multiplier to each employee’s tenure and salary, where the multiplier is is different for tenure (1.2x) and salary (1.5x):

mult <- list(tenure_yrs = 1.2, salary_k = 1.5)

dat1 %>% 
  mutate(across(.cols = all_of(names(mult)), 
                .fns = list(raise = ~ .x * mult[[cur_column()]]),
                .names = "{.col}_{.fn}")) %>% 
  select(emp_id,salary_k, salary_k_raise,
         tenure_yrs, tenure_yrs_raise) %>% 
  head()

## # A tibble: 6 × 5
##   emp_id salary_k salary_k_raise tenure_yrs tenure_yrs_raise
##    <int>    <dbl>          <dbl>      <int>            <dbl>
## 1 138211     60.0           90.1         10             12  
## 2 155878     53.6           80.4          5              6  
## 3 107821     54.4           81.7          5              6  
## 4 105305     61.5           92.2          5              6  
## 5 136118     61.4           92.1          1              1.2
## 6 151590     49.8           74.6         10             12

Across & Other Fns

Across() can be used with any data-masking fn, not just summarize() and mutate().

Examples

distinct(): Return the distinct values/levels for each character/factor variables EXCEPT business unit code:

dat1 %>% 
  distinct(across(where(is.character) & !business_unit_id)) %>% 
  head(3)

## # A tibble: 3 × 3
##   department gender emp_status
##   <chr>      <chr>  <chr>     
## 1 HR         Female Active    
## 2 HR         Female Termed    
## 3 HR         Male   Active

count(): Count the number of employees in each character col level EXCEPT business unit code (e.g. # of emps by department, gender, etc):

dat1 %>% 
  count(across(.cols = where(is.character) & !business_unit_id)) %>% 
  head(3)

## # A tibble: 3 × 4
##   department gender emp_status     n
##   <chr>      <chr>  <chr>      <int>
## 1 HR         Female Active         7
## 2 HR         Female Termed         1
## 3 HR         Male   Active         2

Mutate to Over-write: Replace all numeric cols with a version rounded to 0 decimal places:

dat1 %>% 
  mutate(across(where(is.numeric),
                round)) %>% 
  select(where(is.numeric)) %>% 
  head(3)

## # A tibble: 3 × 8
##   emp_id salary_k   age tenure_yrs jobsat interv_score test_score proj_score
##    <dbl>    <dbl> <dbl>      <dbl>  <dbl>        <dbl>      <dbl>      <dbl>
## 1 138211       60    62         10      1            5          3          4
## 2 155878       54    63          5      5            3          4          2
## 3 107821       54    60          5      1            5          4          4

Across & Filter: If_Any and If_All

We can’t use across() and filter() directly together, but there are 2 helper functions:

if_any(): keeps rows where predicate fn (filter fn) is true for at least one column
if_all(): keeps rows where predicate fn is true for all columns

If_Any()

We can use if_any() to filter for employees who meet a predicate condition in any of the specified columns.

Filter for employees who have EITHER an above-average salary OR job satisfaction value:

dat1 %>% 
  filter(if_any(.cols = c(salary_k,jobsat),
                .fns = ~.x > mean(.x))) %>% 
  select(emp_id,salary_k,jobsat) %>% 
  arrange(desc(salary_k), desc(jobsat)) %>% 
  head(3)

## # A tibble: 3 × 3
##   emp_id salary_k jobsat
##    <int>    <dbl>  <int>
## 1 150894     112.      4
## 2 116113     104.      1
## 3 139110     103.      3

Let’s verify the results using a more manual way:

dat1 %>% 
  mutate(avg_salary = mean(salary_k),
         avg_jobsat = mean(jobsat)) %>% 
  select(emp_id,avg_salary,salary_k,avg_jobsat,jobsat) %>% 
  filter(salary_k > avg_salary | jobsat > avg_jobsat) %>% 
  arrange(desc(salary_k),desc(jobsat)) %>% 
  head(3)

## # A tibble: 3 × 5
##   emp_id avg_salary salary_k avg_jobsat jobsat
##    <int>      <dbl>    <dbl>      <dbl>  <int>
## 1 150894       66.7     112.        2.8      4
## 2 116113       66.7     104.        2.8      1
## 3 139110       66.7     103.        2.8      3

If_All()

We can use if_all() to filter for employees who meet a predicate condition in all of the specified columns.

Keep only employees who have BOTH an above-average salary AND job satisfaction value:

dat1 %>% 
  filter((if_all(.cols = c(salary_k,
                           jobsat),
                 .fns = ~.x > mean(.x)))) %>% 
  select(salary_k,jobsat) %>% 
  arrange(desc(salary_k),desc(jobsat)) %>% 
  head(3)

## # A tibble: 3 × 2
##   salary_k jobsat
##      <dbl>  <int>
## 1     112.      4
## 2     103.      3
## 3     102.      3

As before, we can verify the results manually:

dat1 %>% 
  mutate(avg_salary = round(mean(salary_k),2),
         avg_jobsat = round(mean(jobsat),2)) %>% 
  filter(salary_k > avg_salary,
         jobsat > avg_jobsat) %>% 
  select(emp_id,avg_salary,salary_k,avg_jobsat,jobsat) %>% 
  head(3)

## # A tibble: 3 × 5
##   emp_id avg_salary salary_k avg_jobsat jobsat
##    <int>      <dbl>    <dbl>      <dbl>  <int>
## 1 114776       66.7     95.5        2.8      3
## 2 150894       66.7    112.         2.8      4
## 3 118784       66.7    102.         2.8      3

Additional Examples

Here a few final column-wise examples:

Calculate your favorite descriptive stats across any variables that in your list of favorite variables:

favstats <- list(mean = ~round(mean(.x),2),
                 sd = ~round(sd(.x),2),
                 median = ~round(median(.x),2),
                 min = ~min(.x),
                 max = ~max(.x))

favcols <- c("salary_k","age","tenure_yrs","perf_review")

dat1 %>% 
  group_by(department) %>% 
  summarize(across(.cols = any_of(favcols),
                   .fns = favstats))

## # A tibble: 3 × 16
##   department salary_k_mean salary_k_sd salary_k_median salary_k_min salary_k_max
##   <chr>              <dbl>       <dbl>           <dbl>        <dbl>        <dbl>
## 1 HR                  58.3        4.23            59.9         49.8         62.9
## 2 IT                  99.9        6.17           101.          91.0        112. 
## 3 Sales               41.8       19.0             38.8         11.8         63.3
## # … with 10 more variables: age_mean <dbl>, age_sd <dbl>, age_median <dbl>,
## #   age_min <int>, age_max <int>, tenure_yrs_mean <dbl>, tenure_yrs_sd <dbl>,
## #   tenure_yrs_median <dbl>, tenure_yrs_min <int>, tenure_yrs_max <int>

Filter for only employees who have complete records for ALL variables:

dat1 %>% 
  filter(if_all(.fns = ~!is.na(.x))) %>% 
  head(3)

## # A tibble: 3 × 12
##   emp_id department salary_k   age gender business_unit_id emp_status tenure_yrs
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>           <int>
## 1 138211 HR             60.0    62 Female B01              Active             10
## 2 155878 HR             53.6    63 Female AH2              Active              5
## 3 107821 HR             54.4    60 Female AH2              Active              5
## # … with 4 more variables: jobsat <int>, interv_score <int>, test_score <int>,
## #   proj_score <int>

Here’s another way to perform the same operation:

Filter for only employees who have complete records for ALL variables:

dat1 %>% 
  filter(complete.cases(across(.cols = everything()))) %>% 
  head(3)

## # A tibble: 3 × 12
##   emp_id department salary_k   age gender business_unit_id emp_status tenure_yrs
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>           <int>
## 1 138211 HR             60.0    62 Female B01              Active             10
## 2 155878 HR             53.6    63 Female AH2              Active              5
## 3 107821 HR             54.4    60 Female AH2              Active              5
## # … with 4 more variables: jobsat <int>, interv_score <int>, test_score <int>,
## #   proj_score <int>

Filter for employees who are missing ANY data and select the cols with missing data:

dat1 %>% 
  add_row(emp_id = 144333,department = "HR", salary_k = 65, 
          age = NA, gender = NA, business_unit_id = NA, 
          emp_status = "Active",tenure_yrs = 14,jobsat = 1) %>% 
  filter(if_any(.fns = ~is.na(.x))) %>% 
  select(emp_id, where(is.na))

## # A tibble: 1 × 7
##   emp_id   age gender business_unit_id interv_score test_score proj_score
##    <dbl> <int> <chr>  <chr>                   <int>      <int>      <int>
## 1 144333    NA <NA>   <NA>                       NA         NA         NA

Row-wise()

We can use rowwise() to perform data transformations across each row/record.

rowwise() is basically a form of grouping with 1 group per 1 row

Example: we want to create a total pre-hire score that is the sum of the interview, project, and test score values for each candidate.

Let’s see what happens when we try and use mutate() to create a total score column:

dat1 %>% 
  mutate(total_score = sum(interv_score,proj_score,test_score)) %>% 
  select(emp_id,contains("score")) %>% 
  head(3)

## # A tibble: 3 × 5
##   emp_id interv_score test_score proj_score total_score
##    <int>        <int>      <int>      <int>       <int>
## 1 138211            5          3          4         270
## 2 155878            3          4          2         270
## 3 107821            5          4          4         270

Unfortunately, our total score column shown above only contains the grand total summed across all rows.

Let’s use rowwise() to get the summation for each row:

dat1 %>% 
  rowwise() %>% 
  mutate(total_score = sum(test_score,interv_score,proj_score)) %>% 
  select(emp_id, contains("score")) %>% 
  head(3)

## # A tibble: 3 × 5
## # Rowwise: 
##   emp_id interv_score test_score proj_score total_score
##    <int>        <int>      <int>      <int>       <int>
## 1 138211            5          3          4          12
## 2 155878            3          4          2           9
## 3 107821            5          4          4          13

Rowwise & C_Across()

Similarly to how we use across() in columnwise operations, we can use c_across() for row-wise contexts (as well as the same tidy-select helper functions):

dat1 %>% 
  rowwise() %>% 
  mutate(total_score = sum(c_across(cols = contains("score")))) %>% 
  head(3)

## # A tibble: 3 × 13
## # Rowwise: 
##   emp_id department salary_k   age gender business_unit_id emp_status tenure_yrs
##    <int> <chr>         <dbl> <int> <chr>  <chr>            <chr>           <int>
## 1 138211 HR             60.0    62 Female B01              Active             10
## 2 155878 HR             53.6    63 Female AH2              Active              5
## 3 107821 HR             54.4    60 Female AH2              Active              5
## # … with 5 more variables: jobsat <int>, interv_score <int>, test_score <int>,
## #   proj_score <int>, total_score <int>

Identifier Variables

We can also specify an optional identifier variable for each row-grouping when calling rowwise():

dat1 %>% 
  rowwise(emp_id) %>% 
  mutate(total_score = sum(c_across(contains("score")))) %>% 
  select(contains("score")) %>% 
  head(3)

## Adding missing grouping variables: `emp_id`

## # A tibble: 3 × 5
## # Rowwise:  emp_id
##   emp_id interv_score test_score proj_score total_score
##    <int>        <int>      <int>      <int>       <int>
## 1 138211            5          3          4          12
## 2 155878            3          4          2           9
## 3 107821            5          4          4          13

Rowwise() & Summarize()

Used alone or with group_by(), summarize() will return a summary stat calculated across all rows, or a grouping of rows, under a single column.
Used with rowwise(), summarize() can provide a summary stat across columns within each row.

Examples

Summarize() Only: Calculate the overall avg salary across all rows in the dataset:

dat1 %>% 
  summarize(avg_salary = round(mean(salary_k),2))

## # A tibble: 1 × 1
##   avg_salary
##        <dbl>
## 1       66.7

Group_by() and Summarize(): Calculate the avg salary across the rows associated w/each department:

dat1 %>% 
  group_by(department) %>% 
  summarize(avg_salary = round(mean(salary_k),2))

## # A tibble: 3 × 2
##   department avg_salary
##   <chr>           <dbl>
## 1 HR               58.3
## 2 IT               99.9
## 3 Sales            41.8

Rowwise() and Summarize(): Calculate the total score across all 3 candidate score columns for each row:

dat1 %>% 
  rowwise(emp_id) %>% 
  summarize(total_score = sum(c_across(contains("score")))) %>% 
  head(3)

## `summarise()` has grouped output by 'emp_id'. You can override using the
## `.groups` argument.

## # A tibble: 3 × 2
## # Groups:   emp_id [3]
##   emp_id total_score
##    <int>       <int>
## 1 138211          12
## 2 155878           9
## 3 107821          13

Rowwise & Repeated Fn Calls

We can also use rowwise() to repeatedly call a function with varying inputs and store the outputs next to their respective inputs.

Example: We want to simulate response (y) values for a linear model based on varying sets of parameters.

We can store our varying parameters in a dataframe, and then use rowwise() and mutate() to create a list-col containing sets of simulated outcomes:

set.seed(999)

x <- sample(1:20, size = 20, replace = T)

df <- tibble(a = c(-5,0,5),
             b = c(1,2,5),
             sigma = c(1,10,20))

head(df)

## # A tibble: 3 × 3
##       a     b sigma
##   <dbl> <dbl> <dbl>
## 1    -5     1     1
## 2     0     2    10
## 3     5     5    20

df %>% 
  rowwise() %>% 
  mutate(y = list(a + b*x + rnorm(n = length(x),
                                  mean = 0,
                                  sd = sigma))) -> df

df$y

## [[1]]
##  [1] -1.7538234  2.2118573  5.1217466  9.7607318 -4.1626509  5.8600398
##  [7] -2.0685279  9.4367587 -0.6875366  0.4044542  1.3891443 15.1364714
## [13]  6.2450001 13.6936596 -3.0298866  0.7660021 15.5780852  6.9878264
## [19]  5.1244647 -0.4364653
## 
## [[2]]
##  [1]  3.180417 12.857102  9.225466 27.394907 -4.849864 23.033745  4.343193
##  [8] 22.484959 28.344732 13.891738  4.249469 43.134170 12.280859 36.030004
## [15] 14.325024 27.265283 52.351406 23.380690  7.261221 11.752822
## 
## [[3]]
##  [1]   9.6252570  28.6199738  68.1448235  30.6929666  38.7382438  73.3709874
##  [7]  -7.6870670 114.6055186  46.7053505  37.8111396  10.9411508  96.1767258
## [13]  50.8625235  62.1513326  -0.5466383  27.6029307 151.5204460  74.3736682
## [19]  35.4388964   6.3031827

We could then plot the data generated from each parameter set:

for(i in seq_along(df$y)){
  plot(x,df$y[[i]])
    abline(a = df$a[[i]],
           b = df$b[[i]])
}

Combining Rowwise & Columnwise

We can also perform both columnwise and rowwise operations in a single transformation pipeline.

Example: calculate the total candidate score across all 3 score columns for each candidate/row, and then compute the proportionate contribution to that total score that each sub-score column represents:

dat1 %>% 
  rowwise() %>% 
  mutate(total_score = sum(c_across(contains("score")))) %>% 
  ungroup() %>% 
  mutate(across(.cols = contains("score") & !total_score,
                .fns = ~round(./total_score,2),
                .names = "{.col}_pct_tot")) %>% 
  select(emp_id,
         contains("score")) %>% 
  head(3)

## # A tibble: 3 × 8
##   emp_id interv_score test_score proj_score total_score interv_score_pct_tot
##    <int>        <int>      <int>      <int>       <int>                <dbl>
## 1 138211            5          3          4          12                 0.42
## 2 155878            3          4          2           9                 0.33
## 3 107821            5          4          4          13                 0.38
## # … with 2 more variables: test_score_pct_tot <dbl>, proj_score_pct_tot <dbl>

Thanks for following along!