Window Functions in R

Gordon Goodwin

Window Functions with dplyr

This tutorial will illustrate how to use window functions with dplyr. I highly recommend you also check out the dplyr window function vignette, which this guide follows.

Window Function Overview

A window function is a type of aggregation function that takes in n inputs and also returns n values. In this way, it differs from summary functions that take in n inputs but only return 1 value (e.g. mean, median, sum, etc.)

e.g. ranks, cumulative sums/means, lead/lags, etc.

We’ll cover 3 main types of window fns in this tutorial:

Ranking & Ordering
Offsets
Cumulatives

Sample Data

Let’s first create a sample dataset consisting of 10 weeks worth of sales data for 5 different sales employees:

set.seed(999)

dat1 <- tibble(emp_id = rep(c(100101,102423,105043,102341,111302),
                            rep(10, 5)),
               week = rep(1:10,5),
               sales = sample(0:10,50,T),
               calls = sample(100:300,50,T),
               mgr = c(rep("Bob",20),rep("Sally",20),rep("Omar",10)))

dat1 %>% 
  arrange(week,desc(sales)) %>% 
  head(5)

## # A tibble: 5 × 5
##   emp_id  week sales calls mgr  
##    <dbl> <int> <int> <int> <chr>
## 1 100101     1    10   186 Bob  
## 2 105043     1     9   222 Sally
## 3 111302     1     9   223 Omar 
## 4 102341     1     8   132 Sally
## 5 102423     1     7   116 Bob

Ranking & Ordering

Ranking and ordering functions take a vector to order by, and return various types of ranks. Importantly, they all map smallest inputs to smallest outputs, which means that the largest value will receive the largest rank. As we’ll see, this can be reversed with desc().

Rank/Order Fns

row_number(): equivalent to rank(x, ties.method = “first”)
- If tied, earliest entry will “win” (be assigned the lower rank).
min_rank(): equivalent to rank(x, ties.method = “min”)
- tied values receive the same (lowest) rank, creating gaps
- e.g., raw values = (10,7,3,3,2), ranks = (1,2,3,3,5)
dense_rank(): similar to min_rank() but without gaps
- e.g., raw values = (10,7,3,3,2), ranks = (1,2,3,3,4)
percent_rank(): converts to percentiles ranging from (0, 1)
- rescales min_rank() values
cume_dist(): cumulative distribution function
- propotion of values below the current value
ntile(): breaks input vector into n buckets
- e.g., (n = 10 -> deciles; n = 5 -> quintiles, etc.)
- can be arbitrary when assigning buckets for values at threshold

Rank/Order Example

Within each week, rank/order the sales employees by their weekly sale count from highest to lowest sales:

dat1 %>% 
  group_by(week) %>% 
  mutate(min_rank = min_rank(desc(sales)),
         row_numb = row_number(desc(sales)),
         dense_rank = dense_rank(desc(sales)),
         pct_rank = percent_rank((sales)),
         cume_dist = cume_dist((sales)),
         ntiles_5 = ntile(desc(sales),n = 5)) %>% 
  select(emp_id,week,sales,min_rank:ntiles_5) %>% 
  arrange(week, desc(sales)) -> dat1

dat1 %>% 
  head(5)

## # A tibble: 5 × 9
## # Groups:   week [1]
##   emp_id  week sales min_rank row_numb dense_rank pct_rank cume_dist ntiles_5
##    <dbl> <int> <int>    <int>    <int>      <int>    <dbl>     <dbl>    <int>
## 1 100101     1    10        1        1          1     1          1          1
## 2 105043     1     9        2        2          2     0.5        0.8        2
## 3 111302     1     9        2        3          2     0.5        0.8        3
## 4 102341     1     8        4        4          3     0.25       0.4        4
## 5 102423     1     7        5        5          4     0          0.2        5

Offsets

Offset fns find the next (lead) and previous (lag) values in a vector relative to the current value.

Offset Fn Example

For each sales employee and each of their sales weeks, record the employee’s sales for the previous and next weeks:

dat1 %>% 
  arrange(emp_id,week) %>% 
  group_by(emp_id) %>% 
  mutate(next_wk_sales = lead(sales),
         last_wk_sales = lag(sales)) %>% 
  ungroup() %>% 
  relocate(c(next_wk_sales,last_wk_sales),
           .after = sales) -> dat1

dat1 %>% 
  head(10)

## # A tibble: 10 × 11
##    emp_id  week sales next_wk_sales last_wk_sales min_rank row_numb dense_rank
##     <dbl> <int> <int>         <int>         <int>    <int>    <int>      <int>
##  1 100101     1    10             3            NA        1        1          1
##  2 100101     2     3             6            10        4        4          4
##  3 100101     3     6             8             3        2        2          2
##  4 100101     4     8             0             6        1        1          1
##  5 100101     5     0             9             8        5        5          5
##  6 100101     6     9             9             0        2        2          2
##  7 100101     7     9             5             9        2        2          2
##  8 100101     8     5             2             9        2        2          2
##  9 100101     9     2             6             5        5        5          5
## 10 100101    10     6            NA             2        3        3          2
## # … with 3 more variables: pct_rank <dbl>, cume_dist <dbl>, ntiles_5 <int>

Cumulatives

The following cumulative aggregates are included in either base R or dplyr:

cumsum(): cumulative sum
- e.g. (wk1 = 5, wk2 = 3) -> (cumsum_wk1 = 5, cumsum_wk2 = 7)
cummin(): cumulative minimum value
- e.g. (wk1 = 5, wk2 = 3) -> (cummin_wk1 = 5, cummin_wk2 = 3)
cummax(): cumulative max value
- e.g. (wk1 = 5, wk2 = 3) -> (cummax_wk1 = 5, cummax_wk2 = 5)
cummean(): cumulative average
- e.g. (wk1 = 5, wk2 = 3) -> (cummean_wk1 = 5, cummean_wk2 = 4)
cumall(X): all records until X is False
- e.g. (wk1 = 5, wk2 = 3) -> (cumall(X >= 5) returns wk1 = 5)
cumany(X): all records after X is True
- e.g. (wk1 = 5, wk2 = 3) -> (cumany(X >= 5) returns wk1 = 5, wk2 = 3)

Cumulative Aggregate Examples

cumsum, cummin, cummax, cummean: For each sales employee, calculate the cumulative total, min, max, and avg sale counts:

dat1 %>% 
  group_by(emp_id) %>% 
  mutate(cume_sum = cumsum(sales),
         cume_min = cummin(sales),
         cume_max = cummax(sales),
         cume_mean = round(cummean(sales),2)) %>% 
  relocate(cume_sum:cume_mean,
           .after = sales) -> dat1

dat1 %>% 
  head(5)

## # A tibble: 5 × 15
## # Groups:   emp_id [1]
##   emp_id  week sales cume_sum cume_min cume_max cume_mean next_wk_sales
##    <dbl> <int> <int>    <int>    <int>    <int>     <dbl>         <int>
## 1 100101     1    10       10       10       10     10                3
## 2 100101     2     3       13        3       10      6.5              6
## 3 100101     3     6       19        3       10      6.33             8
## 4 100101     4     8       27        3       10      6.75             0
## 5 100101     5     0       27        0       10      5.4              9
## # … with 7 more variables: last_wk_sales <int>, min_rank <int>, row_numb <int>,
## #   dense_rank <int>, pct_rank <dbl>, cume_dist <dbl>, ntiles_5 <int>

cumany(): For each sales employee, identify the weeks after they dipped below a weekly sale count of 3

dat1 %>% 
  group_by(emp_id) %>% 
  mutate(cumany_below3 = cumany(sales < 3)) %>% 
  relocate(cumany_below3,
           .after = sales) -> dat1

dat1 %>% 
  head(10)

## # A tibble: 10 × 16
## # Groups:   emp_id [1]
##    emp_id  week sales cumany_below3 cume_sum cume_min cume_max cume_mean
##     <dbl> <int> <int> <lgl>            <int>    <int>    <int>     <dbl>
##  1 100101     1    10 FALSE               10       10       10     10   
##  2 100101     2     3 FALSE               13        3       10      6.5 
##  3 100101     3     6 FALSE               19        3       10      6.33
##  4 100101     4     8 FALSE               27        3       10      6.75
##  5 100101     5     0 TRUE                27        0       10      5.4 
##  6 100101     6     9 TRUE                36        0       10      6   
##  7 100101     7     9 TRUE                45        0       10      6.43
##  8 100101     8     5 TRUE                50        0       10      6.25
##  9 100101     9     2 TRUE                52        0       10      5.78
## 10 100101    10     6 TRUE                58        0       10      5.8 
## # … with 8 more variables: next_wk_sales <int>, last_wk_sales <int>,
## #   min_rank <int>, row_numb <int>, dense_rank <int>, pct_rank <dbl>,
## #   cume_dist <dbl>, ntiles_5 <int>

cumall(): For each sales employee, identify the records until they fall below 3 sales in a week:

dat1 %>% 
  group_by(emp_id) %>% 
  mutate(cumall_below3 = cumall(!sales < 3)) %>% 
  relocate(cumall_below3,
           .after = sales) -> dat1

dat1 %>% 
  head(10)

## # A tibble: 10 × 17
## # Groups:   emp_id [1]
##    emp_id  week sales cumall_below3 cumany_below3 cume_sum cume_min cume_max
##     <dbl> <int> <int> <lgl>         <lgl>            <int>    <int>    <int>
##  1 100101     1    10 TRUE          FALSE               10       10       10
##  2 100101     2     3 TRUE          FALSE               13        3       10
##  3 100101     3     6 TRUE          FALSE               19        3       10
##  4 100101     4     8 TRUE          FALSE               27        3       10
##  5 100101     5     0 FALSE         TRUE                27        0       10
##  6 100101     6     9 FALSE         TRUE                36        0       10
##  7 100101     7     9 FALSE         TRUE                45        0       10
##  8 100101     8     5 FALSE         TRUE                50        0       10
##  9 100101     9     2 FALSE         TRUE                52        0       10
## 10 100101    10     6 FALSE         TRUE                58        0       10
## # … with 9 more variables: cume_mean <dbl>, next_wk_sales <int>,
## #   last_wk_sales <int>, min_rank <int>, row_numb <int>, dense_rank <int>,
## #   pct_rank <dbl>, cume_dist <dbl>, ntiles_5 <int>

Thanks for visiting!