Gordon Goodwin
This tutorial will illustrate how to use window functions with dplyr. I highly recommend you also check out the dplyr window function vignette, which this guide follows.
A window function is a type of aggregation function that takes in n inputs and also returns n values. In this way, it differs from summary functions that take in n inputs but only return 1 value (e.g. mean, median, sum, etc.)
We’ll cover 3 main types of window fns in this tutorial:
Let’s first create a sample dataset consisting of 10 weeks worth of sales data for 5 different sales employees:
set.seed(999)
<- tibble(emp_id = rep(c(100101,102423,105043,102341,111302),
dat1 rep(10, 5)),
week = rep(1:10,5),
sales = sample(0:10,50,T),
calls = sample(100:300,50,T),
mgr = c(rep("Bob",20),rep("Sally",20),rep("Omar",10)))
%>%
dat1 arrange(week,desc(sales)) %>%
head(5)
## # A tibble: 5 × 5
## emp_id week sales calls mgr
## <dbl> <int> <int> <int> <chr>
## 1 100101 1 10 186 Bob
## 2 105043 1 9 222 Sally
## 3 111302 1 9 223 Omar
## 4 102341 1 8 132 Sally
## 5 102423 1 7 116 Bob
Ranking and ordering functions take a vector to order by, and return various types of ranks. Importantly, they all map smallest inputs to smallest outputs, which means that the largest value will receive the largest rank. As we’ll see, this can be reversed with desc().
Within each week, rank/order the sales employees by their weekly sale count from highest to lowest sales:
%>%
dat1 group_by(week) %>%
mutate(min_rank = min_rank(desc(sales)),
row_numb = row_number(desc(sales)),
dense_rank = dense_rank(desc(sales)),
pct_rank = percent_rank((sales)),
cume_dist = cume_dist((sales)),
ntiles_5 = ntile(desc(sales),n = 5)) %>%
select(emp_id,week,sales,min_rank:ntiles_5) %>%
arrange(week, desc(sales)) -> dat1
%>%
dat1 head(5)
## # A tibble: 5 × 9
## # Groups: week [1]
## emp_id week sales min_rank row_numb dense_rank pct_rank cume_dist ntiles_5
## <dbl> <int> <int> <int> <int> <int> <dbl> <dbl> <int>
## 1 100101 1 10 1 1 1 1 1 1
## 2 105043 1 9 2 2 2 0.5 0.8 2
## 3 111302 1 9 2 3 2 0.5 0.8 3
## 4 102341 1 8 4 4 3 0.25 0.4 4
## 5 102423 1 7 5 5 4 0 0.2 5
Offset fns find the next (lead) and previous (lag) values in a vector relative to the current value.
For each sales employee and each of their sales weeks, record the employee’s sales for the previous and next weeks:
%>%
dat1 arrange(emp_id,week) %>%
group_by(emp_id) %>%
mutate(next_wk_sales = lead(sales),
last_wk_sales = lag(sales)) %>%
ungroup() %>%
relocate(c(next_wk_sales,last_wk_sales),
.after = sales) -> dat1
%>%
dat1 head(10)
## # A tibble: 10 × 11
## emp_id week sales next_wk_sales last_wk_sales min_rank row_numb dense_rank
## <dbl> <int> <int> <int> <int> <int> <int> <int>
## 1 100101 1 10 3 NA 1 1 1
## 2 100101 2 3 6 10 4 4 4
## 3 100101 3 6 8 3 2 2 2
## 4 100101 4 8 0 6 1 1 1
## 5 100101 5 0 9 8 5 5 5
## 6 100101 6 9 9 0 2 2 2
## 7 100101 7 9 5 9 2 2 2
## 8 100101 8 5 2 9 2 2 2
## 9 100101 9 2 6 5 5 5 5
## 10 100101 10 6 NA 2 3 3 2
## # … with 3 more variables: pct_rank <dbl>, cume_dist <dbl>, ntiles_5 <int>
The following cumulative aggregates are included in either base R or dplyr:
%>%
dat1 group_by(emp_id) %>%
mutate(cume_sum = cumsum(sales),
cume_min = cummin(sales),
cume_max = cummax(sales),
cume_mean = round(cummean(sales),2)) %>%
relocate(cume_sum:cume_mean,
.after = sales) -> dat1
%>%
dat1 head(5)
## # A tibble: 5 × 15
## # Groups: emp_id [1]
## emp_id week sales cume_sum cume_min cume_max cume_mean next_wk_sales
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 100101 1 10 10 10 10 10 3
## 2 100101 2 3 13 3 10 6.5 6
## 3 100101 3 6 19 3 10 6.33 8
## 4 100101 4 8 27 3 10 6.75 0
## 5 100101 5 0 27 0 10 5.4 9
## # … with 7 more variables: last_wk_sales <int>, min_rank <int>, row_numb <int>,
## # dense_rank <int>, pct_rank <dbl>, cume_dist <dbl>, ntiles_5 <int>
%>%
dat1 group_by(emp_id) %>%
mutate(cumany_below3 = cumany(sales < 3)) %>%
relocate(cumany_below3,
.after = sales) -> dat1
%>%
dat1 head(10)
## # A tibble: 10 × 16
## # Groups: emp_id [1]
## emp_id week sales cumany_below3 cume_sum cume_min cume_max cume_mean
## <dbl> <int> <int> <lgl> <int> <int> <int> <dbl>
## 1 100101 1 10 FALSE 10 10 10 10
## 2 100101 2 3 FALSE 13 3 10 6.5
## 3 100101 3 6 FALSE 19 3 10 6.33
## 4 100101 4 8 FALSE 27 3 10 6.75
## 5 100101 5 0 TRUE 27 0 10 5.4
## 6 100101 6 9 TRUE 36 0 10 6
## 7 100101 7 9 TRUE 45 0 10 6.43
## 8 100101 8 5 TRUE 50 0 10 6.25
## 9 100101 9 2 TRUE 52 0 10 5.78
## 10 100101 10 6 TRUE 58 0 10 5.8
## # … with 8 more variables: next_wk_sales <int>, last_wk_sales <int>,
## # min_rank <int>, row_numb <int>, dense_rank <int>, pct_rank <dbl>,
## # cume_dist <dbl>, ntiles_5 <int>
%>%
dat1 group_by(emp_id) %>%
mutate(cumall_below3 = cumall(!sales < 3)) %>%
relocate(cumall_below3,
.after = sales) -> dat1
%>%
dat1 head(10)
## # A tibble: 10 × 17
## # Groups: emp_id [1]
## emp_id week sales cumall_below3 cumany_below3 cume_sum cume_min cume_max
## <dbl> <int> <int> <lgl> <lgl> <int> <int> <int>
## 1 100101 1 10 TRUE FALSE 10 10 10
## 2 100101 2 3 TRUE FALSE 13 3 10
## 3 100101 3 6 TRUE FALSE 19 3 10
## 4 100101 4 8 TRUE FALSE 27 3 10
## 5 100101 5 0 FALSE TRUE 27 0 10
## 6 100101 6 9 FALSE TRUE 36 0 10
## 7 100101 7 9 FALSE TRUE 45 0 10
## 8 100101 8 5 FALSE TRUE 50 0 10
## 9 100101 9 2 FALSE TRUE 52 0 10
## 10 100101 10 6 FALSE TRUE 58 0 10
## # … with 9 more variables: cume_mean <dbl>, next_wk_sales <int>,
## # last_wk_sales <int>, min_rank <int>, row_numb <int>, dense_rank <int>,
## # pct_rank <dbl>, cume_dist <dbl>, ntiles_5 <int>
Thanks for visiting!