My background is Oracle databases and one of the most useful features of SQL in my opinion are windowing or analytic functons. These allow access to data on other rows (LAG will retrieve a value from the previous row for example) or to aggregate data and combine on the current row (SUM will aggregate and then put the total on each row within a window) without complicated correlated sub-queries.

With the dplyr and zoo packages, R replicates the syntax and functionality of these SQL functions. This vignette will show SQL and its’ R equivalent to demonstrate the power and ease of implementing what has become a fundamental building block of SQL

MtCars Dataset

To demonstrate the queries the mtcars dataset will be used. Its structure is shown here.

library(dplyr)
library(zoo)
cars <- mtcars
str(cars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Aggregate SQL

Firstly, SQL provides a number of Aggregate functions to roll up data to a grouped level. For example a query might be to group by the cyl column:

SELECT cyl, AVG(mpg) mpg_mean, 
       COUNT(*) cnt, SUM(mpg) mpg_sum, 
       MAX(mpg) mpg_max
FROM   cars
GROUP BY cyl;

The equivalent statement in dplyr is quite similar in the elements required:

cars %>%                           # TABLE Name
  group_by(cyl) %>%                # GROUP_BY clause
  summarise(mpg.mean = mean(mpg),  # Aggregate Functions AVG, SUM and COUNT
            cnt = n(),
            mpg.sum = sum(mpg),
            mpg.max = max(mpg))
## # A tibble: 3 x 5
##     cyl mpg.mean   cnt mpg.sum mpg.max
##   <dbl>    <dbl> <int>   <dbl>   <dbl>
## 1     4     26.7    11    293.    33.9
## 2     6     19.7     7    138.    21.4
## 3     8     15.1    14    211.    19.2

Analytic and Windowing SQL

The statement above summarises the data to one observation per cyl. Often in SQL you may wish to retain the original rows of data but enhance each row with aggregate data or data from other rows. The SQL below gives a row number to each record and finds the previous (or lag) mpg for each car ordered by the mpg, ht and wt columns, for cars that have 6 cylinders. The window here spans each cyl, so without the WHERE, every row would be shown and the row_number count would restart for each cyl group:

SELECT mpg, cyl, hp, wt, qsec, 
       ROW_NUMBER() OVER (PARTITION BY cyl ORDER BY mpg, hp, wt) rn,
       LAG(mpg) OVER (PARTITION BY cyl ORDER BY mpg, hp, wt) mpg_lag
FROM cars
WHERE cyl = 6

In R dplyr, instead of using the summarise function, mutate is used so that the original observations are preserved but extra columns are added:

cars %>%                    # TABLE Name
  group_by(cyl) %>%         # PARTITION BY clause
  arrange(mpg, hp, wt) %>%  # ORDER BY clause 
  mutate(rn = row_number(), # Analytic functions ROW_NUMBER and LAG
         mpg.lag = lag(mpg)) %>%
  ungroup() %>%
  # SELECT clause
  select(mpg, cyl, hp, wt, qsec, rn, mpg.lag) %>%
  # WHERE clause
  filter(cyl == 6)
## # A tibble: 7 x 7
##     mpg   cyl    hp    wt  qsec    rn mpg.lag
##   <dbl> <dbl> <dbl> <dbl> <dbl> <int>   <dbl>
## 1  17.8     6   123  3.44  18.9     1    NA  
## 2  18.1     6   105  3.46  20.2     2    17.8
## 3  19.2     6   123  3.44  18.3     3    18.1
## 4  19.7     6   175  2.77  15.5     4    19.2
## 5  21       6   110  2.62  16.5     5    19.7
## 6  21       6   110  2.88  17.0     6    21  
## 7  21.4     6   110  3.22  19.4     7    21

Finally, Aggregate Analytic functions return summary columns on each row, so let’s add the number of observations, maximum mpg, average qsec and the rolling average qsec within cars that have 6 cylinders. Then instead of ordering by mpg, hp and wt, let’s order by drat to find the next (lead) mpg as well. The SQL would be:

SELECT mpg, cyl, hp, wt, qsec, 
       ROW_NUMBER() OVER (PARTITION BY cyl ORDER BY mpg, hp, wt) rn,
       LAG(mpg) OVER (PARTITION BY cyl ORDER BY mpg, hp, wt) mpg_lag,
       MAX(mpg) OVER (PARTITION BY cyl) mpg_max,
       AVG(qsec) OVER (PARTITION BY cyl) qsecc_mean,
       AVG(qsec) OVER (PARTITION BY cyl ORDER BY mpg, hp, wt 
                RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) qsec_cummean,
       drat,
       LEAD(mpg) OVER (PARTITION BY cyl ORDER BY drat) mpg_lead       
FROM cars
WHERE cyl = 6

Note that for the cumulative mean, SQL needed the RANGE BETWEEN statement to define the window as from the start of the window (the first row per cyl) to the current row. The zoo package provides a cummean function that does the same thing.

cars %>%                    # TABLE Name
  group_by(cyl) %>%         # PARTITION BY clause
  arrange(mpg, hp, wt) %>%  # ORDER BY clause 
  mutate(rn = row_number(), # Analytic Functions
         mpg.lag = lag(mpg),
         cnt = n(), 
         mpg.max = max(mpg),
         qsec.mean = mean(qsec),
         qsec.cummean = cummean(qsec)) %>%
  arrange(drat, qsec) %>%          # New ORDER BY Clause
  mutate(mpg.lead = lead(mpg)) %>% # LEAD analytic function
  arrange(rn) %>%
  ungroup() %>%
  # SELECT clause
  select(mpg, cyl, hp, wt, qsec, rn, mpg.lag, mpg.max, qsec.mean, 
         qsec.cummean, drat, mpg.lead) %>%
  # WHERE clause
  filter(cyl == 6)
## # A tibble: 7 x 12
##     mpg   cyl    hp    wt  qsec    rn mpg.lag mpg.max qsec.mean qsec.cummean  drat mpg.lead
##   <dbl> <dbl> <dbl> <dbl> <dbl> <int>   <dbl>   <dbl>     <dbl>        <dbl> <dbl>    <dbl>
## 1  17.8     6   123  3.44  18.9     1    NA      21.4      18.0         18.9  3.92     NA  
## 2  18.1     6   105  3.46  20.2     2    17.8    21.4      18.0         19.6  2.76     21.4
## 3  19.2     6   123  3.44  18.3     3    18.1    21.4      18.0         19.1  3.92     17.8
## 4  19.7     6   175  2.77  15.5     4    19.2    21.4      18.0         18.2  3.62     21  
## 5  21       6   110  2.62  16.5     5    19.7    21.4      18.0         17.9  3.9      21  
## 6  21       6   110  2.88  17.0     6    21      21.4      18.0         17.7  3.9      19.2
## 7  21.4     6   110  3.22  19.4     7    21      21.4      18.0         18.0  3.08     19.7

Conclusion

The dplyr package offers a number of functions to mirror what SQL does. Further functions that can be used in this structure are inner_join, left_join and full_join for joining datasets together. So, transitioning between SQL and R is possible, keeping similar command structures.