My background is Oracle databases and one of the most useful features of SQL in my opinion are windowing or analytic functons. These allow access to data on other rows (LAG will retrieve a value from the previous row for example) or to aggregate data and combine on the current row (SUM will aggregate and then put the total on each row within a window) without complicated correlated sub-queries.
With the dplyr and zoo packages, R replicates the syntax and functionality of these SQL functions. This vignette will show SQL and its’ R equivalent to demonstrate the power and ease of implementing what has become a fundamental building block of SQL
To demonstrate the queries the mtcars dataset will be used. Its structure is shown here.
library(dplyr)
library(zoo)
cars <- mtcars
str(cars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Firstly, SQL provides a number of Aggregate functions to roll up data to a grouped level. For example a query might be to group by the cyl column:
SELECT cyl, AVG(mpg) mpg_mean,
COUNT(*) cnt, SUM(mpg) mpg_sum,
MAX(mpg) mpg_max
FROM cars
GROUP BY cyl;
The equivalent statement in dplyr is quite similar in the elements required:
cars %>% # TABLE Name
group_by(cyl) %>% # GROUP_BY clause
summarise(mpg.mean = mean(mpg), # Aggregate Functions AVG, SUM and COUNT
cnt = n(),
mpg.sum = sum(mpg),
mpg.max = max(mpg))
## # A tibble: 3 x 5
## cyl mpg.mean cnt mpg.sum mpg.max
## <dbl> <dbl> <int> <dbl> <dbl>
## 1 4 26.7 11 293. 33.9
## 2 6 19.7 7 138. 21.4
## 3 8 15.1 14 211. 19.2
The statement above summarises the data to one observation per cyl. Often in SQL you may wish to retain the original rows of data but enhance each row with aggregate data or data from other rows. The SQL below gives a row number to each record and finds the previous (or lag) mpg for each car ordered by the mpg, ht and wt columns, for cars that have 6 cylinders. The window here spans each cyl, so without the WHERE, every row would be shown and the row_number count would restart for each cyl group:
SELECT mpg, cyl, hp, wt, qsec,
ROW_NUMBER() OVER (PARTITION BY cyl ORDER BY mpg, hp, wt) rn,
LAG(mpg) OVER (PARTITION BY cyl ORDER BY mpg, hp, wt) mpg_lag
FROM cars
WHERE cyl = 6
In R dplyr, instead of using the summarise function, mutate is used so that the original observations are preserved but extra columns are added:
cars %>% # TABLE Name
group_by(cyl) %>% # PARTITION BY clause
arrange(mpg, hp, wt) %>% # ORDER BY clause
mutate(rn = row_number(), # Analytic functions ROW_NUMBER and LAG
mpg.lag = lag(mpg)) %>%
ungroup() %>%
# SELECT clause
select(mpg, cyl, hp, wt, qsec, rn, mpg.lag) %>%
# WHERE clause
filter(cyl == 6)
## # A tibble: 7 x 7
## mpg cyl hp wt qsec rn mpg.lag
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 17.8 6 123 3.44 18.9 1 NA
## 2 18.1 6 105 3.46 20.2 2 17.8
## 3 19.2 6 123 3.44 18.3 3 18.1
## 4 19.7 6 175 2.77 15.5 4 19.2
## 5 21 6 110 2.62 16.5 5 19.7
## 6 21 6 110 2.88 17.0 6 21
## 7 21.4 6 110 3.22 19.4 7 21
Finally, Aggregate Analytic functions return summary columns on each row, so let’s add the number of observations, maximum mpg, average qsec and the rolling average qsec within cars that have 6 cylinders. Then instead of ordering by mpg, hp and wt, let’s order by drat to find the next (lead) mpg as well. The SQL would be:
SELECT mpg, cyl, hp, wt, qsec,
ROW_NUMBER() OVER (PARTITION BY cyl ORDER BY mpg, hp, wt) rn,
LAG(mpg) OVER (PARTITION BY cyl ORDER BY mpg, hp, wt) mpg_lag,
MAX(mpg) OVER (PARTITION BY cyl) mpg_max,
AVG(qsec) OVER (PARTITION BY cyl) qsecc_mean,
AVG(qsec) OVER (PARTITION BY cyl ORDER BY mpg, hp, wt
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) qsec_cummean,
drat,
LEAD(mpg) OVER (PARTITION BY cyl ORDER BY drat) mpg_lead
FROM cars
WHERE cyl = 6
Note that for the cumulative mean, SQL needed the RANGE BETWEEN statement to define the window as from the start of the window (the first row per cyl) to the current row. The zoo package provides a cummean function that does the same thing.
cars %>% # TABLE Name
group_by(cyl) %>% # PARTITION BY clause
arrange(mpg, hp, wt) %>% # ORDER BY clause
mutate(rn = row_number(), # Analytic Functions
mpg.lag = lag(mpg),
cnt = n(),
mpg.max = max(mpg),
qsec.mean = mean(qsec),
qsec.cummean = cummean(qsec)) %>%
arrange(drat, qsec) %>% # New ORDER BY Clause
mutate(mpg.lead = lead(mpg)) %>% # LEAD analytic function
arrange(rn) %>%
ungroup() %>%
# SELECT clause
select(mpg, cyl, hp, wt, qsec, rn, mpg.lag, mpg.max, qsec.mean,
qsec.cummean, drat, mpg.lead) %>%
# WHERE clause
filter(cyl == 6)
## # A tibble: 7 x 12
## mpg cyl hp wt qsec rn mpg.lag mpg.max qsec.mean qsec.cummean drat mpg.lead
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 17.8 6 123 3.44 18.9 1 NA 21.4 18.0 18.9 3.92 NA
## 2 18.1 6 105 3.46 20.2 2 17.8 21.4 18.0 19.6 2.76 21.4
## 3 19.2 6 123 3.44 18.3 3 18.1 21.4 18.0 19.1 3.92 17.8
## 4 19.7 6 175 2.77 15.5 4 19.2 21.4 18.0 18.2 3.62 21
## 5 21 6 110 2.62 16.5 5 19.7 21.4 18.0 17.9 3.9 21
## 6 21 6 110 2.88 17.0 6 21 21.4 18.0 17.7 3.9 19.2
## 7 21.4 6 110 3.22 19.4 7 21 21.4 18.0 18.0 3.08 19.7
The dplyr package offers a number of functions to mirror what SQL does. Further functions that can be used in this structure are inner_join, left_join and full_join for joining datasets together. So, transitioning between SQL and R is possible, keeping similar command structures.