Tristan Mahr, @tjmahr
March 18, 2015
Madison R Users Group
Repository for this talk: https://github.com/tjmahr/MadR_Pipelines
I'm interested in:
I tackle these goals by building a problem-specific language from simpler, understandable pieces of code.
(But see also Best Practices for Scientific Computing for more tools and strategies.)
dplyr, stringr,
lubridate, broom, tidyr, rvest
a_to_f <- head(letters)
a_to_f
tail(letters)
seq_along(a_to_f)
xs <- seq_len(10)
xs
ifelse(xs %% 2 == 0, xs, NA)
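For reference, a short sketch of what these base R helpers return. The names `u_to_z`, `idx`, and `evens` are introduced here only for illustration.

```r
# head() and tail() return the first and last six elements by default
a_to_f <- head(letters)   # "a" "b" "c" "d" "e" "f"
u_to_z <- tail(letters)   # "u" "v" "w" "x" "y" "z"

# seq_along() indexes an existing vector; seq_len() counts from 1 to n
idx <- seq_along(a_to_f)  # 1 2 3 4 5 6
xs <- seq_len(10)         # 1 2 3 4 5 6 7 8 9 10

# ifelse() is vectorized: keep the even values, replace odd ones with NA
evens <- ifelse(xs %% 2 == 0, xs, NA)
# NA 2 NA 4 NA 6 NA 8 NA 10
```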
Solve a problem once, then re-use that solution elsewhere.
# Squish values into a range
squish <- function(xs, lower, upper) {
xs[xs < lower] <- lower
xs[upper < xs] <- upper
xs
}
squish(rnorm(5), -.3, 1)
[1] 1.0000000 -0.3000000 0.7980856
[4] -0.3000000 0.2153025
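Because rnorm() output changes from run to run, a fixed input makes the behavior easier to check. The sketch below is illustrative only: the vector `fixed` is made up, and the pmin()/pmax() comparison is an alternative way to clamp values, not part of the original function.

```r
# Squish values into a range (same definition as above)
squish <- function(xs, lower, upper) {
  xs[xs < lower] <- lower
  xs[upper < xs] <- upper
  xs
}

# Hypothetical fixed input: one value below, two inside, one above the range
fixed <- c(-2, 0, 0.5, 3)
squished <- squish(fixed, -0.3, 1)
# -0.3  0.0  0.5  1.0

# Assumption: pmin()/pmax() as an alternative clamp, used only as a cross-check
clamped <- pmin(pmax(fixed, -0.3), 1)
same <- identical(squished, clamped)
# TRUE
```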
Create my own problem-specific language.
# Insert values into the second-to-last
# position, in case last one is a delimiter
insert_line <- function(xs, ys) {
c(but_last(xs), ys, last(xs))
}
but_last <- function(...) head(..., n = -1)
last <- function(...) tail(..., n = 1)
insert_line(c("x", "y", "z"), "&")
[1] "x" "y" "&" "z"
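A sketch of the delimiter case mentioned in the comment, assuming some made-up lines of generated code whose final line closes a block. The vector `code_lines` is hypothetical.

```r
# Helpers from above: everything but the last element, and the last element
but_last <- function(...) head(..., n = -1)
last <- function(...) tail(..., n = 1)

insert_line <- function(xs, ys) {
  c(but_last(xs), ys, last(xs))
}

# Hypothetical generated code; the closing brace has to stay last
code_lines <- c("f <- function(x) {", "  y <- x + 1", "}")

# The new line lands just before the closing brace
updated <- insert_line(code_lines, "  y * 2")
# "f <- function(x) {" "  y <- x + 1" "  y * 2" "}"
```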
Here's a function adapted from the strsplit help page.
mystery_func <- function(xs) {
sapply(lapply(strsplit(xs, NULL), rev),
paste, collapse = "")
}
Okay, maybe it would help if it were built out of understandable chunks.
“Extract function” refactoring
str_tokenize <- function(xs) {
strsplit(xs, split = NULL)
}
str_collapse <- function(..., joiner = "") {
paste(..., collapse = joiner)
}
mystery_func <- function(xs) {
sapply(lapply(str_tokenize(xs), rev),
str_collapse)
}
Okay, maybe it would help if it weren't a one-liner.
Do one thing per line.
mystery_func <- function(xs) {
char_sets <- str_tokenize(xs)
char_sets_rev <- lapply(char_sets, rev)
sapply(char_sets_rev, str_collapse)
}
Pretty good, but now we have these intermediate values we don't care about cluttering things up.
A way to express successive data transformations.
Use the value on the left-hand side as the first argument to the function on the right-hand side.
library("magrittr")
# Rule 1
f(xs)
xs %>% f
# Rule 2
g(xs, n = 5)
xs %>% g(n = 5)
Do function composition by chaining pipes together.
# Rule 3
g(f(xs), n = 5)
xs %>% f %>% g(n = 5)
Mentally, read %>% as “then”.
Take xs, then do f, then do g with n = 5.
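The three rules can be checked with real functions standing in for f and g. This is a minimal sketch that assumes sort() plays the role of f and head() the role of g; the input vector is made up.

```r
library("magrittr")

xs <- c(3, 1, 2, 5, 4)

# Rule 1: f(xs) is the same as xs %>% f
rule1 <- identical(sort(xs), xs %>% sort)

# Rule 2: g(xs, n = 2) is the same as xs %>% g(n = 2)
rule2 <- identical(head(xs, n = 2), xs %>% head(n = 2))

# Rule 3: nested calls read inside-out; the pipeline reads left to right
nested <- head(sort(xs), n = 2)
piped <- xs %>% sort %>% head(n = 2)
# Both give 1 2
```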
xs <- rnorm(5)
squish(sort(round(xs, 2)), -.3, 1)
[1] -0.30 -0.30 -0.27 0.20 0.68
xs %>% round(2) %>% sort %>% squish(-.3, 1)
[1] -0.30 -0.30 -0.27 0.20 0.68
On the command line:
sort data.csv | uniq -u | wc -l
# 369 (number of unique lines)
In Python (pandas):
df.groupby(['letter','one']).sum()
In JavaScript (jQuery):
$("#p1")
.css("color", "red")
.slideUp(2000)
.slideDown(2000);
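For comparison, the shell pipeline above could be mirrored in R with magrittr. This is only a sketch: `csv_lines` is a made-up stand-in for the contents of data.csv, and equals() is magrittr's alias for `==`. Like `uniq -u`, it counts only the lines that are never repeated.

```r
library("magrittr")

# Hypothetical file contents standing in for data.csv
csv_lines <- c("b,2", "a,1", "b,2", "c,3")

# sort, tally each distinct line, keep those seen exactly once, count them
n_unrepeated <- csv_lines %>%
  sort %>%
  table %>%
  equals(1) %>%
  sum
# 2 ("a,1" and "c,3" each appear once)
```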
mystery_func <- function(xs) {
str_tokenize(xs) %>%
lapply(rev) %>%
sapply(str_collapse)
}
words <- c("The", "quick", "brown", "fox")
mystery_func(words)
[1] "ehT" "kciuq" "nworb" "xof"
Use . as an argument placeholder.
# Rule 4
f(y, x)
x %>% f(y, .)
# Rule 5
f(y, z = x)
x %>% f(y, z = .)
Placeholder says where the piped input should land.
words %>% paste0("~~", ., "~~") %>% toupper
[1] "~~THE~~" "~~QUICK~~" "~~BROWN~~"
[4] "~~FOX~~"
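A small sketch of how the placeholder changes where the input lands. The strings are arbitrary, and gsub() is chosen here only because it takes its data as the third argument.

```r
library("magrittr")

words <- c("The", "quick", "brown", "fox")

# Without a placeholder, the input becomes the first argument
first_arg <- "b" %>% paste("a")       # paste("b", "a") gives "b a"

# With a placeholder, the input lands where the dot is
second_arg <- "b" %>% paste("a", .)   # paste("a", "b") gives "a b"

# gsub(pattern, replacement, x): the dot sends words to x
zeroed <- words %>% gsub("o", "0", .)
# "The" "quick" "br0wn" "f0x"
```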
# As a named parameter
library("broom")
mtcars %>% lm(mpg ~ cyl * wt, data = .) %>%
tidy %>% print(digits = 2)
term estimate std.error statistic
1 (Intercept) 54.31 6.13 8.9
2 cyl -3.80 1.01 -3.8
3 wt -8.66 2.32 -3.7
4 cyl:wt 0.81 0.33 2.5
p.value
1 1.3e-09
2 7.5e-04
3 8.6e-04
4 2.0e-02
The input to the pipeline can itself be a placeholder!
num_unique <- . %>% unique %>% length
In this case, the pipeline describes a function chain that can be saved and re-used. It also has a different print method.
num_unique
Functional sequence with the following components:
1. unique(.)
2. length(.)
Use 'functions' to extract the individual functions.
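A sketch of how the saved pipeline behaves like an ordinary function. The comparison function `num_unique_plain` and the input `values` are introduced here for illustration only.

```r
library("magrittr")

# A pipeline that starts with the placeholder is a reusable function
num_unique <- . %>% unique %>% length

# Hypothetical hand-written equivalent, for comparison
num_unique_plain <- function(xs) length(unique(xs))

values <- c(1, 1, 2, 3, 3)
n_piped <- num_unique(values)        # 3
n_plain <- num_unique_plain(values)  # 3

# functions() returns the individual steps as a list
steps <- functions(num_unique)
```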
mystery_func <- . %>%
str_tokenize %>%
lapply(rev) %>%
sapply(str_collapse)
mystery_func(words)
[1] "ehT" "kciuq" "nworb" "xof"
The pipe %>% and placeholder . cover 90% of magrittr.
What didn't I cover?
Aliases for operators (such as [, $, [[)
%<>% (sugar for x <- x %>% ... )
%T>% (print, plot, save results during pipeline without interrupting the flow of data)
%$% (like with as an infix)
See the magrittr vignette.
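A brief sketch of the operators listed above, using made-up inputs and the built-in mtcars data. extract2() is the magrittr alias for `[[`.

```r
library("magrittr")

xs <- c(4, 9, 16)

# %<>% pipes xs through sqrt and assigns the result back to xs
xs %<>% sqrt
# xs is now 2 3 4

# %T>% runs str() for its side effect, then passes xs (not str's NULL) along
total <- xs %T>% str %>% sum
# 9

# %$% exposes the columns of a data frame by name, like with()
mpg_wt_cor <- mtcars %$% cor(mpg, wt)

# Alias for [[ so that extraction reads naturally inside a pipeline
mpg_col <- mtcars %>% extract2("mpg")
```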
Basic scheme:
Next set of slides: dplyr for data-frames.