Report 1. Pipes & dplyr.

Piping and more

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Pipes

The operators pipe their left-hand side values forward into expressions that appear on the right-hand side, i.e. one can replace f(x) with x %>% f(), where %>% is the (main) pipe-operator. When coupling several function calls with the pipe-operator, the benefit will become more apparent.

the_data <-
  read.csv('/path/to/data/file.csv') %>%
  subset(variable_a > x) %>%
  transform(variable_c = variable_a/variable_b) %>%
  head(100)

Four operations are performed to arrive at the desired data set, and they are written in a natural order: the same as the order of execution. Also, no temporary variables are needed. If yet another operation is required, it is straightforward to add to the sequence of operations wherever it may be needed.

Assignment Pipe

The assignment pipe, %<>%, is used to update a value by first piping it into one or more rhs expressions, and then assigning the result.

x <- rnorm(100)
x %<>% abs %>% sort
head(x)  # it works!

## [1] 0.01890394 0.03808962 0.04359587 0.04718125 0.04988122 0.05349861

Pipe

Pipe an object forward into a function or call expression.

# Basic use:
iris %>% head

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Exposition Pipe

Many functions accept a data argument, e.g. lm and aggregate, which is very useful in a pipeline where data is first processed and then passed into such a function. There are also functions that do not have a data argument, for which it is useful to expose the variables in the data. This is done with the %$% operator:

iris %>%
subset(Sepal.Length > mean(Sepal.Length)) %$%
cor(Sepal.Length, Sepal.Width)

## [1] 0.3361992

Tee Pipe

The tee operator works exactly like %>%, but it returns the left-hand side value rather than the potential result of the right-hand side operations.

This means that the tee operator can come in handy in situations where you have included functions that are used for their side effect, such as plotting with plot() or printing to a file.

In other words, functions like plot() typically don’t return anything. That means that, after calling plot(), for example, your pipeline would end. However, in the following example, the tee operator %T>% allows you to continue your pipeline even after you have used plot():

set.seed(123)
rnorm(200) %>%
matrix(ncol = 2) %T>%
plot %>% 
colSums

## [1]   9.040591 -10.754680

dplyr

In this section, you will discover how exciting it can be when you combine both packages in your R code.

For those of you who are new to the dplyr package, you should know that this R package was built around five verbs, namely, “select”, “filter”, “arrange”, “mutate” and “summarize”. If you have already manipulated data for some data science project, you will know that these verbs make up the majority of the data manipulation tasks that you generally need to perform on your data.

Take an example of some traditional code that makes use of these dplyr functions:

grouped_flights <- group_by(hflights, Year, Month, DayofMonth)
flights_data <- select(grouped_flights, Year:DayofMonth, ArrDelay, DepDelay)
summarized_flights <- summarise(flights_data, 
                arr = mean(ArrDelay, na.rm = TRUE), 
                dep = mean(DepDelay, na.rm = TRUE))
final_result <- filter(summarized_flights, arr > 30 | dep > 30)

final_result

## # A tibble: 14 × 5
## # Groups:   Year, Month [10]
##     Year Month DayofMonth   arr   dep
##    <int> <int>      <int> <dbl> <dbl>
##  1  2011     2          4  44.1  47.2
##  2  2011     3          3  35.1  38.2
##  3  2011     3         14  46.6  36.1
##  4  2011     4          4  38.7  27.9
##  5  2011     4         25  37.8  22.3
##  6  2011     5         12  69.5  64.5
##  7  2011     5         20  37.0  26.6
##  8  2011     6         22  65.5  62.3
##  9  2011     7         29  29.6  31.9
## 10  2011     9         29  39.2  32.5
## 11  2011    10          9  61.9  59.5
## 12  2011    11         15  43.7  39.2
## 13  2011    12         29  26.3  30.8
## 14  2011    12         31  46.5  54.2

When you look at this example, you immediately understand why dplyr and magrittr are able to work so well together:

hflights %>% 
    group_by(Year, Month, DayofMonth) %>% 
    select(Year:DayofMonth, ArrDelay, DepDelay) %>% 
    summarise(arr = mean(ArrDelay, na.rm = TRUE), dep = mean(DepDelay, na.rm = TRUE)) %>% 
    filter(arr > 30 | dep > 30)

## # A tibble: 14 × 5
## # Groups:   Year, Month [10]
##     Year Month DayofMonth   arr   dep
##    <int> <int>      <int> <dbl> <dbl>
##  1  2011     2          4  44.1  47.2
##  2  2011     3          3  35.1  38.2
##  3  2011     3         14  46.6  36.1
##  4  2011     4          4  38.7  27.9
##  5  2011     4         25  37.8  22.3
##  6  2011     5         12  69.5  64.5
##  7  2011     5         20  37.0  26.6
##  8  2011     6         22  65.5  62.3
##  9  2011     7         29  29.6  31.9
## 10  2011     9         29  39.2  32.5
## 11  2011    10          9  61.9  59.5
## 12  2011    11         15  43.7  39.2
## 13  2011    12         29  26.3  30.8
## 14  2011    12         31  46.5  54.2

Both code chunks are fairly long, but you could argue that the second code chunk is more clear if you want to follow along through all of the operations. With the creation of intermediate variables in the first code chunk, you could possibly lose the “flow” of the code. By using %>%, you gain a more clear overview of the operations that are being performed on the data!

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

mutate() adds new variables that are functions of existing variables
select() picks variables based on their names.
filter() picks cases based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.

These all combine naturally with group_by() which allows you to perform any operation “by group”.

To practice “dplyr” skills, we will use the flights data set from the nycflights13 package.

This data frame comes from the US Bureau of Transportation Statistics and contains all 336,776 flights that departed from New York City in 2013.

It is documented in ?flights.

Filter

Read the dataframe “flights”, then filter only flights from 1st of January. Save the output to the new dataframe called “mydata”.

Filter flights operated by United (UA), American (AA), or Delta (DL):

Another useful dplyr filtering helper is between(). Filter flights departed between midnight and 6am (inclusive). Don’t forget flights that left at exactly midnight (2400).

Arrange

We also have need to make sure the data is ordered in a certain manner. This can be easily done in R with the arrange() function.

Again we can do this in base R but this is not always a clear path…

flights[order(flights$dep_delay), c("carrier", "dep_delay")]

## # A tibble: 336,776 × 2
##    carrier dep_delay
##    <chr>       <dbl>
##  1 B6            -43
##  2 DL            -33
##  3 EV            -32
##  4 DL            -30
##  5 F9            -27
##  6 MQ            -26
##  7 EV            -25
##  8 MQ            -25
##  9 9E            -24
## 10 B6            -24
## # ℹ 336,766 more rows

Now, it’s your turn, to use “arrange” instead for the above base-R example:

flights %>%
   arrange(dep_delay)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12     7     2040           2123       -43       40           2352
##  2  2013     2     3     2022           2055       -33     2240           2338
##  3  2013    11    10     1408           1440       -32     1549           1559
##  4  2013     1    11     1900           1930       -30     2233           2243
##  5  2013     1    29     1703           1730       -27     1947           1957
##  6  2013     8     9      729            755       -26     1002            955
##  7  2013    10    23     1907           1932       -25     2143           2143
##  8  2013     3    30     2030           2055       -25     2213           2250
##  9  2013     3     2     1431           1455       -24     1601           1631
## 10  2013     5     5      934            958       -24     1225           1309
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Mutate

mutate() returns a new data frame that contains the new variables appended to a copy of the original data set.

Note that when you use mutate() you can create multiple variables at once, and you can even refer to variables that are created earlier in the call to create other variables later in the call:

mutate(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

## # A tibble: 336,776 × 22
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 14 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>, hours <dbl>,
## #   gain_per_hour <dbl>

Transmute

mutate() will always return the new variables appended to a copy of the original data. If you want to return only the new variables, use transmute(). In the code below, replace mutate() with transmute() and then spot the difference in the results.

mutate(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

## # A tibble: 336,776 × 22
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 14 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>, hours <dbl>,
## #   gain_per_hour <dbl>

Find the 10 most delayed flights (dep_delay) using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank(). Hint: Once you compute a rank, you can filter the data set based on the ranks.

flights %>%
  arrange(-dep_delay) %>%
  head(10)

## # A tibble: 10 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Summarise

summarise() collapses a data frame to a single row of summaries. You get to choose how many summaries appear in the row and how they are computed:

summarise(flights, delay = mean(dep_delay, na.rm = TRUE), 
                  total =  sum(dep_delay, na.rm = TRUE))

## # A tibble: 1 × 2
##   delay   total
##   <dbl>   <dbl>
## 1  12.6 4152200

summarise() is not terribly useful unless you pair it with group_by().

group_by() changes the unit of analysis of the data frame: it assigns observations in the data frame to separate groups, and it instructs dplyr to apply functions separately to each group. group_by() assigns groups by grouping together observations that have the same combinations of values for the variables that you pass to group_by().

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE),
                  total = sum(dep_delay, na.rm = TRUE))

## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.

## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day delay total
##    <int> <int> <int> <dbl> <dbl>
##  1  2013     1     1 11.5   9678
##  2  2013     1     2 13.9  12958
##  3  2013     1     3 11.0   9933
##  4  2013     1     4  8.95  8137
##  5  2013     1     5  5.73  4110
##  6  2013     1     6  7.15  5940
##  7  2013     1     7  5.42  5038
##  8  2013     1     8  2.55  2285
##  9  2013     1     9  2.28  2042
## 10  2013     1    10  2.84  2643
## # ℹ 355 more rows

Which carrier has the worst delays?

Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint1: think about flights %>% group_by(carrier, dest) %>% summarise(n())).

Hint2: Use min_rank(desc(avg_delay)) to rank avg_delay (for example) such that the largest delay receives rank one.

flights %>%
  filter(arr_delay>0) %>%
  group_by(carrier) %>%
  summarise(delay = mean(arr_delay)) %>%
  arrange(desc(delay))

## # A tibble: 16 × 2
##    carrier delay
##    <chr>   <dbl>
##  1 OO       60.6
##  2 YV       51.1
##  3 9E       49.3
##  4 EV       48.3
##  5 F9       47.6
##  6 VX       43.8
##  7 FL       41.1
##  8 WN       40.7
##  9 B6       40.0
## 10 AA       38.3
## 11 MQ       37.9
## 12 DL       37.7
## 13 UA       36.7
## 14 HA       35.0
## 15 AS       34.4
## 16 US       29.0