Report 1. Pipes & dplyr.
Piping and more
Pipes
The magrittr operators pipe their left-hand side values forward into expressions on the right-hand side, i.e. one can replace f(x) with x %>% f(), where %>% is the (main) pipe operator. The benefit becomes more apparent when several function calls are chained together:
the_data <-
  read.csv('/path/to/data/file.csv') %>%
  subset(variable_a > x) %>%
  transform(variable_c = variable_a/variable_b) %>%
  head(100)
Four operations are performed to arrive at the desired data set, and they are written in a natural order: the same as the order of execution. Also, no temporary variables are needed. If yet another operation is required, it is straightforward to add to the sequence of operations wherever it may be needed.
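As a minimal sketch of the equivalence described above (assuming the magrittr package is installed; the vector x and the chain of functions are illustrative and not part of the original report), a nested call and its piped form give the same result:

```r
library(magrittr)

# Illustrative input, not from the original report
x <- c(1, 4, 9, 16)

# Nested form: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: reads left to right, in the order of execution
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

Each step receives the previous result as its first argument, so extra arguments such as the 1 in round(1) are written after the piped value.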
Assignment Pipe
The assignment pipe, %<>%, is used to update a value by first piping it into one or more rhs expressions, and then assigning the result.
## [1] 0.01890394 0.03808962 0.04359587 0.04718125 0.04988122 0.05349861
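The code chunk for this output is not shown in the report. A sketch of how %<>% might be used, assuming magrittr is installed and using an arbitrary seed (so the values will differ from those printed above):

```r
library(magrittr)

set.seed(1)  # arbitrary seed, chosen only for reproducibility
x <- rnorm(100)

# Equivalent to: x <- x %>% abs() %>% sort()
x %<>% abs() %>% sort()

head(x)
```

The variable x is overwritten in place, which saves repeating its name on both sides of an assignment.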
Pipe
Pipe an object forward into a function or call expression.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
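The output above is consistent with piping the built-in iris data into head(). A minimal sketch, assuming magrittr is installed (the name first_three is illustrative):

```r
library(magrittr)

# Equivalent to head(iris): the piped object becomes the first argument
iris %>% head()

# Further arguments follow the piped one: same as head(iris, n = 3)
first_three <- iris %>% head(n = 3)
first_three
```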
Exposition Pipe
Many functions accept a data argument, e.g. lm and aggregate, which is very useful in a pipeline where data is first processed and then passed into such a function. There are also functions that do not have a data argument, for which it is useful to expose the variables in the data. This is done with the %$% operator:
## [1] 0.3361992
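The chunk that produced this value is not shown. The sketch below follows the example in the magrittr documentation and is presumably close to what was run; it assumes magrittr is installed and uses the built-in iris data:

```r
library(magrittr)

# cor() has no data argument, so %$% exposes the columns of the
# left-hand side by name inside the right-hand side expression
r <- iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)

r
```

This behaves like wrapping the call in with(), but keeps the left-to-right flow of the pipeline.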
Tee Pipe
The tee operator works exactly like %>%, but it returns the left-hand side value rather than the potential result of the right-hand side operations.
This means that the tee operator can come in handy in situations where you have included functions that are used for their side effect, such as plotting with plot() or printing to a file.
In other words, functions like plot() typically return nothing useful (plot() returns NULL invisibly), so a pipeline built only with %>% would effectively end after the call. In the following example, however, the tee operator %T>% lets the pipeline continue even after plot() has been used:
## [1] 9.040591 -10.754680
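The chunk behind this output is not shown. A sketch of a tee pipeline of this kind, assuming magrittr is installed; the seed is arbitrary, so the column sums will not match the values printed above:

```r
library(magrittr)

set.seed(42)  # arbitrary seed; the seed used in the report is unknown

col_sums <- rnorm(200) %>%
  matrix(ncol = 2) %T>%
  plot() %>%   # side effect only: draws a scatter plot, its result is discarded
  colSums()    # receives the matrix, not the NULL returned by plot()

col_sums
```

In a non-interactive session plot() writes to the default graphics device (usually a file named Rplots.pdf) instead of opening a window.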
dplyr
In this section, you will see how well magrittr and dplyr work together in your R code.
For those of you who are new to the dplyr package, you should know that this R package was built around five verbs, namely, “select”, “filter”, “arrange”, “mutate” and “summarize”. If you have already manipulated data for some data science project, you will know that these verbs make up the majority of the data manipulation tasks that you generally need to perform on your data.
Take an example of some traditional code that makes use of these dplyr functions:
library(dplyr)
library(hflights)

grouped_flights <- group_by(hflights, Year, Month, DayofMonth)
flights_data <- select(grouped_flights, Year:DayofMonth, ArrDelay, DepDelay)
summarized_flights <- summarise(flights_data,
                                arr = mean(ArrDelay, na.rm = TRUE),
                                dep = mean(DepDelay, na.rm = TRUE))
final_result <- filter(summarized_flights, arr > 30 | dep > 30)
final_result
## # A tibble: 14 × 5
## # Groups: Year, Month [10]
## Year Month DayofMonth arr dep
## <int> <int> <int> <dbl> <dbl>
## 1 2011 2 4 44.1 47.2
## 2 2011 3 3 35.1 38.2
## 3 2011 3 14 46.6 36.1
## 4 2011 4 4 38.7 27.9
## 5 2011 4 25 37.8 22.3
## 6 2011 5 12 69.5 64.5
## 7 2011 5 20 37.0 26.6
## 8 2011 6 22 65.5 62.3
## 9 2011 7 29 29.6 31.9
## 10 2011 9 29 39.2 32.5
## 11 2011 10 9 61.9 59.5
## 12 2011 11 15 43.7 39.2
## 13 2011 12 29 26.3 30.8
## 14 2011 12 31 46.5 54.2
Rewritten with the pipe, the same computation shows why dplyr and magrittr work so well together:
hflights %>%
  group_by(Year, Month, DayofMonth) %>%
  select(Year:DayofMonth, ArrDelay, DepDelay) %>%
  summarise(arr = mean(ArrDelay, na.rm = TRUE),
            dep = mean(DepDelay, na.rm = TRUE)) %>%
  filter(arr > 30 | dep > 30)
## # A tibble: 14 × 5
## # Groups: Year, Month [10]
## Year Month DayofMonth arr dep
## <int> <int> <int> <dbl> <dbl>
## 1 2011 2 4 44.1 47.2
## 2 2011 3 3 35.1 38.2
## 3 2011 3 14 46.6 36.1
## 4 2011 4 4 38.7 27.9
## 5 2011 4 25 37.8 22.3
## 6 2011 5 12 69.5 64.5
## 7 2011 5 20 37.0 26.6
## 8 2011 6 22 65.5 62.3
## 9 2011 7 29 29.6 31.9
## 10 2011 9 29 39.2 32.5
## 11 2011 10 9 61.9 59.5
## 12 2011 11 15 43.7 39.2
## 13 2011 12 29 26.3 30.8
## 14 2011 12 31 46.5 54.2
Both code chunks are fairly long, but the second is arguably clearer if you want to follow each operation in turn. The intermediate variables in the first chunk make it easy to lose the “flow” of the code. With %>%, you get a clearer overview of the operations performed on the data.
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
- mutate() adds new variables that are functions of existing variables.
- select() picks variables based on their names.
- filter() picks cases based on their values.
- summarise() reduces multiple values down to a single summary.
- arrange() changes the ordering of the rows.
These all combine naturally with group_by() which allows you to perform any operation “by group”.
To practice “dplyr” skills, we will use the flights data set from the nycflights13 package.
This data frame comes from the US Bureau of Transportation Statistics and contains all 336,776 flights that departed from New York City in 2013.
It is documented in ?flights.
Filter
Load the data frame “flights”, then keep only the flights from 1 January. Save the result in a new data frame called “mydata”.
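One possible answer, as a sketch that assumes the dplyr and nycflights13 packages are installed:

```r
library(dplyr)
library(nycflights13)

# Conditions separated by commas are combined with AND
mydata <- flights %>%
  filter(month == 1, day == 1)

nrow(mydata)
```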
Filter flights operated by United (UA), American (AA), or Delta (DL):
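A sketch of one way to do this, assuming dplyr and nycflights13 are installed (the name big_three is illustrative):

```r
library(dplyr)
library(nycflights13)

# %in% is more compact than chaining carrier == "UA" | carrier == "AA" | ...
big_three <- flights %>%
  filter(carrier %in% c("UA", "AA", "DL"))

# Number of flights per carrier in the filtered data
count(big_three, carrier)
```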
Another useful dplyr filtering helper is between(). Filter flights that departed between midnight and 6 am (inclusive). Don’t forget flights that left at exactly midnight (recorded as 2400).
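A possible solution, sketched under the assumption that dplyr and nycflights13 are installed (the name early is illustrative):

```r
library(dplyr)
library(nycflights13)

# between(x, left, right) is shorthand for x >= left & x <= right.
# dep_time is an integer in HHMM form, so 6 am is 600; midnight is
# recorded as 2400 and therefore needs its own condition.
early <- flights %>%
  filter(between(dep_time, 0, 600) | dep_time == 2400)

range(early$dep_time)
```

Rows with a missing dep_time (cancelled flights) are dropped, because filter() keeps only rows where the condition is TRUE.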
Arrange
We often also need the data to be ordered in a certain way. This is easily done in R with the arrange() function.
Again, we can do this in base R, but the code is not always easy to read…
## # A tibble: 336,776 × 2
## carrier dep_delay
## <chr> <dbl>
## 1 B6 -43
## 2 DL -33
## 3 EV -32
## 4 DL -30
## 5 F9 -27
## 6 MQ -26
## 7 EV -25
## 8 MQ -25
## 9 9E -24
## 10 B6 -24
## # ℹ 336,766 more rows
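The base-R chunk itself is not shown. Output of this shape could be produced along the following lines; this is a sketch, assuming nycflights13 is installed, and the name sorted_base is illustrative:

```r
library(nycflights13)

# order() returns the row permutation that sorts dep_delay in ascending
# order; missing delays are placed last by default
sorted_base <- flights[order(flights$dep_delay), c("carrier", "dep_delay")]

head(sorted_base, 10)
```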
Now it’s your turn: use arrange() to reproduce the ordering from the base-R example above:
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 7 2040 2123 -43 40 2352
## 2 2013 2 3 2022 2055 -33 2240 2338
## 3 2013 11 10 1408 1440 -32 1549 1559
## 4 2013 1 11 1900 1930 -30 2233 2243
## 5 2013 1 29 1703 1730 -27 1947 1957
## 6 2013 8 9 729 755 -26 1002 955
## 7 2013 10 23 1907 1932 -25 2143 2143
## 8 2013 3 30 2030 2055 -25 2213 2250
## 9 2013 3 2 1431 1455 -24 1601 1631
## 10 2013 5 5 934 958 -24 1225 1309
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
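The exercise can be answered along these lines (a sketch, assuming dplyr and nycflights13 are installed; the name most_delayed_first is illustrative):

```r
library(dplyr)
library(nycflights13)

# Sort by departure delay, earliest departures first; NAs go last
flights %>% arrange(dep_delay)

# desc() reverses the order, so the most delayed flights come first
most_delayed_first <- flights %>% arrange(desc(dep_delay))
head(most_delayed_first$dep_delay)
```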
Mutate
mutate() returns a new data frame that contains the new variables appended to a copy of the original data set.
Note that when you use mutate() you can create multiple variables at once, and you can even refer to variables that are created earlier in the call to create other variables later in the call:
## # A tibble: 336,776 × 22
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 14 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>, hours <dbl>,
## # gain_per_hour <dbl>
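The chunk is not shown, but the three extra columns in the output (gain, hours, gain_per_hour) suggest a call of the following kind. The exact formulas are an assumption, as is the name flights_gain; dplyr and nycflights13 are assumed to be installed:

```r
library(dplyr)
library(nycflights13)

flights_gain <- flights %>%
  mutate(
    gain = arr_delay - dep_delay,  # assumed definition of gain
    hours = air_time / 60,         # air_time is given in minutes
    gain_per_hour = gain / hours   # refers to columns created just above
  )

# Show only the new columns
flights_gain %>% select(gain, hours, gain_per_hour)
```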
Transmute
mutate() will always return the new variables appended to a copy of the original data. If you want to return only the new variables, use transmute(). In the code below, replace mutate() with transmute() and then spot the difference in the results.
## # A tibble: 336,776 × 22
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 14 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>, hours <dbl>,
## # gain_per_hour <dbl>
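A sketch of the transmute() variant, assuming the original call computed gain, hours and gain_per_hour as the column names in the output suggest (the formulas and the name only_new are assumptions; dplyr and nycflights13 are assumed to be installed):

```r
library(dplyr)
library(nycflights13)

only_new <- flights %>%
  transmute(
    gain = arr_delay - dep_delay,  # assumed definition of gain
    hours = air_time / 60,
    gain_per_hour = gain / hours
  )

# 3 columns instead of the 22 returned by mutate()
dim(only_new)
```

In recent dplyr versions transmute() is marked as superseded; mutate() with .keep = "none" is the suggested alternative.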
Find the 10 most delayed flights (dep_delay) using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank(). Hint: Once you compute a rank, you can filter the data set based on the ranks.
## # A tibble: 10 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
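One way to obtain a result like this, sketched under the assumption that dplyr and nycflights13 are installed (the name top10 is illustrative):

```r
library(dplyr)
library(nycflights13)

top10 <- flights %>%
  # Rank 1 is the largest delay; tied flights share the smallest rank,
  # so a tie at rank 10 would return more than 10 rows
  filter(min_rank(desc(dep_delay)) <= 10) %>%
  arrange(desc(dep_delay))

top10 %>% select(year, month, day, carrier, flight, dep_delay)
```

Flights with a missing dep_delay receive a rank of NA and are dropped by filter(). If exactly 10 rows are required regardless of ties, row_number() could be used in place of min_rank().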
Summarise
summarise() collapses a data frame to a single row of summaries. You get to choose how many summaries appear in the row and how they are computed:
## # A tibble: 1 × 2
## delay total
## <dbl> <dbl>
## 1 12.6 4152200
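The column names in the output (delay, total) point to a call like the one below. This is a sketch, assuming dplyr and nycflights13 are installed; the name overall is illustrative:

```r
library(dplyr)
library(nycflights13)

overall <- flights %>%
  summarise(
    delay = mean(dep_delay, na.rm = TRUE),  # average departure delay in minutes
    total = sum(dep_delay, na.rm = TRUE)    # total departure delay in minutes
  )

overall
```

Without na.rm = TRUE both summaries would be NA, because cancelled flights have a missing dep_delay.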
summarise() is not terribly useful unless you pair it with group_by().
group_by() changes the unit of analysis of the data frame: it assigns observations in the data frame to separate groups, and it instructs dplyr to apply functions separately to each group. group_by() assigns groups by grouping together observations that have the same combinations of values for the variables that you pass to group_by().
by_day <- group_by(flights, year, month, day)
summarise(by_day,
          delay = mean(dep_delay, na.rm = TRUE),
          total = sum(dep_delay, na.rm = TRUE))
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day delay total
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 11.5 9678
## 2 2013 1 2 13.9 12958
## 3 2013 1 3 11.0 9933
## 4 2013 1 4 8.95 8137
## 5 2013 1 5 5.73 4110
## 6 2013 1 6 7.15 5940
## 7 2013 1 7 5.42 5038
## 8 2013 1 8 2.55 2285
## 9 2013 1 9 2.28 2042
## 10 2013 1 10 2.84 2643
## # ℹ 355 more rows
Which carrier has the worst delays?
Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint1: think about flights %>% group_by(carrier, dest) %>% summarise(n())).
Hint2: Use min_rank(desc(avg_delay)) to rank avg_delay (for example) such that the largest delay receives rank one.
# Average arrival delay per carrier, counting only flights that arrived late
flights %>%
  filter(arr_delay > 0) %>%
  group_by(carrier) %>%
  summarise(delay = mean(arr_delay)) %>%
  arrange(desc(delay))
## # A tibble: 16 × 2
## carrier delay
## <chr> <dbl>
## 1 OO 60.6
## 2 YV 51.1
## 3 9E 49.3
## 4 EV 48.3
## 5 F9 47.6
## 6 VX 43.8
## 7 FL 41.1
## 8 WN 40.7
## 9 B6 40.0
## 10 AA 38.3
## 11 MQ 37.9
## 12 DL 37.7
## 13 UA 36.7
## 14 HA 35.0
## 15 AS 34.4
## 16 US 29.0
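For the challenge, the two hints can be combined roughly as follows. This is only a starting point, not a full answer; it assumes dplyr and nycflights13 are installed, and the names carrier_dest, ranked, sole_carrier, n_flights, avg_delay, n_carriers and delay_rank are illustrative:

```r
library(dplyr)
library(nycflights13)

# Hint 1: how many flights does each carrier operate to each destination?
carrier_dest <- flights %>%
  group_by(carrier, dest) %>%
  summarise(n_flights = n(),
            avg_delay = mean(arr_delay, na.rm = TRUE),
            .groups = "drop")

# Hint 2: within each destination, rank the carriers so that the
# largest average delay receives rank one
ranked <- carrier_dest %>%
  group_by(dest) %>%
  mutate(n_carriers = n(),
         delay_rank = min_rank(desc(avg_delay))) %>%
  ungroup()

# Destinations served by a single carrier cannot separate a bad
# carrier from a bad airport
sole_carrier <- ranked %>% filter(n_carriers == 1)
count(sole_carrier, carrier)
```

Comparing carriers only within destinations that several of them serve gives a fairer picture, although routes with very few flights still yield noisy averages.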