First, before proceeding with this tutorial, we will have to install the following required packages:

Package hflights
Package tidyverse (This suite of packages include the package that we are considering within this tutorial, dplyr)

The following chunk will check each package’s availability and install them if necessary

required_packages <- c('hflights', 'dplyr')
for (p in required_packages) {
  if(!require(p,character.only = TRUE)) {
    install.packages(p, dep = TRUE)
  }
}

1 Introduction

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges, developed by Hadley Wickham and Romain Francais. dplyr is the next iterations of plyr, focusing on only dataframes; thus, it is faster, has a more consistent API and should be easier to use.

To load or install the package, the easiest way is to load or install the whole suite of packages tidyverse. Alternatively, one can also independently load or install dplyr:

library('tidyverse')

## ── Attaching packages ──────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.1     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library('dplyr')

2 Overview

dplyr provides a consistent set of verbs that help one solve the most common data manipulation challenges, for example

mutate() and trasmute add new varibales as functions of existing varibales
filter() selects cases based on their values
arrange() changes the ordering of the rows
summarise() reduces multiple values down to a single summary
select() and rename() chooses variables based on their names
sample_n() and sample_frac() takes random variables # Dataset

To explore the basic data manipulation functions within dplyr, we proceed with a built in hflights dataframe from the package with the same name.

First of all, we will load the package containing the dataset:

library('hflights')

This dataset contains 227494 flights that departed from Houston in the year 2011, collected by the US Bureau of Transportation Statistics. The documentations of this dataset can be found using the command ?hlights as follow:

?hflights

Now, we will find out the dimensions of this dataset, its class, along with displaying the first few rows in order to get an overview of how this dataset looks like

dim(hflights)

## [1] 227496     21

class(hflights)

## [1] "data.frame"

head(hflights)

##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0

As above, we can see that this dataset is implemented into R under the class data.frame, which is exactly what dplyr is designed to work with. However, as this can be considered to be a large dataset, it would be easier and worthwhile to change it to a tibble with the function tbl_df.

Here, we will discuss what is a tibble, and see why is it worth it to change a large dataset from class data.frame to tibble while working with dplyr.

A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles do much less and complain much more than data frames, which means that one has to confront problems earlier, typically leading to cleaner, more expressive and productive code. To read more about the features, please refer to the fignette

vignette('tibble')

Here, we proceed to change the data frame hflights to tibble.

hflights <- tbl_df(hflights)
head(hflights)

## # A tibble: 6 x 21
##    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
## 1  2011     1          1         6    1400    1500 AA                  428
## 2  2011     1          2         7    1401    1501 AA                  428
## 3  2011     1          3         1    1352    1502 AA                  428
## 4  2011     1          4         2    1403    1513 AA                  428
## 5  2011     1          5         3    1405    1507 AA                  428
## 6  2011     1          6         4    1359    1503 AA                  428
## # … with 13 more variables: TailNum <chr>, ActualElapsedTime <int>,
## #   AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
## #   Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

class(hflights)

## [1] "tbl_df"     "tbl"        "data.frame"

2.1 Filter rows with `filter()`

The verb filter() allows one to selecta subset of the rows of a data frame. The first argument would be the name of the data frame, while the subsequent (following) arguments would refer to the variables within that data frame, selecting rows where the condition holds TRUE.

For example, to filter the flights departing from Houston on the final day of the year 2011 (i.e December 31st), we can select:

filter(hflights, Month == 12, DayofMonth == 31)

## # A tibble: 509 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011    12         31         6     556     918 AA                  466
##  2  2011    12         31         6    1156    1254 AA                  865
##  3  2011    12         31         6    1616    1714 AA                 1033
##  4  2011    12         31         6    1729    2044 AA                 1294
##  5  2011    12         31         6    1441    1542 AA                 1740
##  6  2011    12         31         6     901    1013 AA                 1788
##  7  2011    12         31         6    1024    1343 AA                 2234
##  8  2011    12         31         6    1843    2131 AS                  731
##  9  2011    12         31         6     831    1251 B6                  620
## 10  2011    12         31         6    1544    2009 B6                  622
## # … with 499 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

The above function is roughly equivalent to the following base R function:

hflights[hflights$Month == 12 & hflights$DayofMonth == 31,]

## # A tibble: 509 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011    12         31         6     556     918 AA                  466
##  2  2011    12         31         6    1156    1254 AA                  865
##  3  2011    12         31         6    1616    1714 AA                 1033
##  4  2011    12         31         6    1729    2044 AA                 1294
##  5  2011    12         31         6    1441    1542 AA                 1740
##  6  2011    12         31         6     901    1013 AA                 1788
##  7  2011    12         31         6    1024    1343 AA                 2234
##  8  2011    12         31         6    1843    2131 AS                  731
##  9  2011    12         31         6     831    1251 B6                  620
## 10  2011    12         31         6    1544    2009 B6                  622
## # … with 499 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

2.2 Arrange rows with `arrange()`

Instead of filetring or selecting rows from the given conditions, the function arrange() reorders them. This functions takes a data frame as an argument, and a set of column names following to order. If there are more than one column name given, the additional columns will be used to break ties in the value of the preceding columns:

arrange(hflights, Year, Month, DayofMonth)

## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          1         6     728     840 AA                  460
##  3  2011     1          1         6    1631    1736 AA                 1121
##  4  2011     1          1         6    1756    2112 AA                 1294
##  5  2011     1          1         6    1012    1347 AA                 1700
##  6  2011     1          1         6    1211    1325 AA                 1820
##  7  2011     1          1         6     557     906 AA                 1994
##  8  2011     1          1         6    1824    2106 AS                  731
##  9  2011     1          1         6     654    1124 B6                  620
## 10  2011     1          1         6    1639    2110 B6                  622
## # … with 227,486 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

The default order of arrange() is ascending; however, one can arrange the columns in descending order with desc():

arrange(hflights, desc(ArrDelay))

## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011    12         12         1     650     808 AA                 1740
##  2  2011     8          1         1     156     452 CO                    1
##  3  2011    11          8         2     721     948 MQ                 3786
##  4  2011     6         21         2    2334     124 UA                  855
##  5  2011     5         20         5     858    1027 MQ                 3328
##  6  2011     6          9         4    2029    2243 MQ                 3859
##  7  2011     1         20         4     635     807 CO                   59
##  8  2011     6         22         3     908    1040 CO                  595
##  9  2011    12         13         2     706     824 MQ                 3328
## 10  2011    10         25         2    2310     149 DL                 1215
## # … with 227,486 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

The verb arrange() works the exact same way as another built-in function of R (order()), but with much less typing. What we have already done with arrange() can also be written with order() as:

hflights[order(hflights$DayofMonth, hflights$Month, hflights$Year),]

## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          1         6     728     840 AA                  460
##  3  2011     1          1         6    1631    1736 AA                 1121
##  4  2011     1          1         6    1756    2112 AA                 1294
##  5  2011     1          1         6    1012    1347 AA                 1700
##  6  2011     1          1         6    1211    1325 AA                 1820
##  7  2011     1          1         6     557     906 AA                 1994
##  8  2011     1          1         6    1824    2106 AS                  731
##  9  2011     1          1         6     654    1124 B6                  620
## 10  2011     1          1         6    1639    2110 B6                  622
## # … with 227,486 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

hflights[order(desc(hflights$ArrTime)),]

## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          2         7    2058    2400 WN                   51
##  2  2011     2          4         5    2301    2400 XE                 2450
##  3  2011     2         28         1    2116    2400 XE                 2680
##  4  2011     2         25         5    2302    2400 XE                 2823
##  5  2011     2          4         5    2300    2400 XE                 2956
##  6  2011     3         31         4    2112    2400 OO                 1156
##  7  2011     3         15         2    2113    2400 XE                 2708
##  8  2011     3         16         3    2110    2400 XE                 2708
##  9  2011     3         20         7    2106    2400 XE                 2708
## 10  2011     3         17         4    2258    2400 XE                 2823
## # … with 227,486 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

hflights[order(-hflights$ArrTime),]

## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          2         7    2058    2400 WN                   51
##  2  2011     2          4         5    2301    2400 XE                 2450
##  3  2011     2         28         1    2116    2400 XE                 2680
##  4  2011     2         25         5    2302    2400 XE                 2823
##  5  2011     2          4         5    2300    2400 XE                 2956
##  6  2011     3         31         4    2112    2400 OO                 1156
##  7  2011     3         15         2    2113    2400 XE                 2708
##  8  2011     3         16         3    2110    2400 XE                 2708
##  9  2011     3         20         7    2106    2400 XE                 2708
## 10  2011     3         17         4    2258    2400 XE                 2823
## # … with 227,486 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

2.3 Select columns with `select()`

There are often times when one works with a dataset that there are only a few of the columns are of his/her interest. With that being said, dplyr allows the selection of a subset by column that usually only work for numerica variable positions:

#select columns by names 
select(hflights, Year, Month, DayofMonth)

## # A tibble: 227,496 x 3
##     Year Month DayofMonth
##    <int> <int>      <int>
##  1  2011     1          1
##  2  2011     1          2
##  3  2011     1          3
##  4  2011     1          4
##  5  2011     1          5
##  6  2011     1          6
##  7  2011     1          7
##  8  2011     1          8
##  9  2011     1          9
## 10  2011     1         10
## # … with 227,486 more rows

#select columns between two columns' names (inclusive)
select(hflights, Year:DayofMonth)

## # A tibble: 227,496 x 3
##     Year Month DayofMonth
##    <int> <int>      <int>
##  1  2011     1          1
##  2  2011     1          2
##  3  2011     1          3
##  4  2011     1          4
##  5  2011     1          5
##  6  2011     1          6
##  7  2011     1          7
##  8  2011     1          8
##  9  2011     1          9
## 10  2011     1         10
## # … with 227,486 more rows

#select all columns apart from the columns between two columns' names 
select(hflights, -(Year:DayofMonth))

## # A tibble: 227,496 x 18
##    DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
##        <int>   <int>   <int> <chr>             <int> <chr>  
##  1         6    1400    1500 AA                  428 N576AA 
##  2         7    1401    1501 AA                  428 N557AA 
##  3         1    1352    1502 AA                  428 N541AA 
##  4         2    1403    1513 AA                  428 N403AA 
##  5         3    1405    1507 AA                  428 N492AA 
##  6         4    1359    1503 AA                  428 N262AA 
##  7         5    1359    1509 AA                  428 N493AA 
##  8         6    1355    1454 AA                  428 N477AA 
##  9         7    1443    1554 AA                  428 N476AA 
## 10         1    1443    1553 AA                  428 N504AA 
## # … with 227,486 more rows, and 12 more variables:
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

There are a number of helper functions one can use within select, for example:

starts_with(): starts with a prefix
ends_with(): ends with a prefix
contains(): contains a literla string
mathces(): matches a regular expression
num_range(): matches a numerical range like x01, x02, x03
one_of(): matches variable names in a character vector

These helper functions will match larger blocks of varibales that meet certain criteria. For more information, please refer to ?select().

While using select(), one can also rename variables by using named arguments:

select(hflights, Tail_Num = TailNum)

## # A tibble: 227,496 x 1
##    Tail_Num
##    <chr>   
##  1 N576AA  
##  2 N557AA  
##  3 N541AA  
##  4 N403AA  
##  5 N492AA  
##  6 N262AA  
##  7 N493AA  
##  8 N477AA  
##  9 N476AA  
## 10 N504AA  
## # … with 227,486 more rows

Moreover, in order to rename a variable without having to choose it using select() and remove all the other variables, one can simply use rename() instead:

rename(hflights, Tail_Num = TailNum)

## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## # … with 227,486 more rows, and 13 more variables: Tail_Num <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

2.4 Add new columns with `mutate()`

Besides selecting a sub set from the set of existing columns, there are often times when we have to add new columns as a function of existing columns. This is the place where mutate() comes to work.

mutate(hflights, 
       Diff = ArrDelay - DepDelay,
       AvgSpeed = Distance / AirTime * 60)

## # A tibble: 227,496 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## # … with 227,486 more rows, and 15 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, Diff <int>, AvgSpeed <dbl>

dplyr::mutate() is pretty similar to the function transform() from the package base; however, the difference here is that mutate() allows you to refer to the new columns that you have just created:

mutate(hflights,
       Diff = ArrDelay - DepDelay,
       DiffPerHour = Diff / (AirTime / 60))

## # A tibble: 227,496 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## # … with 227,486 more rows, and 15 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, Diff <int>, DiffPerHour <dbl>

If we only want to keeps the new variables that have just been created, it can simply be done by using transmute():

transmute(hflights,
          Diff = ArrDelay - DepDelay,
          DiffPerHour = Diff / (AirTime / 60))

## # A tibble: 227,496 x 2
##     Diff DiffPerHour
##    <int>       <dbl>
##  1   -10      -15   
##  2   -10      -13.3 
##  3     0        0   
##  4     0        0   
##  5    -8      -10.9 
##  6    -6       -8   
##  7     0        0   
##  8   -11      -16.5 
##  9     1        1.46
## 10     0        0   
## # … with 227,486 more rows

2.5 Randomly sample rows with `sample_n()` and `sample_frac()`

We can take a random sample of rows from the dataset, using sample_n() and sample_frac():

sample_n() for a fixed number of rows
sample_frac() for a fixed fraction of rows

sample_n(hflights, 10)

## # A tibble: 10 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     8          7         7    2106    2232 OO                 5835
##  2  2011    10          6         4    1759    1849 WN                   48
##  3  2011     5         28         6    1452    1510 OO                 1196
##  4  2011     4         26         2    1139    1551 CO                  409
##  5  2011    10          4         2    1823    1937 WN                 1592
##  6  2011     3         17         4    1549    1625 OO                 1173
##  7  2011     7         31         7    1700    1751 WN                   44
##  8  2011     7         15         5    1503    1602 XE                 2025
##  9  2011     7         16         6    2126    2227 MQ                 3724
## 10  2011     4          7         4    2119    2244 CO                 1597
## # … with 13 more variables: TailNum <chr>, ActualElapsedTime <int>,
## #   AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
## #   Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>

sample_frac(hflights, 0.001)

## # A tibble: 227 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1         18         2      NA      NA CO                  106
##  2  2011     7         17         7    1908    2001 CO                 1048
##  3  2011     6          6         1    1432    1600 CO                  137
##  4  2011     7         11         1    1256    1626 CO                 1645
##  5  2011     1         20         4    1120    1416 XE                 2340
##  6  2011     9          2         5    1140    1254 CO                 1023
##  7  2011     4         18         1    1447    1554 XE                 2025
##  8  2011     5         14         6    1637    1804 XE                 2840
##  9  2011     3         28         1    1301    1530 XE                 2041
## 10  2011     6         28         2    1026    1357 AA                 1700
## # … with 217 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

2.6 Summarize variables with `summarise()`

The last verb in the package dplyr is summarise(), where it collapses a data frame into a single row, containing values based on the function that we indicate within the argument.

summarise(hflights,
          MeanDepDelay = mean(DepDelay, na.rm = TRUE))

## # A tibble: 1 x 1
##   MeanDepDelay
##          <dbl>
## 1         9.44

summarise(hflights,
          MaxArrDelay = max(ArrDelay, na.rm = TRUE))

## # A tibble: 1 x 1
##   MaxArrDelay
##         <int>
## 1         978

3 Grouped operations

Most of the time, the verbs (or functions) in dplyr can be used on their own. However, whenever we need to apply those functions to groups of observations within a dataset, there is rhe group_by() function. This allows a dataset to be broken down into specific groups of rows. Whenever one applies the verbs within this package on the resulting dataset, it will automatically applied by groups.

by_tailnum <- group_by(hflights, TailNum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(Distance, na.rm = TRUE),
  delay = mean(ArrDelay, na.rm = TRUE))
delay

## # A tibble: 3,320 x 4
##    TailNum count  dist  delay
##    <chr>   <int> <dbl>  <dbl>
##  1 ""        795  939. NaN   
##  2 N0EGMQ     40 1095.   1.92
##  3 N10156    317  802.   8.20
##  4 N10575     94  632.  18.1 
##  5 N11106    308  775.  10.1 
##  6 N11107    345  768.   8.05
##  7 N11109    331  772.  10.3 
##  8 N11113    282  773.   4.06
##  9 N11119    130  790.   7.40
## 10 N11121    333  775.   6.74
## # … with 3,310 more rows

by_tailnum

## # A tibble: 227,496 x 21
## # Groups:   TailNum [3,320]
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## # … with 227,486 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

4 Piping

Before the existence of pipes, there are two usual ways to apply different functions one after another: using intermediate steps, or writing nested function.

For intermediate steps, we usually create temporary data frames and use them as input for the upcoming functions. However, this can fill up the workspace and memory with lots of objects/datasets.
For nested function, it can be handy; however, it can also be difficult to read if two many functions are nested, as we have to read from inside out.

As a recent addition to R, pipes allow us to take the output of one function and send it directly to the next. Moreover, when a data frame is being passed through functions through a pipe, we do not need to include it as an argument to these functions anymore.

hflights %>%
  group_by(Year, Month, DayofMonth) %>%
  filter(UniqueCarrier == "AA") %>%
  select(ArrDelay, DepDelay) %>%
  summarise(
    arr = mean(ArrDelay, na.rm = TRUE),
    dep = mean(DepDelay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

## Adding missing grouping variables: `Year`, `Month`, `DayofMonth`

## # A tibble: 20 x 5
## # Groups:   Year, Month [11]
##     Year Month DayofMonth   arr   dep
##    <int> <int>      <int> <dbl> <dbl>
##  1  2011     1          9  28.2  31.5
##  2  2011     2          3  24.2  30.1
##  3  2011     2          4  76.8  69  
##  4  2011     3          3  27.9  34.6
##  5  2011     3         17  29.9  35.2
##  6  2011     4          4  56.4  39.1
##  7  2011     4         15  39.2  37.4
##  8  2011     4         25  36.9  14.6
##  9  2011     5         12  54.2  51.9
## 10  2011     5         20  74.8  58.5
## 11  2011     5         25  41.9  45.7
## 12  2011     6         21  43    47  
## 13  2011     6         24  30.7  35.3
## 14  2011     8         19  19.7  32.2
## 15  2011     9         27  21.2  32.1
## 16  2011    10          9  47    54.4
## 17  2011    10         23  54    63.4
## 18  2011    11         19  74    76.3
## 19  2011    12         12 116.  124. 
## 20  2011    12         22  75.9  82

5 Joining datasets

Often, when working with disparate datasets, we might want to join the data together on (a) common column(s). Moreover, there are often times that we are not able to upload the datasets to a database in order to join using SQL. This is when dplyr comes in.

The general syntax of these joins is as follow:

join_type(x, y, by = column_to_join)

Currently, dplyr supports four types of mutating joins, two types of filtering joins, and a nesting joins.

To see how each of the join types works, we will take a look at a small example below, with two dataframes: one containing purchasing habits of customers and another containing demographic information about such customers. These dataframes can be seen from a loyalty scheme, and is taken as a reference (with modifications) from the article “Joining data with dplyr in R” by Holly Emblem.

purchasing_habits <- data.frame("Customer ID"= c(1:3,5), 
                                "Number of orders"= c(5, 10, 4, 2),
                                "Total value of orders" = c(500, 240, 40, 100), 
                                "Average value of orders" = c(100,24,10,50), 
                                "Last order date" = as.Date(c("12/07/2018", "01/02/2018", "03/06/2017", "24/12/2017"), "%d/%m/%y"))

purchasing_habits

##   Customer.ID Number.of.orders Total.value.of.orders
## 1           1                5                   500
## 2           2               10                   240
## 3           3                4                    40
## 4           5                2                   100
##   Average.value.of.orders Last.order.date
## 1                     100      2020-07-12
## 2                      24      2020-02-01
## 3                      10      2020-06-03
## 4                      50      2020-12-24

customer_information <- data.frame("Customer ID" = c(1,2,4),
                                   "Age" = c(50,44,30),
                                   "Gender" = c('M','F','F'),
                                   "Favorite product" = c('T-shirts','Leggings','Jumpers'))

customer_information

##   Customer.ID Age Gender Favorite.product
## 1           1  50      M         T-shirts
## 2           2  44      F         Leggings
## 3           4  30      F          Jumpers

First, mutating joins combine variables from the two data frames:

inner_join() returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

inner_join(purchasing_habits, customer_information, by = "Customer.ID")

##   Customer.ID Number.of.orders Total.value.of.orders
## 1           1                5                   500
## 2           2               10                   240
##   Average.value.of.orders Last.order.date Age Gender Favorite.product
## 1                     100      2020-07-12  50      M         T-shirts
## 2                      24      2020-02-01  44      F         Leggings

left_join() returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

left_join(purchasing_habits, customer_information, by = "Customer.ID")

##   Customer.ID Number.of.orders Total.value.of.orders
## 1           1                5                   500
## 2           2               10                   240
## 3           3                4                    40
## 4           5                2                   100
##   Average.value.of.orders Last.order.date Age Gender Favorite.product
## 1                     100      2020-07-12  50      M         T-shirts
## 2                      24      2020-02-01  44      F         Leggings
## 3                      10      2020-06-03  NA   <NA>             <NA>
## 4                      50      2020-12-24  NA   <NA>             <NA>

right_join() returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned

right_join(purchasing_habits, customer_information, by = "Customer.ID")

##   Customer.ID Number.of.orders Total.value.of.orders
## 1           1                5                   500
## 2           2               10                   240
## 3           4               NA                    NA
##   Average.value.of.orders Last.order.date Age Gender Favorite.product
## 1                     100      2020-07-12  50      M         T-shirts
## 2                      24      2020-02-01  44      F         Leggings
## 3                      NA            <NA>  30      F          Jumpers

full_join() return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

full_join(purchasing_habits, customer_information, by = "Customer.ID")

##   Customer.ID Number.of.orders Total.value.of.orders
## 1           1                5                   500
## 2           2               10                   240
## 3           3                4                    40
## 4           5                2                   100
## 5           4               NA                    NA
##   Average.value.of.orders Last.order.date Age Gender Favorite.product
## 1                     100      2020-07-12  50      M         T-shirts
## 2                      24      2020-02-01  44      F         Leggings
## 3                      10      2020-06-03  NA   <NA>             <NA>
## 4                      50      2020-12-24  NA   <NA>             <NA>
## 5                      NA            <NA>  30      F          Jumpers

Filtering joins keep cases from the left-hand data frame:

semi_join() return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.

semi_join(purchasing_habits, customer_information, by = "Customer.ID")

##   Customer.ID Number.of.orders Total.value.of.orders
## 1           1                5                   500
## 2           2               10                   240
##   Average.value.of.orders Last.order.date
## 1                     100      2020-07-12
## 2                      24      2020-02-01

anti_join() return all rows from x where there are not matching values in y, keeping just columns from x.

anti_join(purchasing_habits, customer_information, by = "Customer.ID")

##   Customer.ID Number.of.orders Total.value.of.orders
## 1           3                4                    40
## 2           5                2                   100
##   Average.value.of.orders Last.order.date
## 1                      10      2020-06-03
## 2                      50      2020-12-24

Finally, nesting joins create a list column of data.frames

nest_join() return all rows and all columns from x. Adds a list column of tibbles. Each tibble contains all the rows from y that match that row of x. When there is no match, the list column is a 0-row tibble with the same column names and types as y. nest_join() is the most fundamental join since you can recreate the other joins from it.
An inner_join() is a nest_join() plus an tidyr::unnest()
An left_join() is a nest_join() plus an unnest(.drop = FALSE)
An semi_join() is a nest_join() plus a filter() where you check that every element of data has at least one row
An anti_join() is a nest_join() plus a filter() where you check every element has zero rows.

6 Drawbacks upon using `dplyr`

There are two main drawbacks considering the usage of dplyr:

First of all, most dplyr arguments are not referentially transparent. In other words, you can not replace a value with an equivalent subject that has been pre-defined elsewhere. For example, the code

filter(customer_information, Customer.ID == 1)

##   Customer.ID Age Gender Favorite.product
## 1           1  50      M         T-shirts

is neither equivalent to

x <- Customer.ID
filter(customer_information. x == 1)

nor to

x <- "Customer.ID"
filter(customer_information, x == 1)

## [1] Customer.ID      Age              Gender           Favorite.product
## <0 rows> (or 0-length row.names)

dplyr code is ambiguous, which means that it will be useful by saving time typing and spotting problems, but it makes functions and their behaviors more unpredictable.

References

“Dplyr Reference.” n.d. https://dplyr.tidyverse.org/reference/.

Emblem, Holly. 23AD–2018. “Joining Data with Dplyr in R.” Medium, July. https://medium.com/@HollyEmblem/joining-data-with-dplyr-in-r-874698eb8898.

Hadley Wickham, Lionel Henry, Romain Francois. 2019. Dplyr: Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

“Introduction to Dplyr.” n.d. https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html.

“Programming with Dplyr.” n.d. https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html.

Wickham, Hadley. 2013. Hflights: Flights That Departed Houston in 2011. https://CRAN.R-project.org/package=hflights.

Introduction to `tidyverse` (Part 2): Data manipulation with `dplyr`

Hoang Anh NGO (École Polytechnique)

13 January 2019

1 Introduction

2 Overview

2.1 Filter rows with `filter()`

2.2 Arrange rows with `arrange()`

2.3 Select columns with `select()`

2.4 Add new columns with `mutate()`

2.5 Randomly sample rows with `sample_n()` and `sample_frac()`

2.6 Summarize variables with `summarise()`

3 Grouped operations

4 Piping

5 Joining datasets

6 Drawbacks upon using `dplyr`

References

Introduction to tidyverse (Part 2): Data manipulation with dplyr

Hoang Anh NGO (École Polytechnique)

13 January 2019

1 Introduction

2 Overview

2.1 Filter rows with filter()

2.2 Arrange rows with arrange()

2.3 Select columns with select()

2.4 Add new columns with mutate()

2.5 Randomly sample rows with sample_n() and sample_frac()

2.6 Summarize variables with summarise()

3 Grouped operations

4 Piping

5 Joining datasets

6 Drawbacks upon using dplyr

References

Introduction to `tidyverse` (Part 2): Data manipulation with `dplyr`

2.1 Filter rows with `filter()`

2.2 Arrange rows with `arrange()`

2.3 Select columns with `select()`

2.4 Add new columns with `mutate()`

2.5 Randomly sample rows with `sample_n()` and `sample_frac()`

2.6 Summarize variables with `summarise()`

6 Drawbacks upon using `dplyr`