tidyverse (Part 2): Data manipulation with dplyrFirst, before proceeding with this tutorial, we will have to install the following required packages:
hflightstidyverse (This suite of packages include the package that we are considering within this tutorial, dplyr)The following chunk will check each package’s availability and install them if necessary
required_packages <- c('hflights', 'dplyr')
for (p in required_packages) {
if(!require(p,character.only = TRUE)) {
install.packages(p, dep = TRUE)
}
}
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges, developed by Hadley Wickham and Romain Francais. dplyr is the next iterations of plyr, focusing on only dataframes; thus, it is faster, has a more consistent API and should be easier to use.
To load or install the package, the easiest way is to load or install the whole suite of packages tidyverse. Alternatively, one can also independently load or install dplyr:
library('tidyverse')
## ── Attaching packages ──────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
or
library('dplyr')
dplyr provides a consistent set of verbs that help one solve the most common data manipulation challenges, for example
mutate() and trasmute add new varibales as functions of existing varibalesfilter() selects cases based on their valuesarrange() changes the ordering of the rowssummarise() reduces multiple values down to a single summaryselect() and rename() chooses variables based on their namessample_n() and sample_frac() takes random variables # DatasetTo explore the basic data manipulation functions within dplyr, we proceed with a built in hflights dataframe from the package with the same name.
First of all, we will load the package containing the dataset:
library('hflights')
This dataset contains 227494 flights that departed from Houston in the year 2011, collected by the US Bureau of Transportation Statistics. The documentations of this dataset can be found using the command ?hlights as follow:
?hflights
Now, we will find out the dimensions of this dataset, its class, along with displaying the first few rows in order to get an overview of how this dataset looks like
dim(hflights)
## [1] 227496 21
class(hflights)
## [1] "data.frame"
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
As above, we can see that this dataset is implemented into R under the class data.frame, which is exactly what dplyr is designed to work with. However, as this can be considered to be a large dataset, it would be easier and worthwhile to change it to a tibble with the function tbl_df.
Here, we will discuss what is a tibble, and see why is it worth it to change a large dataset from class data.frame to tibble while working with dplyr.
A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles do much less and complain much more than data frames, which means that one has to confront problems earlier, typically leading to cleaner, more expressive and productive code. To read more about the features, please refer to the fignette
vignette('tibble')
Here, we proceed to change the data frame hflights to tibble.
hflights <- tbl_df(hflights)
head(hflights)
## # A tibble: 6 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## # … with 13 more variables: TailNum <chr>, ActualElapsedTime <int>,
## # AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
## # Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
class(hflights)
## [1] "tbl_df" "tbl" "data.frame"
filter()The verb filter() allows one to selecta subset of the rows of a data frame. The first argument would be the name of the data frame, while the subsequent (following) arguments would refer to the variables within that data frame, selecting rows where the condition holds TRUE.
For example, to filter the flights departing from Houston on the final day of the year 2011 (i.e December 31st), we can select:
filter(hflights, Month == 12, DayofMonth == 31)
## # A tibble: 509 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 12 31 6 556 918 AA 466
## 2 2011 12 31 6 1156 1254 AA 865
## 3 2011 12 31 6 1616 1714 AA 1033
## 4 2011 12 31 6 1729 2044 AA 1294
## 5 2011 12 31 6 1441 1542 AA 1740
## 6 2011 12 31 6 901 1013 AA 1788
## 7 2011 12 31 6 1024 1343 AA 2234
## 8 2011 12 31 6 1843 2131 AS 731
## 9 2011 12 31 6 831 1251 B6 620
## 10 2011 12 31 6 1544 2009 B6 622
## # … with 499 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
The above function is roughly equivalent to the following base R function:
hflights[hflights$Month == 12 & hflights$DayofMonth == 31,]
## # A tibble: 509 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 12 31 6 556 918 AA 466
## 2 2011 12 31 6 1156 1254 AA 865
## 3 2011 12 31 6 1616 1714 AA 1033
## 4 2011 12 31 6 1729 2044 AA 1294
## 5 2011 12 31 6 1441 1542 AA 1740
## 6 2011 12 31 6 901 1013 AA 1788
## 7 2011 12 31 6 1024 1343 AA 2234
## 8 2011 12 31 6 1843 2131 AS 731
## 9 2011 12 31 6 831 1251 B6 620
## 10 2011 12 31 6 1544 2009 B6 622
## # … with 499 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
arrange()Instead of filetring or selecting rows from the given conditions, the function arrange() reorders them. This functions takes a data frame as an argument, and a set of column names following to order. If there are more than one column name given, the additional columns will be used to break ties in the value of the preceding columns:
arrange(hflights, Year, Month, DayofMonth)
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 1 6 728 840 AA 460
## 3 2011 1 1 6 1631 1736 AA 1121
## 4 2011 1 1 6 1756 2112 AA 1294
## 5 2011 1 1 6 1012 1347 AA 1700
## 6 2011 1 1 6 1211 1325 AA 1820
## 7 2011 1 1 6 557 906 AA 1994
## 8 2011 1 1 6 1824 2106 AS 731
## 9 2011 1 1 6 654 1124 B6 620
## 10 2011 1 1 6 1639 2110 B6 622
## # … with 227,486 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
The default order of arrange() is ascending; however, one can arrange the columns in descending order with desc():
arrange(hflights, desc(ArrDelay))
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 12 12 1 650 808 AA 1740
## 2 2011 8 1 1 156 452 CO 1
## 3 2011 11 8 2 721 948 MQ 3786
## 4 2011 6 21 2 2334 124 UA 855
## 5 2011 5 20 5 858 1027 MQ 3328
## 6 2011 6 9 4 2029 2243 MQ 3859
## 7 2011 1 20 4 635 807 CO 59
## 8 2011 6 22 3 908 1040 CO 595
## 9 2011 12 13 2 706 824 MQ 3328
## 10 2011 10 25 2 2310 149 DL 1215
## # … with 227,486 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
The verb arrange() works the exact same way as another built-in function of R (order()), but with much less typing. What we have already done with arrange() can also be written with order() as:
hflights[order(hflights$DayofMonth, hflights$Month, hflights$Year),]
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 1 6 728 840 AA 460
## 3 2011 1 1 6 1631 1736 AA 1121
## 4 2011 1 1 6 1756 2112 AA 1294
## 5 2011 1 1 6 1012 1347 AA 1700
## 6 2011 1 1 6 1211 1325 AA 1820
## 7 2011 1 1 6 557 906 AA 1994
## 8 2011 1 1 6 1824 2106 AS 731
## 9 2011 1 1 6 654 1124 B6 620
## 10 2011 1 1 6 1639 2110 B6 622
## # … with 227,486 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
hflights[order(desc(hflights$ArrTime)),]
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 2 7 2058 2400 WN 51
## 2 2011 2 4 5 2301 2400 XE 2450
## 3 2011 2 28 1 2116 2400 XE 2680
## 4 2011 2 25 5 2302 2400 XE 2823
## 5 2011 2 4 5 2300 2400 XE 2956
## 6 2011 3 31 4 2112 2400 OO 1156
## 7 2011 3 15 2 2113 2400 XE 2708
## 8 2011 3 16 3 2110 2400 XE 2708
## 9 2011 3 20 7 2106 2400 XE 2708
## 10 2011 3 17 4 2258 2400 XE 2823
## # … with 227,486 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
or
hflights[order(-hflights$ArrTime),]
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 2 7 2058 2400 WN 51
## 2 2011 2 4 5 2301 2400 XE 2450
## 3 2011 2 28 1 2116 2400 XE 2680
## 4 2011 2 25 5 2302 2400 XE 2823
## 5 2011 2 4 5 2300 2400 XE 2956
## 6 2011 3 31 4 2112 2400 OO 1156
## 7 2011 3 15 2 2113 2400 XE 2708
## 8 2011 3 16 3 2110 2400 XE 2708
## 9 2011 3 20 7 2106 2400 XE 2708
## 10 2011 3 17 4 2258 2400 XE 2823
## # … with 227,486 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
select()There are often times when one works with a dataset that there are only a few of the columns are of his/her interest. With that being said, dplyr allows the selection of a subset by column that usually only work for numerica variable positions:
#select columns by names
select(hflights, Year, Month, DayofMonth)
## # A tibble: 227,496 x 3
## Year Month DayofMonth
## <int> <int> <int>
## 1 2011 1 1
## 2 2011 1 2
## 3 2011 1 3
## 4 2011 1 4
## 5 2011 1 5
## 6 2011 1 6
## 7 2011 1 7
## 8 2011 1 8
## 9 2011 1 9
## 10 2011 1 10
## # … with 227,486 more rows
#select columns between two columns' names (inclusive)
select(hflights, Year:DayofMonth)
## # A tibble: 227,496 x 3
## Year Month DayofMonth
## <int> <int> <int>
## 1 2011 1 1
## 2 2011 1 2
## 3 2011 1 3
## 4 2011 1 4
## 5 2011 1 5
## 6 2011 1 6
## 7 2011 1 7
## 8 2011 1 8
## 9 2011 1 9
## 10 2011 1 10
## # … with 227,486 more rows
#select all columns apart from the columns between two columns' names
select(hflights, -(Year:DayofMonth))
## # A tibble: 227,496 x 18
## DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum
## <int> <int> <int> <chr> <int> <chr>
## 1 6 1400 1500 AA 428 N576AA
## 2 7 1401 1501 AA 428 N557AA
## 3 1 1352 1502 AA 428 N541AA
## 4 2 1403 1513 AA 428 N403AA
## 5 3 1405 1507 AA 428 N492AA
## 6 4 1359 1503 AA 428 N262AA
## 7 5 1359 1509 AA 428 N493AA
## 8 6 1355 1454 AA 428 N477AA
## 9 7 1443 1554 AA 428 N476AA
## 10 1 1443 1553 AA 428 N504AA
## # … with 227,486 more rows, and 12 more variables:
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
There are a number of helper functions one can use within select, for example:
starts_with(): starts with a prefixends_with(): ends with a prefixcontains(): contains a literla stringmathces(): matches a regular expressionnum_range(): matches a numerical range like x01, x02, x03one_of(): matches variable names in a character vectorThese helper functions will match larger blocks of varibales that meet certain criteria. For more information, please refer to ?select().
While using select(), one can also rename variables by using named arguments:
select(hflights, Tail_Num = TailNum)
## # A tibble: 227,496 x 1
## Tail_Num
## <chr>
## 1 N576AA
## 2 N557AA
## 3 N541AA
## 4 N403AA
## 5 N492AA
## 6 N262AA
## 7 N493AA
## 8 N477AA
## 9 N476AA
## 10 N504AA
## # … with 227,486 more rows
Moreover, in order to rename a variable without having to choose it using select() and remove all the other variables, one can simply use rename() instead:
rename(hflights, Tail_Num = TailNum)
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## 7 2011 1 7 5 1359 1509 AA 428
## 8 2011 1 8 6 1355 1454 AA 428
## 9 2011 1 9 7 1443 1554 AA 428
## 10 2011 1 10 1 1443 1553 AA 428
## # … with 227,486 more rows, and 13 more variables: Tail_Num <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
mutate()Besides selecting a sub set from the set of existing columns, there are often times when we have to add new columns as a function of existing columns. This is the place where mutate() comes to work.
mutate(hflights,
Diff = ArrDelay - DepDelay,
AvgSpeed = Distance / AirTime * 60)
## # A tibble: 227,496 x 23
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## 7 2011 1 7 5 1359 1509 AA 428
## 8 2011 1 8 6 1355 1454 AA 428
## 9 2011 1 9 7 1443 1554 AA 428
## 10 2011 1 10 1 1443 1553 AA 428
## # … with 227,486 more rows, and 15 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, Diff <int>, AvgSpeed <dbl>
dplyr::mutate() is pretty similar to the function transform() from the package base; however, the difference here is that mutate() allows you to refer to the new columns that you have just created:
mutate(hflights,
Diff = ArrDelay - DepDelay,
DiffPerHour = Diff / (AirTime / 60))
## # A tibble: 227,496 x 23
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## 7 2011 1 7 5 1359 1509 AA 428
## 8 2011 1 8 6 1355 1454 AA 428
## 9 2011 1 9 7 1443 1554 AA 428
## 10 2011 1 10 1 1443 1553 AA 428
## # … with 227,486 more rows, and 15 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, Diff <int>, DiffPerHour <dbl>
If we only want to keeps the new variables that have just been created, it can simply be done by using transmute():
transmute(hflights,
Diff = ArrDelay - DepDelay,
DiffPerHour = Diff / (AirTime / 60))
## # A tibble: 227,496 x 2
## Diff DiffPerHour
## <int> <dbl>
## 1 -10 -15
## 2 -10 -13.3
## 3 0 0
## 4 0 0
## 5 -8 -10.9
## 6 -6 -8
## 7 0 0
## 8 -11 -16.5
## 9 1 1.46
## 10 0 0
## # … with 227,486 more rows
sample_n() and sample_frac()We can take a random sample of rows from the dataset, using sample_n() and sample_frac():
sample_n() for a fixed number of rowssample_frac() for a fixed fraction of rowssample_n(hflights, 10)
## # A tibble: 10 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 8 7 7 2106 2232 OO 5835
## 2 2011 10 6 4 1759 1849 WN 48
## 3 2011 5 28 6 1452 1510 OO 1196
## 4 2011 4 26 2 1139 1551 CO 409
## 5 2011 10 4 2 1823 1937 WN 1592
## 6 2011 3 17 4 1549 1625 OO 1173
## 7 2011 7 31 7 1700 1751 WN 44
## 8 2011 7 15 5 1503 1602 XE 2025
## 9 2011 7 16 6 2126 2227 MQ 3724
## 10 2011 4 7 4 2119 2244 CO 1597
## # … with 13 more variables: TailNum <chr>, ActualElapsedTime <int>,
## # AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
## # Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>
sample_frac(hflights, 0.001)
## # A tibble: 227 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 18 2 NA NA CO 106
## 2 2011 7 17 7 1908 2001 CO 1048
## 3 2011 6 6 1 1432 1600 CO 137
## 4 2011 7 11 1 1256 1626 CO 1645
## 5 2011 1 20 4 1120 1416 XE 2340
## 6 2011 9 2 5 1140 1254 CO 1023
## 7 2011 4 18 1 1447 1554 XE 2025
## 8 2011 5 14 6 1637 1804 XE 2840
## 9 2011 3 28 1 1301 1530 XE 2041
## 10 2011 6 28 2 1026 1357 AA 1700
## # … with 217 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
summarise()The last verb in the package dplyr is summarise(), where it collapses a data frame into a single row, containing values based on the function that we indicate within the argument.
summarise(hflights,
MeanDepDelay = mean(DepDelay, na.rm = TRUE))
## # A tibble: 1 x 1
## MeanDepDelay
## <dbl>
## 1 9.44
summarise(hflights,
MaxArrDelay = max(ArrDelay, na.rm = TRUE))
## # A tibble: 1 x 1
## MaxArrDelay
## <int>
## 1 978
Most of the time, the verbs (or functions) in dplyr can be used on their own. However, whenever we need to apply those functions to groups of observations within a dataset, there is rhe group_by() function. This allows a dataset to be broken down into specific groups of rows. Whenever one applies the verbs within this package on the resulting dataset, it will automatically applied by groups.
by_tailnum <- group_by(hflights, TailNum)
delay <- summarise(by_tailnum,
count = n(),
dist = mean(Distance, na.rm = TRUE),
delay = mean(ArrDelay, na.rm = TRUE))
delay
## # A tibble: 3,320 x 4
## TailNum count dist delay
## <chr> <int> <dbl> <dbl>
## 1 "" 795 939. NaN
## 2 N0EGMQ 40 1095. 1.92
## 3 N10156 317 802. 8.20
## 4 N10575 94 632. 18.1
## 5 N11106 308 775. 10.1
## 6 N11107 345 768. 8.05
## 7 N11109 331 772. 10.3
## 8 N11113 282 773. 4.06
## 9 N11119 130 790. 7.40
## 10 N11121 333 775. 6.74
## # … with 3,310 more rows
by_tailnum
## # A tibble: 227,496 x 21
## # Groups: TailNum [3,320]
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## 7 2011 1 7 5 1359 1509 AA 428
## 8 2011 1 8 6 1355 1454 AA 428
## 9 2011 1 9 7 1443 1554 AA 428
## 10 2011 1 10 1 1443 1553 AA 428
## # … with 227,486 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
Before the existence of pipes, there are two usual ways to apply different functions one after another: using intermediate steps, or writing nested function.
As a recent addition to R, pipes allow us to take the output of one function and send it directly to the next. Moreover, when a data frame is being passed through functions through a pipe, we do not need to include it as an argument to these functions anymore.
hflights %>%
group_by(Year, Month, DayofMonth) %>%
filter(UniqueCarrier == "AA") %>%
select(ArrDelay, DepDelay) %>%
summarise(
arr = mean(ArrDelay, na.rm = TRUE),
dep = mean(DepDelay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
## Adding missing grouping variables: `Year`, `Month`, `DayofMonth`
## # A tibble: 20 x 5
## # Groups: Year, Month [11]
## Year Month DayofMonth arr dep
## <int> <int> <int> <dbl> <dbl>
## 1 2011 1 9 28.2 31.5
## 2 2011 2 3 24.2 30.1
## 3 2011 2 4 76.8 69
## 4 2011 3 3 27.9 34.6
## 5 2011 3 17 29.9 35.2
## 6 2011 4 4 56.4 39.1
## 7 2011 4 15 39.2 37.4
## 8 2011 4 25 36.9 14.6
## 9 2011 5 12 54.2 51.9
## 10 2011 5 20 74.8 58.5
## 11 2011 5 25 41.9 45.7
## 12 2011 6 21 43 47
## 13 2011 6 24 30.7 35.3
## 14 2011 8 19 19.7 32.2
## 15 2011 9 27 21.2 32.1
## 16 2011 10 9 47 54.4
## 17 2011 10 23 54 63.4
## 18 2011 11 19 74 76.3
## 19 2011 12 12 116. 124.
## 20 2011 12 22 75.9 82
Often, when working with disparate datasets, we might want to join the data together on (a) common column(s). Moreover, there are often times that we are not able to upload the datasets to a database in order to join using SQL. This is when dplyr comes in.
The general syntax of these joins is as follow:
join_type(x, y, by = column_to_join)
Currently, dplyr supports four types of mutating joins, two types of filtering joins, and a nesting joins.
To see how each of the join types works, we will take a look at a small example below, with two dataframes: one containing purchasing habits of customers and another containing demographic information about such customers. These dataframes can be seen from a loyalty scheme, and is taken as a reference (with modifications) from the article “Joining data with dplyr in R” by Holly Emblem.
purchasing_habits <- data.frame("Customer ID"= c(1:3,5),
"Number of orders"= c(5, 10, 4, 2),
"Total value of orders" = c(500, 240, 40, 100),
"Average value of orders" = c(100,24,10,50),
"Last order date" = as.Date(c("12/07/2018", "01/02/2018", "03/06/2017", "24/12/2017"), "%d/%m/%y"))
purchasing_habits
## Customer.ID Number.of.orders Total.value.of.orders
## 1 1 5 500
## 2 2 10 240
## 3 3 4 40
## 4 5 2 100
## Average.value.of.orders Last.order.date
## 1 100 2020-07-12
## 2 24 2020-02-01
## 3 10 2020-06-03
## 4 50 2020-12-24
customer_information <- data.frame("Customer ID" = c(1,2,4),
"Age" = c(50,44,30),
"Gender" = c('M','F','F'),
"Favorite product" = c('T-shirts','Leggings','Jumpers'))
customer_information
## Customer.ID Age Gender Favorite.product
## 1 1 50 M T-shirts
## 2 2 44 F Leggings
## 3 4 30 F Jumpers
First, mutating joins combine variables from the two data frames:
inner_join() returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.inner_join(purchasing_habits, customer_information, by = "Customer.ID")
## Customer.ID Number.of.orders Total.value.of.orders
## 1 1 5 500
## 2 2 10 240
## Average.value.of.orders Last.order.date Age Gender Favorite.product
## 1 100 2020-07-12 50 M T-shirts
## 2 24 2020-02-01 44 F Leggings
left_join() returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.left_join(purchasing_habits, customer_information, by = "Customer.ID")
## Customer.ID Number.of.orders Total.value.of.orders
## 1 1 5 500
## 2 2 10 240
## 3 3 4 40
## 4 5 2 100
## Average.value.of.orders Last.order.date Age Gender Favorite.product
## 1 100 2020-07-12 50 M T-shirts
## 2 24 2020-02-01 44 F Leggings
## 3 10 2020-06-03 NA <NA> <NA>
## 4 50 2020-12-24 NA <NA> <NA>
right_join() returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returnedright_join(purchasing_habits, customer_information, by = "Customer.ID")
## Customer.ID Number.of.orders Total.value.of.orders
## 1 1 5 500
## 2 2 10 240
## 3 4 NA NA
## Average.value.of.orders Last.order.date Age Gender Favorite.product
## 1 100 2020-07-12 50 M T-shirts
## 2 24 2020-02-01 44 F Leggings
## 3 NA <NA> 30 F Jumpers
full_join() return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.full_join(purchasing_habits, customer_information, by = "Customer.ID")
## Customer.ID Number.of.orders Total.value.of.orders
## 1 1 5 500
## 2 2 10 240
## 3 3 4 40
## 4 5 2 100
## 5 4 NA NA
## Average.value.of.orders Last.order.date Age Gender Favorite.product
## 1 100 2020-07-12 50 M T-shirts
## 2 24 2020-02-01 44 F Leggings
## 3 10 2020-06-03 NA <NA> <NA>
## 4 50 2020-12-24 NA <NA> <NA>
## 5 NA <NA> 30 F Jumpers
Filtering joins keep cases from the left-hand data frame:
semi_join() return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.semi_join(purchasing_habits, customer_information, by = "Customer.ID")
## Customer.ID Number.of.orders Total.value.of.orders
## 1 1 5 500
## 2 2 10 240
## Average.value.of.orders Last.order.date
## 1 100 2020-07-12
## 2 24 2020-02-01
anti_join() return all rows from x where there are not matching values in y, keeping just columns from x.anti_join(purchasing_habits, customer_information, by = "Customer.ID")
## Customer.ID Number.of.orders Total.value.of.orders
## 1 3 4 40
## 2 5 2 100
## Average.value.of.orders Last.order.date
## 1 10 2020-06-03
## 2 50 2020-12-24
Finally, nesting joins create a list column of data.frames
nest_join() return all rows and all columns from x. Adds a list column of tibbles. Each tibble contains all the rows from y that match that row of x. When there is no match, the list column is a 0-row tibble with the same column names and types as y. nest_join() is the most fundamental join since you can recreate the other joins from it.
inner_join() is a nest_join() plus an tidyr::unnest()left_join() is a nest_join() plus an unnest(.drop = FALSE)semi_join() is a nest_join() plus a filter() where you check that every element of data has at least one rowAn anti_join() is a nest_join() plus a filter() where you check every element has zero rows.
dplyrThere are two main drawbacks considering the usage of dplyr:
dplyr arguments are not referentially transparent. In other words, you can not replace a value with an equivalent subject that has been pre-defined elsewhere. For example, the codefilter(customer_information, Customer.ID == 1)
## Customer.ID Age Gender Favorite.product
## 1 1 50 M T-shirts
is neither equivalent to
x <- Customer.ID
filter(customer_information. x == 1)
nor to
x <- "Customer.ID"
filter(customer_information, x == 1)
## [1] Customer.ID Age Gender Favorite.product
## <0 rows> (or 0-length row.names)
dplyr code is ambiguous, which means that it will be useful by saving time typing and spotting problems, but it makes functions and their behaviors more unpredictable.“Dplyr Reference.” n.d. https://dplyr.tidyverse.org/reference/.
Emblem, Holly. 23AD–2018. “Joining Data with Dplyr in R.” Medium, July. https://medium.com/@HollyEmblem/joining-data-with-dplyr-in-r-874698eb8898.
Hadley Wickham, Lionel Henry, Romain Francois. 2019. Dplyr: Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.
“Introduction to Dplyr.” n.d. https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html.
“Programming with Dplyr.” n.d. https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html.
Wickham, Hadley. 2013. Hflights: Flights That Departed Houston in 2011. https://CRAN.R-project.org/package=hflights.