Welcome to the interactive exercises part of your dplyr course. Here you will learn the ins and outs of working with dplyr. dplyr is an R package, a collection of functions and data sets that enhance the R language.
Throughout this course you will use dplyr to analyze a data set of airline flight data containing flights that departed from Houston. This data is stored in a package called hflights.
Both dplyr and hflights are already installed on DataCamp’s servers, so loading them with library() will get you up and running.
# Load the dplyr package
library(dplyr)
# Load the hflights package
library(hflights)
As Garrett explained, a tbl (pronounced tibble) is just a special kind of data.frame. They make your data easier to look at, but also easier to work with. On top of this, it is straightforward to derive a tbl from a data.frame structure using as_tibble().
The tbl format changes how R displays your data, but it does not change the data’s underlying data structure. A tbl inherits the original class of its input, in this case, a data.frame. This means that you can still manipulate the tbl as if it were a data.frame. In other words, you can do anything with the hflights tbl that you could do with the hflights data.frame.
# Both the dplyr and hflights packages are loaded
# Convert the hflights data.frame into a hflights tbl
hflights <- as_tibble(hflights)
# Display the hflights tbl
hflights
# A tibble: 227,496 x 21
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
<int> <int> <int> <int> <int> <int> <chr> <int>
1 2011 1 1 6 1400 1500 AA 428
2 2011 1 2 7 1401 1501 AA 428
3 2011 1 3 1 1352 1502 AA 428
4 2011 1 4 2 1403 1513 AA 428
5 2011 1 5 3 1405 1507 AA 428
6 2011 1 6 4 1359 1503 AA 428
7 2011 1 7 5 1359 1509 AA 428
8 2011 1 8 6 1355 1454 AA 428
9 2011 1 9 7 1443 1554 AA 428
10 2011 1 10 1 1443 1553 AA 428
# ... with 227,486 more rows, and 13 more variables: TailNum <chr>,
# ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
# DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
# TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
# Diverted <int>
# Create the object carriers
carriers <- hflights$UniqueCarrier
A bit of cleaning would be a good idea since the UniqueCarrier variable of hflights uses a confusing code system.
To do this, let’s work with a lookup table, that comes in the form of a named vector. When you subset the lookup table with a character string (like the character strings in UniqueCarrier), R will return the values of the lookup table that correspond to the names in the character string. To see how this works, run following code in the console:
two <- c("AA", "AS")
lut <- c("AA" = "American",
"AS" = "Alaska",
"B6" = "JetBlue")
two <- lut[two]
two
AA AS
"American" "Alaska"
# Both the dplyr and hflights packages are loaded into workspace
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
"DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
"WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
"FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
# Add the Carrier column to hflights
hflights$Carrier <- lut[hflights$UniqueCarrier]
# Glimpse at hflights
glimpse(hflights)
Observations: 227,496
Variables: 22
$ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
$ Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ DayofMonth <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ DayOfWeek <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
$ DepTime <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
$ ArrTime <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
$ UniqueCarrier <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
$ FlightNum <int> 428, 428, 428, 428, 428, 428, 428, 428, 428,...
$ TailNum <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N49...
$ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
$ AirTime <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
$ ArrDelay <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
$ DepDelay <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
$ Origin <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
$ Dest <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
$ Distance <int> 224, 224, 224, 224, 224, 224, 224, 224, 224,...
$ TaxiIn <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
$ TaxiOut <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
$ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", ...
$ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ Carrier <chr> "American", "American", "American", "America...
Let’s try a similar thing, but this time to change the labels in the CancellationCode column. This column lists reasons why a flight was cancelled using a non-informative alphabetical code. Execute
unique(hflights$CancellationCode)
[1] "" "A" "B" "C" "D"
A lookup table lut has already been created for you, that converts the alphabetical codes into more meaningful strings.
# The hflights tbl you built in the previous exercise is available in the workspace.
# The lookup table
lut <- c("A" = "carrier", "B" = "weather", "C" = "FFA", "D" = "security", "E" = "not cancelled")
# Add the Code column
# hflights$Code <- lut[hflights$CancellationCode]
# Glimpse at hflights
# glimpse(hflights)
To answer the simple question whether flight delays tend to shrink or grow during a flight, we can safely discard a lot of the variables of each flight. To select only the ones that matter, we can use select().
As an example, take the following call, that selects the variables var1 and var2 from the data frame df.
select(df, var1, var2)
You can also use : to select a range of variables and - to exclude some variables, similar to indexing a data.frame with square brackets. You can use both variable’s names as well as integer indexes. This call selects the four first variables except for the second one of a data frame df:
select(df, 1:4, -2)
select() does not change the data frame it is called on; you have to explicitly assign the result of select() to a variable to store the result.
# hflights is pre-loaded as a tbl, together with the necessary libraries.
# Print out a tbl with the four columns of hflights related to delay
select(hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay)
# A tibble: 227,496 x 4
ActualElapsedTime AirTime ArrDelay DepDelay
<int> <int> <int> <int>
1 60 40 -10 0
2 60 45 -9 1
3 70 48 -8 -8
4 70 39 3 3
5 62 44 -3 5
6 64 45 -7 -1
7 70 43 -1 -1
8 59 40 -16 -5
9 71 41 44 43
10 70 45 43 43
# ... with 227,486 more rows
# Print out the columns Origin up to Cancelled of hflights
select(hflights, Origin:Cancelled)
# A tibble: 227,496 x 6
Origin Dest Distance TaxiIn TaxiOut Cancelled
<chr> <chr> <int> <int> <int> <int>
1 IAH DFW 224 7 13 0
2 IAH DFW 224 6 9 0
3 IAH DFW 224 5 17 0
4 IAH DFW 224 9 22 0
5 IAH DFW 224 9 9 0
6 IAH DFW 224 6 13 0
7 IAH DFW 224 12 15 0
8 IAH DFW 224 7 12 0
9 IAH DFW 224 8 22 0
10 IAH DFW 224 6 19 0
# ... with 227,486 more rows
# Answer to last question: be concise!
select(hflights, - (DepTime:AirTime))
# A tibble: 227,496 x 15
Year Month DayofMonth DayOfWeek ArrDelay DepDelay Origin Dest Distance
<int> <int> <int> <int> <int> <int> <chr> <chr> <int>
1 2011 1 1 6 -10 0 IAH DFW 224
2 2011 1 2 7 -9 1 IAH DFW 224
3 2011 1 3 1 -8 -8 IAH DFW 224
4 2011 1 4 2 3 3 IAH DFW 224
5 2011 1 5 3 -3 5 IAH DFW 224
6 2011 1 6 4 -7 -1 IAH DFW 224
7 2011 1 7 5 -1 -1 IAH DFW 224
8 2011 1 8 6 -16 -5 IAH DFW 224
9 2011 1 9 7 44 43 IAH DFW 224
10 2011 1 10 1 43 43 IAH DFW 224
# ... with 227,486 more rows, and 6 more variables: TaxiIn <int>,
# TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
# Diverted <int>, Carrier <chr>
select()
dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call:
starts_with("X")
: every name that starts with “X”,ends_with("X")
: every name that ends with “X”,contains("X")
: every name that contains “X”,matches("X")
: every name that matches “X”, where “X” can be a regular expression,num_range("x", 1:5)
: the variables named x01, x02, x03, x04 and x05,one_of(x)
: every name that appears in x, which should be a character vector.Pay attention here: When you refer to columns directly inside select(), you don’t use quotes. If you use the helper functions, you do use quotes.
# As usual, hflights is pre-loaded as a tbl, together with the necessary libraries.
# Print out a tbl containing just ArrDelay and DepDelay
select(hflights, ends_with("Delay"))
# A tibble: 227,496 x 2
ArrDelay DepDelay
<int> <int>
1 -10 0
2 -9 1
3 -8 -8
4 3 3
5 -3 5
6 -7 -1
7 -1 -1
8 -16 -5
9 44 43
10 43 43
# ... with 227,486 more rows
# Print out a tbl as described in the second instruction, using both helper functions and variable names
select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancel"))
# A tibble: 227,496 x 5
UniqueCarrier FlightNum TailNum Cancelled CancellationCode
<chr> <int> <chr> <int> <chr>
1 AA 428 N576AA 0 ""
2 AA 428 N557AA 0 ""
3 AA 428 N541AA 0 ""
4 AA 428 N403AA 0 ""
5 AA 428 N492AA 0 ""
6 AA 428 N262AA 0 ""
7 AA 428 N493AA 0 ""
8 AA 428 N477AA 0 ""
9 AA 428 N476AA 0 ""
10 AA 428 N504AA 0 ""
# ... with 227,486 more rows
# Print out tbl as described in the third instruction, using only helper functions.
select(hflights, contains("Tim"), contains("Del"))
# A tibble: 227,496 x 6
DepTime ArrTime ActualElapsedTime AirTime ArrDelay DepDelay
<int> <int> <int> <int> <int> <int>
1 1400 1500 60 40 -10 0
2 1401 1501 60 45 -9 1
3 1352 1502 70 48 -8 -8
4 1403 1513 70 39 3 3
5 1405 1507 62 44 -3 5
6 1359 1503 64 45 -7 -1
7 1359 1509 70 43 -1 -1
8 1355 1454 59 40 -16 -5
9 1443 1554 71 41 44 43
10 1443 1553 70 45 43 43
# ... with 227,486 more rows
# both hflights and dplyr are available
# Finish select call so that ex1d matches ex1r
ex1r <- hflights[c("TaxiIn", "TaxiOut", "Distance")]
ex1d <- select(hflights, contains("Taxi"), Distance)
# Finish select call so that ex2d matches ex2r
ex2r <- hflights[c("Year", "Month", "DayOfWeek", "DepTime", "ArrTime")]
ex2d <- select(hflights, Year:ArrTime, -DayofMonth)
# Finish select call so that ex3d matches ex3r
ex3r <- hflights[c("TailNum", "TaxiIn", "TaxiOut")]
ex3d <- select(hflights, starts_with("T"))
mutate() is the second of five data manipulation functions you will get familiar with in this course. mutate() creates new columns which are added to a copy of the dataset.
Take this example that adds a new column, z, which is the element-wise sum of the columns x and y, to the data frame df:
mutate(df, z = x + y)
# hflights and dplyr are loaded and ready to serve you.
# Add the new variable ActualGroundTime to a copy of hflights and save the result as g1.
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)
# Add the new variable GroundTime to g1. Save the result as g2.
g2 <- mutate(g1, GroundTime = TaxiIn + TaxiOut)
# Add the new variable AverageSpeed to g2. Save the result as g3.
g3 <- mutate(g2, AverageSpeed = 60 * Distance / AirTime)
# Print out g3
g3
# A tibble: 227,496 x 25
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
<int> <int> <int> <int> <int> <int> <chr> <int>
1 2011 1 1 6 1400 1500 AA 428
2 2011 1 2 7 1401 1501 AA 428
3 2011 1 3 1 1352 1502 AA 428
4 2011 1 4 2 1403 1513 AA 428
5 2011 1 5 3 1405 1507 AA 428
6 2011 1 6 4 1359 1503 AA 428
7 2011 1 7 5 1359 1509 AA 428
8 2011 1 8 6 1355 1454 AA 428
9 2011 1 9 7 1443 1554 AA 428
10 2011 1 10 1 1443 1553 AA 428
# ... with 227,486 more rows, and 17 more variables: TailNum <chr>,
# ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
# DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
# TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
# Diverted <int>, Carrier <chr>, ActualGroundTime <int>,
# GroundTime <int>, AverageSpeed <dbl>
So far you’ve added variables to hflights one at a time, but you can also use mutate() to add multiple variables at once. To create more than one variable, place a comma between each variable that you define inside mutate().
mutate() even allows you to use a new variable while creating a next variable in the same call. In this example, the new variable x is directly reused to create the new variable y:
mutate(my_df, x = a + b, y = x + c)
# hflights and dplyr are ready, are you?
# Add a second variable loss_ratio to the dataset: m1
m1 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_ratio = loss / DepDelay)
# Add the three variables as described in the third instruction: m2
m2 <- mutate(hflights, TotalTaxi = TaxiIn + TaxiOut,
ActualGroundTime = ActualElapsedTime - AirTime,
Diff = TotalTaxi - ActualGroundTime)
Logical operators R comes with a set of logical operators that you can use inside filter():
x < y
, TRUE if x is less than yx <= y
, TRUE if x is less than or equal to yx == y
, TRUE if x equals yx != y
, TRUE if x does not equal yx >= y
, TRUE if x is greater than or equal to yx > y
, TRUE if x is greater than yx %in% c(a, b, c)
, TRUE if x is in the vector c(a, b, c)The following example filters df such that only the observations for which a is positive, are kept:
filter(df, a > 0)
# hflights is at your disposal as a tbl, with clean carrier names
# All flights that traveled 3000 miles or more
filter(hflights, Distance >= 3000)
# A tibble: 527 x 22
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
<int> <int> <int> <int> <int> <int> <chr> <int>
1 2011 1 31 1 924 1413 CO 1
2 2011 1 30 7 925 1410 CO 1
3 2011 1 29 6 1045 1445 CO 1
4 2011 1 28 5 1516 1916 CO 1
5 2011 1 27 4 950 1344 CO 1
6 2011 1 26 3 944 1350 CO 1
7 2011 1 25 2 924 1337 CO 1
8 2011 1 24 1 1144 1605 CO 1
9 2011 1 23 7 926 1335 CO 1
10 2011 1 22 6 942 1340 CO 1
# ... with 517 more rows, and 14 more variables: TailNum <chr>,
# ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
# DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
# TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
# Diverted <int>, Carrier <chr>
# All flights flown by JetBlue, Southwest, or Delta
filter(hflights, UniqueCarrier %in% c("JetBlue", "Southwest", "Delta"))
# A tibble: 0 x 22
# ... with 22 variables: Year <int>, Month <int>, DayofMonth <int>,
# DayOfWeek <int>, DepTime <int>, ArrTime <int>, UniqueCarrier <chr>,
# FlightNum <int>, TailNum <chr>, ActualElapsedTime <int>,
# AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
# Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
# Cancelled <int>, CancellationCode <chr>, Diverted <int>, Carrier <chr>
# All flights where taxiing took longer than flying
filter(hflights, TaxiIn + TaxiOut > AirTime)
# A tibble: 1,389 x 22
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
<int> <int> <int> <int> <int> <int> <chr> <int>
1 2011 1 24 1 731 904 AA 460
2 2011 1 30 7 1959 2132 AA 533
3 2011 1 24 1 1621 1749 AA 1121
4 2011 1 10 1 941 1113 AA 1436
5 2011 1 31 1 1301 1356 CO 241
6 2011 1 31 1 2113 2215 CO 1533
7 2011 1 31 1 1434 1539 CO 1541
8 2011 1 31 1 900 1006 CO 1583
9 2011 1 30 7 1304 1408 CO 241
10 2011 1 30 7 2004 2128 CO 423
# ... with 1,379 more rows, and 14 more variables: TailNum <chr>,
# ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
# DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
# TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
# Diverted <int>, Carrier <chr>
R also comes with a set of boolean operators that you can use to combine multiple logical tests into a single test. These include & (and), | (or), and ! (not). Instead of using the & operator, you can also pass several logical tests to filter(), separated by commas. The following two calls are completely equivalent:
filter(df, a > 0 & b > 0)
filter(df, a > 0, b > 0)
Next, is.na() will also come in handy. This example keeps the observations in df for which the variable x is not NA:
filter(df, !is.na(x))
# hflights is at your service as a tbl!
# All flights that departed before 5am or arrived after 10pm
filter(hflights, DepTime < 500 | ArrTime > 2200)
# A tibble: 27,799 x 22
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
<int> <int> <int> <int> <int> <int> <chr> <int>
1 2011 1 4 2 2100 2207 AA 533
2 2011 1 14 5 2119 2229 AA 533
3 2011 1 10 1 1934 2235 AA 1294
4 2011 1 26 3 1905 2211 AA 1294
5 2011 1 30 7 1856 2209 AA 1294
6 2011 1 9 7 1938 2228 AS 731
7 2011 1 31 1 1919 2231 CO 190
8 2011 1 31 1 2116 2344 CO 209
9 2011 1 31 1 1850 2211 CO 250
10 2011 1 31 1 2102 2216 CO 299
# ... with 27,789 more rows, and 14 more variables: TailNum <chr>,
# ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
# DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
# TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
# Diverted <int>, Carrier <chr>
# All flights that departed late but arrived ahead of schedule
filter(hflights, DepDelay > 0, ArrDelay < 0)
# A tibble: 27,712 x 22
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
<int> <int> <int> <int> <int> <int> <chr> <int>
1 2011 1 2 7 1401 1501 AA 428
2 2011 1 5 3 1405 1507 AA 428
3 2011 1 18 2 1408 1508 AA 428
4 2011 1 18 2 721 827 AA 460
5 2011 1 12 3 2015 2113 AA 533
6 2011 1 13 4 2020 2116 AA 533
7 2011 1 26 3 2009 2103 AA 533
8 2011 1 1 6 1631 1736 AA 1121
9 2011 1 10 1 1639 1740 AA 1121
10 2011 1 12 3 1631 1739 AA 1121
# ... with 27,702 more rows, and 14 more variables: TailNum <chr>,
# ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
# DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
# TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
# Diverted <int>, Carrier <chr>
# All flights that were cancelled after being delayed
filter(hflights, Cancelled == 1, DepDelay > 0)
# A tibble: 40 x 22
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
<int> <int> <int> <int> <int> <int> <chr> <int>
1 2011 1 26 3 1926 NA CO 310
2 2011 1 11 2 1100 NA US 944
3 2011 1 19 3 1811 NA XE 2376
4 2011 1 7 5 2028 NA XE 3050
5 2011 2 4 5 1638 NA AA 1121
6 2011 2 8 2 1057 NA CO 408
7 2011 2 2 3 802 NA XE 2189
8 2011 2 9 3 904 NA XE 2605
9 2011 2 1 2 1508 NA OO 5812
10 2011 3 31 4 1016 NA CO 586
# ... with 30 more rows, and 14 more variables: TailNum <chr>,
# ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
# DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
# TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
# Diverted <int>, Carrier <chr>
So far, you have learned three data manipulation functions in the dplyr package. Time for a summarizing exercise. You will generate a new dataset from the hflights dataset that contains some useful information on flights that had JFK airport as their destination. You will need select(), mutate() and filter().
# hflights is already available in the workspace
# Select the flights that had JFK as their destination: c1
c1 <- filter(hflights, Dest == "JFK")
# Combine the Year, Month and DayofMonth variables to create a Date column: c2
c2 <- mutate(c1, Date = paste(Year, Month, DayofMonth, sep = "-"))
# Print out a selection of columns of c2
select(c2, Date, DepTime, ArrTime, TailNum)
# A tibble: 695 x 4
Date DepTime ArrTime TailNum
<chr> <int> <int> <chr>
1 2011-1-1 654 1124 N324JB
2 2011-1-1 1639 2110 N324JB
3 2011-1-2 703 1113 N324JB
4 2011-1-2 1604 2040 N324JB
5 2011-1-3 659 1100 N229JB
6 2011-1-3 1801 2200 N206JB
7 2011-1-4 654 1103 N267JB
8 2011-1-4 1608 2034 N267JB
9 2011-1-5 700 1103 N708JB
10 2011-1-5 1544 1954 N644JB
# ... with 685 more rows
arrange() can be used to rearrange rows according to any type of data. If you pass arrange() a character variable, for example, R will rearrange the rows in alphabetical order according to values of the variable. If you pass a factor variable, R will rearrange the rows according to the order of the levels in your factor (running levels() on the variable reveals this order).
dtc has already been defined on the right. It’s up to you to write some arrange() expressions to display its contents appropriately!
# dplyr and the hflights tbl are available
# Definition of dtc
dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay))
# Arrange dtc by departure delays
arrange(dtc, DepDelay)
# A tibble: 68 x 22
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
<int> <int> <int> <int> <int> <int> <chr> <int>
1 2011 7 23 6 605 NA F9 225
2 2011 1 17 1 916 NA XE 3068
3 2011 12 1 4 541 NA US 282
4 2011 10 12 3 2022 NA MQ 3724
5 2011 7 29 5 1424 NA CO 1079
6 2011 9 29 4 1639 NA OO 2062
7 2011 2 9 3 555 NA MQ 3265
8 2011 5 9 1 715 NA OO 1177
9 2011 1 20 4 1413 NA UA 552
10 2011 1 17 1 831 NA WN 1
# ... with 58 more rows, and 14 more variables: TailNum <chr>,
# ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
# DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
# TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
# Diverted <int>, Carrier <chr>
# Arrange dtc so that cancellation reasons are grouped
arrange(dtc, CancellationCode)
# A tibble: 68 x 22
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
<int> <int> <int> <int> <int> <int> <chr> <int>
1 2011 1 20 4 1413 NA UA 552
2 2011 1 7 5 2028 NA XE 3050
3 2011 2 4 5 1638 NA AA 1121
4 2011 2 8 2 1057 NA CO 408
5 2011 2 1 2 1508 NA OO 5812
6 2011 2 21 1 2257 NA OO 1111
7 2011 2 9 3 555 NA MQ 3265
8 2011 3 18 5 727 NA UA 109
9 2011 4 4 1 1632 NA DL 8
10 2011 4 8 5 1608 NA WN 4
# ... with 58 more rows, and 14 more variables: TailNum <chr>,
# ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
# DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
# TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
# Diverted <int>, Carrier <chr>
# Arrange dtc according to carrier and departure delays
arrange(dtc, UniqueCarrier, DepDelay)
# A tibble: 68 x 22
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
<int> <int> <int> <int> <int> <int> <chr> <int>
1 2011 8 18 4 1808 NA AA 1294
2 2011 2 4 5 1638 NA AA 1121
3 2011 7 29 5 1424 NA CO 1079
4 2011 1 26 3 1703 NA CO 410
5 2011 8 11 4 1320 NA CO 1669
6 2011 7 25 1 1654 NA CO 1422
7 2011 1 26 3 1926 NA CO 310
8 2011 3 31 4 1016 NA CO 586
9 2011 2 8 2 1057 NA CO 408
10 2011 4 4 1 1632 NA DL 8
# ... with 58 more rows, and 14 more variables: TailNum <chr>,
# ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
# DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
# TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
# Diverted <int>, Carrier <chr>
By default, arrange() arranges the rows from smallest to largest. Rows with the smallest value of the variable will appear at the top of the data set. You can reverse this behavior with the desc() function. arrange() will reorder the rows from largest to smallest values of a variable if you wrap the variable name in desc() before passing it to arrange().
# dplyr and the hflights tbl are available
# Arrange according to carrier and decreasing departure delays
arrange(hflights, UniqueCarrier, desc(DepDelay))
# A tibble: 227,496 x 22
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
<int> <int> <int> <int> <int> <int> <chr> <int>
1 2011 12 12 1 650 808 AA 1740
2 2011 11 19 6 1752 1910 AA 1903
3 2011 12 22 4 1728 1848 AA 1903
4 2011 10 23 7 2305 2 AA 742
5 2011 9 27 2 1206 1300 AA 1948
6 2011 3 17 4 1647 1747 AA 1505
7 2011 6 21 2 955 1315 AA 466
8 2011 5 20 5 2359 130 AA 426
9 2011 4 19 2 2023 2142 AA 1925
10 2011 5 12 4 2133 53 AA 1294
# ... with 227,486 more rows, and 14 more variables: TailNum <chr>,
# ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
# DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
# TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
# Diverted <int>, Carrier <chr>
# Arrange flights by total delay (normal order).
arrange(hflights, DepDelay + ArrDelay)
# A tibble: 227,496 x 22
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
<int> <int> <int> <int> <int> <int> <chr> <int>
1 2011 7 3 7 1914 2039 XE 2804
2 2011 8 31 3 934 1039 OO 2040
3 2011 8 21 7 935 1039 OO 2001
4 2011 8 28 7 2059 2206 OO 2003
5 2011 8 29 1 935 1041 OO 2040
6 2011 12 25 7 741 926 OO 4591
7 2011 1 30 7 620 812 OO 4461
8 2011 8 3 3 1741 1810 XE 2603
9 2011 8 4 4 930 1041 OO 1171
10 2011 8 18 4 939 1043 OO 2001
# ... with 227,486 more rows, and 14 more variables: TailNum <chr>,
# ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
# DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
# TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
# Diverted <int>, Carrier <chr>
summarize(), the last of the 5 verbs, follows the same syntax as mutate(), but the resulting dataset consists of a single row instead of an entire new column in the case of mutate().
In contrast to the four other data manipulation functions, summarize() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.
# hflights and dplyr are loaded in the workspace
# Print out a summary with variables min_dist and max_dist
summarize(hflights, min_dist = min(Distance), max_dist = max(Distance))
# A tibble: 1 x 2
min_dist max_dist
<dbl> <dbl>
1 79 3904
# Print out a summary with variable max_div
summarize(filter(hflights, Diverted == 1), max_div = max(Distance))
# A tibble: 1 x 1
max_div
<dbl>
1 3904
You can use any function you like in summarize() so long as the function can take a vector of data and return a single number. R contains many aggregating functions, as dplyr calls them:
min(x)
- minimum value of vector x.max(x)
- maximum value of vector x.mean(x)
- mean value of vector x.median(x)
- median value of vector x.quantile(x, p)
- pth quantile of vector x.sd(x)
- standard deviation of vector x.var(x)
- variance of vector x.IQR(x)
- Inter Quartile Range (IQR) of vector x.diff(range(x))
- total range of vector x.# hflights is available
# Remove rows that have NA ArrDelay: temp1
temp1 <- filter(hflights, !is.na(ArrDelay))
# Generate summary about ArrDelay column of temp1
summarize(temp1,
earliest = min(ArrDelay),
average = mean(ArrDelay),
latest = max(ArrDelay),
sd = sd(ArrDelay))
# A tibble: 1 x 4
earliest average latest sd
<dbl> <dbl> <dbl> <dbl>
1 -70 7.09 978 30.7
# Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2
temp2 <- filter(hflights, !is.na(TaxiIn), !is.na(TaxiOut))
# Print the maximum taxiing difference of temp2 with summarize()
summarize(temp2, max_taxi_diff = max(abs(TaxiIn - TaxiOut)))
# A tibble: 1 x 1
max_taxi_diff
<int>
1 160
dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:
first(x)
- The first element of vector x.last(x)
- The last element of vector x.nth(x, n)
- The nth element of vector x.n()
- The number of rows in the data.frame or group of observations that summarize() describes.n_distinct(x)
- The number of unique values in vector x. Next to these dplyr-specific functions, you can also turn a logical test into an aggregating function with sum() or mean(). A logical test returns a vector of TRUE’s and FALSE’s. When you apply sum() or mean() to such a vector, R coerces each TRUE to a 1 and each FALSE to a 0. sum() then represents the total number of observations that passed the test; mean() represents the proportion.# hflights is available with full names for the carriers
# Generate summarizing statistics for hflights
summarize(hflights,
n_obs = n(),
n_carrier = n_distinct(UniqueCarrier),
n_dest = n_distinct(Dest))
# A tibble: 1 x 3
n_obs n_carrier n_dest
<int> <int> <int>
1 227496 15 116
# All American Airline flights
aa <- filter(hflights, UniqueCarrier == "American")
# Generate summarizing statistics for aa
summarize(aa,
n_flights = n(),
n_canc = sum(Cancelled == 1),
avg_delay = mean(ArrDelay, na.rm = TRUE))
# A tibble: 1 x 3
n_flights n_canc avg_delay
<int> <int> <dbl>
1 0 0 NaN
As another example of the %>%, have a look at the following two commands that are completely equivalent:
mean(c(1, 2, 3, NA), na.rm = TRUE)
[1] 2
c(1, 2, 3, NA) %>% mean(na.rm = TRUE)
[1] 2
The %>% operator allows you to extract the first argument of a function from the arguments list and put it in front of it, thus solving the Dagwood sandwich problem.
# hflights and dplyr are both loaded and ready to serve you
# Write the 'piped' version of the English sentences.
hflights %>%
mutate(diff = TaxiOut - TaxiIn) %>%
filter(!is.na(diff)) %>%
summarize(avg = mean(diff))
# A tibble: 1 x 1
avg
<dbl>
1 8.99
You can answer sophisticated questions by combining the verbs of dplyr. Over the next few exercises you will examine whether it sometimes makes sense to drive instead of fly. You will begin by making a data set that contains relevant variables. Then, you will find flights whose equivalent average velocity is lower than the velocity when traveling by car.
In the following instructions, you have to carry out a series of dplyr verbs on the hflights dataset. Make sure to use the %>% operator to chain them all together.
# Chain together mutate(), filter() and summarize()
hflights %>%
mutate(
RealTime = ActualElapsedTime + 100,
mph = 60 * Distance / RealTime
) %>%
filter(!is.na(mph), mph < 70) %>%
summarize(
n_less = n(),
n_dest = n_distinct(Dest),
min_dist = min(Distance),
max_dist = max(Distance)
)
# A tibble: 1 x 4
n_less n_dest min_dist max_dist
<int> <int> <dbl> <dbl>
1 6726 13 79 305
The previous exercise suggested that some flights might be less efficient than driving in terms of speed. But is speed all that matters? Flying imposes burdens on a traveler that driving does not. For example, airplane tickets are very expensive. Air travelers also need to limit what they bring on their trip and arrange for a pick up or a drop off. Given these burdens we might demand that a flight provide a large speed advantage over driving.
Let’s define preferable flights as flights that are at least 50% faster than driving, i.e. that travel 105 mph or greater in real time. Also, assume that cancelled or diverted flights are less preferable than driving.
The mutate() call from the previous exercise is already coded up; can you add more pipes and verbs to the command?
# Finish the command with a filter() and summarize() call
hflights %>%
mutate(
RealTime = ActualElapsedTime + 100,
mph = 60 * Distance / RealTime
) %>%
filter(mph < 105 | Cancelled == 1 | Diverted == 1) %>%
summarize(
n_non = n(),
n_dest = n_distinct(Dest),
min_dist = min(Distance),
max_dist = max(Distance)
)
# A tibble: 1 x 4
n_non n_dest min_dist max_dist
<int> <int> <dbl> <dbl>
1 42400 113 79 3904
Let’s use hflights to answer another question: How many flights were overnight flights?
# hflights and dplyr are loaded
# Count the number of overnight flights
hflights %>%
filter(!is.na(DepTime), !is.na(ArrTime), DepTime > ArrTime) %>%
summarize(num = n())
# A tibble: 1 x 1
num
<int>
1 2718
As Garrett explained, group_by() lets you define groups within your data set. Its influence becomes clear when calling summarize() on a grouped dataset: summarizing statistics are calculated for the different groups separately.
In this exercise, you are going to create an ordered per-carrier summary of hflights by combining group_by(), summarize() and arrange()
# hflights is in the workspace as a tbl, with translated carrier names
# Make an ordered per-carrier summary of hflights
hflights %>%
group_by(UniqueCarrier) %>%
summarize(
p_canc = 100 * mean(Cancelled == 1),
avg_delay = mean(ArrDelay, na.rm = TRUE)
) %>%
arrange(avg_delay, p_canc)
# A tibble: 15 x 3
UniqueCarrier p_canc avg_delay
<chr> <dbl> <dbl>
1 US 1.13 -0.631
2 AA 1.85 0.892
3 FL 0.982 1.85
4 AS 0 3.19
5 YV 1.27 4.01
6 DL 1.59 6.08
7 CO 0.678 6.10
8 MQ 2.90 7.15
9 EV 3.45 7.26
10 WN 1.55 7.59
11 F9 0.716 7.67
12 XE 1.55 8.19
13 OO 1.39 8.69
14 B6 2.59 9.86
15 UA 1.64 10.5
You can also combine group_by() with mutate(). When you mutate grouped data, mutate() will calculate the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, that calculates within-group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.
rank(c(21, 22, 24, 23))
has output
[1] 1 2 4 3
As with arrange(), rank() ranks values from the smallest to the largest.
# dplyr is loaded, hflights is loaded with translated carrier names
# Ordered overview of average arrival delays per carrier
hflights %>%
filter(!is.na(ArrDelay), ArrDelay > 0) %>%
group_by(UniqueCarrier) %>%
summarize(avg = mean(ArrDelay)) %>%
mutate(rank = rank(avg)) %>%
arrange(rank)
# A tibble: 15 x 3
UniqueCarrier avg rank
<chr> <dbl> <dbl>
1 YV 18.7 1
2 F9 18.7 2
3 US 20.7 3
4 CO 22.1 4
5 AS 22.9 5
6 OO 24.1 6
7 XE 24.2 7
8 WN 25.3 8
9 FL 27.9 9
10 AA 28.5 10
11 DL 32.1 11
12 UA 32.5 12
13 MQ 38.8 13
14 EV 40.2 14
15 B6 45.5 15
By now you’ve learned the fundamentals of dplyr: the five data manipulation verbs and the additional group_by() function to discover interesting group-wise statistics. The next challenges are an all-encompassing review of the concepts you have learned about. We already provided you with a template of the piped call that can solve the exercises. Up to you to finish all dplyr calls! For simplicity, you can include cancelled flights in your answers, so you shouldn’t filter based on the Cancelled column.
# dplyr and hflights (with translated carrier names) are pre-loaded
# How many airplanes only flew to one destination?
hflights %>%
group_by(TailNum) %>%
summarize(ndest = n_distinct(Dest)) %>%
filter(ndest == 1) %>%
summarize(nplanes = n())
# A tibble: 1 x 1
nplanes
<int>
1 1526
# Find the most visited destination for each carrier
hflights %>%
group_by(UniqueCarrier, Dest) %>%
summarize(n = n()) %>%
mutate(rank = rank(desc(n))) %>%
filter(rank == 1)
# A tibble: 15 x 4
# Groups: UniqueCarrier [15]
UniqueCarrier Dest n rank
<chr> <chr> <int> <dbl>
1 AA DFW 2105 1
2 AS SEA 365 1
3 B6 JFK 695 1
4 CO EWR 3924 1
5 DL ATL 2396 1
6 EV DTW 851 1
7 F9 DEN 837 1
8 FL ATL 2029 1
9 MQ DFW 2424 1
10 OO COS 1335 1
11 UA SFO 643 1
12 US CLT 2212 1
13 WN DAL 8243 1
14 XE CRP 3175 1
15 YV CLT 71 1
hflights2 is a copy of hflights that is saved as a data table. hflights2 was made available in the background using the following code:
library(data.table)
hflights2 <- as.data.table(hflights)
hflights2 contains all of the same information as hflights, but the information is stored in a different data structure. You can see this structure by typing hflights2 at the command line.
Even though hflights2 is a different data structure, you can use the same dplyr functions to manipulate hflights2 as you used to manipulate hflights.
# hflights2 is pre-loaded as a data.table
# Use summarize to calculate n_carrier
hflights2 %>% summarize(n_carrier = n_distinct(UniqueCarrier))
n_carrier
1 15
DataCamp hosts a mySQL database with data about flights that departed from New York City in 2013. The data is similar to the data in hflights, but it does not contain information about cancellations or diversions. With the tbl() function, we already created a reference to a table in this information.
Although nycflights is a reference to data that lives outside of R, you can use the dplyr commands on them as usual. Behind the scenes, dplyr will convert the commands to the database’s native language (in this case, SQL), and return the results. This allows you to pull data that is too large to fit in R: only the fraction of the data that you need will actually be downloaded into R, which will usually fit into R without memory issues.
library(RMySQL)
# Set up a connection to the mysql database
my_db <- src_mysql(dbname = "dplyr",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Reference a table within that source: nycflights
nycflights <- tbl(my_db, "dplyr")
# glimpse at nycflights
glimpse(nycflights)
Observations: ??
Variables: 17
Database: mysql 5.6.34-log [student@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
$ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013...
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55...
$ dep_delay <int> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,...
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8...
$ arr_delay <int> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,...
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"...
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N...
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301...
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG...
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA...
$ air_time <int> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149...
$ distance <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73...
$ hour <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6...
$ minute <int> 17, 33, 42, 44, 54, 54, 55, 57, 57, 58, 58, 58, 58, ...