Data Manipulation in R with dplyr

Load the dplyr and hflights package

Welcome to the interactive exercises part of your dplyr course. Here you will learn the ins and outs of working with dplyr. dplyr is an R package, a collection of functions and data sets that enhance the R language.

Throughout this course you will use dplyr to analyze a data set of airline flight data containing flights that departed from Houston. This data is stored in a package called hflights.

Both dplyr and hflights are already installed on DataCamp’s servers, so loading them with library() will get you up and running.

# Load the dplyr package
library(dplyr)

# Load the hflights package
library(hflights)

Convert data.frame to tibble

As Garrett explained, a tbl (pronounced tibble) is just a special kind of data.frame. They make your data easier to look at, but also easier to work with. On top of this, it is straightforward to derive a tbl from a data.frame structure using as_tibble().

The tbl format changes how R displays your data, but it does not change the data’s underlying data structure. A tbl inherits the original class of its input, in this case, a data.frame. This means that you can still manipulate the tbl as if it were a data.frame. In other words, you can do anything with the hflights tbl that you could do with the hflights data.frame.

# Both the dplyr and hflights packages are loaded

# Convert the hflights data.frame into a hflights tbl
hflights <- as_tibble(hflights)

# Display the hflights tbl
hflights
# A tibble: 227,496 x 21
    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
 1  2011     1          1         6    1400    1500 AA                  428
 2  2011     1          2         7    1401    1501 AA                  428
 3  2011     1          3         1    1352    1502 AA                  428
 4  2011     1          4         2    1403    1513 AA                  428
 5  2011     1          5         3    1405    1507 AA                  428
 6  2011     1          6         4    1359    1503 AA                  428
 7  2011     1          7         5    1359    1509 AA                  428
 8  2011     1          8         6    1355    1454 AA                  428
 9  2011     1          9         7    1443    1554 AA                  428
10  2011     1         10         1    1443    1553 AA                  428
# ... with 227,486 more rows, and 13 more variables: TailNum <chr>,
#   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
#   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
#   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
#   Diverted <int>
# Create the object carriers
carriers <- hflights$UniqueCarrier

Changing labels of hflights

A bit of cleaning would be a good idea since the UniqueCarrier variable of hflights uses a confusing code system.

To do this, let’s work with a lookup table, that comes in the form of a named vector. When you subset the lookup table with a character string (like the character strings in UniqueCarrier), R will return the values of the lookup table that correspond to the names in the character string. To see how this works, run following code in the console:

two <- c("AA", "AS")
lut <- c("AA" = "American", 
         "AS" = "Alaska", 
         "B6" = "JetBlue")
two <- lut[two]
two
        AA         AS 
"American"   "Alaska" 
# Both the dplyr and hflights packages are loaded into workspace
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental", 
         "DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways", 
         "WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier", 
         "FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")

# Add the Carrier column to hflights
hflights$Carrier <- lut[hflights$UniqueCarrier]

# Glimpse at hflights
glimpse(hflights)
Observations: 227,496
Variables: 22
$ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
$ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
$ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
$ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
$ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
$ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428,...
$ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N49...
$ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
$ AirTime           <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
$ ArrDelay          <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
$ DepDelay          <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
$ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
$ Dest              <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
$ Distance          <int> 224, 224, 224, 224, 224, 224, 224, 224, 224,...
$ TaxiIn            <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
$ TaxiOut           <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
$ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", ...
$ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ Carrier           <chr> "American", "American", "American", "America...

Let’s try a similar thing, but this time to change the labels in the CancellationCode column. This column lists reasons why a flight was cancelled using a non-informative alphabetical code. Execute

unique(hflights$CancellationCode)
[1] ""  "A" "B" "C" "D"

A lookup table lut has already been created for you, that converts the alphabetical codes into more meaningful strings.

# The hflights tbl you built in the previous exercise is available in the workspace.

# The lookup table
lut <- c("A" = "carrier", "B" = "weather", "C" = "FFA", "D" = "security", "E" = "not cancelled")

# Add the Code column
# hflights$Code <- lut[hflights$CancellationCode]

# Glimpse at hflights
# glimpse(hflights)

Select

To answer the simple question whether flight delays tend to shrink or grow during a flight, we can safely discard a lot of the variables of each flight. To select only the ones that matter, we can use select().

As an example, take the following call, that selects the variables var1 and var2 from the data frame df.

select(df, var1, var2)

You can also use : to select a range of variables and - to exclude some variables, similar to indexing a data.frame with square brackets. You can use both variable’s names as well as integer indexes. This call selects the four first variables except for the second one of a data frame df:

select(df, 1:4, -2)

select() does not change the data frame it is called on; you have to explicitly assign the result of select() to a variable to store the result.

# hflights is pre-loaded as a tbl, together with the necessary libraries.

# Print out a tbl with the four columns of hflights related to delay
select(hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay)
# A tibble: 227,496 x 4
   ActualElapsedTime AirTime ArrDelay DepDelay
               <int>   <int>    <int>    <int>
 1                60      40      -10        0
 2                60      45       -9        1
 3                70      48       -8       -8
 4                70      39        3        3
 5                62      44       -3        5
 6                64      45       -7       -1
 7                70      43       -1       -1
 8                59      40      -16       -5
 9                71      41       44       43
10                70      45       43       43
# ... with 227,486 more rows
# Print out the columns Origin up to Cancelled of hflights
select(hflights, Origin:Cancelled)
# A tibble: 227,496 x 6
   Origin Dest  Distance TaxiIn TaxiOut Cancelled
   <chr>  <chr>    <int>  <int>   <int>     <int>
 1 IAH    DFW        224      7      13         0
 2 IAH    DFW        224      6       9         0
 3 IAH    DFW        224      5      17         0
 4 IAH    DFW        224      9      22         0
 5 IAH    DFW        224      9       9         0
 6 IAH    DFW        224      6      13         0
 7 IAH    DFW        224     12      15         0
 8 IAH    DFW        224      7      12         0
 9 IAH    DFW        224      8      22         0
10 IAH    DFW        224      6      19         0
# ... with 227,486 more rows
# Answer to last question: be concise!
select(hflights, - (DepTime:AirTime))
# A tibble: 227,496 x 15
    Year Month DayofMonth DayOfWeek ArrDelay DepDelay Origin Dest  Distance
   <int> <int>      <int>     <int>    <int>    <int> <chr>  <chr>    <int>
 1  2011     1          1         6      -10        0 IAH    DFW        224
 2  2011     1          2         7       -9        1 IAH    DFW        224
 3  2011     1          3         1       -8       -8 IAH    DFW        224
 4  2011     1          4         2        3        3 IAH    DFW        224
 5  2011     1          5         3       -3        5 IAH    DFW        224
 6  2011     1          6         4       -7       -1 IAH    DFW        224
 7  2011     1          7         5       -1       -1 IAH    DFW        224
 8  2011     1          8         6      -16       -5 IAH    DFW        224
 9  2011     1          9         7       44       43 IAH    DFW        224
10  2011     1         10         1       43       43 IAH    DFW        224
# ... with 227,486 more rows, and 6 more variables: TaxiIn <int>,
#   TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
#   Diverted <int>, Carrier <chr>

Functions inside select()

dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call:

  • starts_with("X"): every name that starts with “X”,
  • ends_with("X"): every name that ends with “X”,
  • contains("X"): every name that contains “X”,
  • matches("X"): every name that matches “X”, where “X” can be a regular expression,
  • num_range("x", 1:5): the variables named x01, x02, x03, x04 and x05,
  • one_of(x): every name that appears in x, which should be a character vector.

Pay attention here: When you refer to columns directly inside select(), you don’t use quotes. If you use the helper functions, you do use quotes.

# As usual, hflights is pre-loaded as a tbl, together with the necessary libraries.

# Print out a tbl containing just ArrDelay and DepDelay
select(hflights, ends_with("Delay"))
# A tibble: 227,496 x 2
   ArrDelay DepDelay
      <int>    <int>
 1      -10        0
 2       -9        1
 3       -8       -8
 4        3        3
 5       -3        5
 6       -7       -1
 7       -1       -1
 8      -16       -5
 9       44       43
10       43       43
# ... with 227,486 more rows
# Print out a tbl as described in the second instruction, using both helper functions and variable names
select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancel"))
# A tibble: 227,496 x 5
   UniqueCarrier FlightNum TailNum Cancelled CancellationCode
   <chr>             <int> <chr>       <int> <chr>           
 1 AA                  428 N576AA          0 ""              
 2 AA                  428 N557AA          0 ""              
 3 AA                  428 N541AA          0 ""              
 4 AA                  428 N403AA          0 ""              
 5 AA                  428 N492AA          0 ""              
 6 AA                  428 N262AA          0 ""              
 7 AA                  428 N493AA          0 ""              
 8 AA                  428 N477AA          0 ""              
 9 AA                  428 N476AA          0 ""              
10 AA                  428 N504AA          0 ""              
# ... with 227,486 more rows
# Print out tbl as described in the third instruction, using only helper functions.
select(hflights, contains("Tim"), contains("Del"))
# A tibble: 227,496 x 6
   DepTime ArrTime ActualElapsedTime AirTime ArrDelay DepDelay
     <int>   <int>             <int>   <int>    <int>    <int>
 1    1400    1500                60      40      -10        0
 2    1401    1501                60      45       -9        1
 3    1352    1502                70      48       -8       -8
 4    1403    1513                70      39        3        3
 5    1405    1507                62      44       -3        5
 6    1359    1503                64      45       -7       -1
 7    1359    1509                70      43       -1       -1
 8    1355    1454                59      40      -16       -5
 9    1443    1554                71      41       44       43
10    1443    1553                70      45       43       43
# ... with 227,486 more rows

Comparison to base R

# both hflights and dplyr are available

# Finish select call so that ex1d matches ex1r
ex1r <- hflights[c("TaxiIn", "TaxiOut", "Distance")]
ex1d <- select(hflights, contains("Taxi"), Distance)

# Finish select call so that ex2d matches ex2r
ex2r <- hflights[c("Year", "Month", "DayOfWeek", "DepTime", "ArrTime")]
ex2d <- select(hflights, Year:ArrTime, -DayofMonth)

# Finish select call so that ex3d matches ex3r
ex3r <- hflights[c("TailNum", "TaxiIn", "TaxiOut")]
ex3d <- select(hflights, starts_with("T"))

Mutate

mutate() is the second of five data manipulation functions you will get familiar with in this course. mutate() creates new columns which are added to a copy of the dataset.

Take this example that adds a new column, z, which is the element-wise sum of the columns x and y, to the data frame df:

mutate(df, z = x + y)
# hflights and dplyr are loaded and ready to serve you.

# Add the new variable ActualGroundTime to a copy of hflights and save the result as g1.
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)

# Add the new variable GroundTime to g1. Save the result as g2.
g2 <- mutate(g1, GroundTime = TaxiIn + TaxiOut)

# Add the new variable AverageSpeed to g2. Save the result as g3.
g3 <- mutate(g2, AverageSpeed = 60 * Distance / AirTime)

# Print out g3
g3
# A tibble: 227,496 x 25
    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
 1  2011     1          1         6    1400    1500 AA                  428
 2  2011     1          2         7    1401    1501 AA                  428
 3  2011     1          3         1    1352    1502 AA                  428
 4  2011     1          4         2    1403    1513 AA                  428
 5  2011     1          5         3    1405    1507 AA                  428
 6  2011     1          6         4    1359    1503 AA                  428
 7  2011     1          7         5    1359    1509 AA                  428
 8  2011     1          8         6    1355    1454 AA                  428
 9  2011     1          9         7    1443    1554 AA                  428
10  2011     1         10         1    1443    1553 AA                  428
# ... with 227,486 more rows, and 17 more variables: TailNum <chr>,
#   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
#   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
#   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
#   Diverted <int>, Carrier <chr>, ActualGroundTime <int>,
#   GroundTime <int>, AverageSpeed <dbl>

Adding multiple variables

So far you’ve added variables to hflights one at a time, but you can also use mutate() to add multiple variables at once. To create more than one variable, place a comma between each variable that you define inside mutate().

mutate() even allows you to use a new variable while creating a next variable in the same call. In this example, the new variable x is directly reused to create the new variable y:

mutate(my_df, x = a + b, y = x + c)
# hflights and dplyr are ready, are you?

# Add a second variable loss_ratio to the dataset: m1
m1 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_ratio = loss / DepDelay)

# Add the three variables as described in the third instruction: m2
m2 <- mutate(hflights, TotalTaxi = TaxiIn + TaxiOut, 
             ActualGroundTime = ActualElapsedTime - AirTime,
             Diff = TotalTaxi - ActualGroundTime)

Filter

Logical operators R comes with a set of logical operators that you can use inside filter():

  • x < y, TRUE if x is less than y
  • x <= y, TRUE if x is less than or equal to y
  • x == y, TRUE if x equals y
  • x != y, TRUE if x does not equal y
  • x >= y, TRUE if x is greater than or equal to y
  • x > y, TRUE if x is greater than y
  • x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)

The following example filters df such that only the observations for which a is positive, are kept:

filter(df, a > 0)
# hflights is at your disposal as a tbl, with clean carrier names

# All flights that traveled 3000 miles or more
filter(hflights, Distance >= 3000)
# A tibble: 527 x 22
    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
 1  2011     1         31         1     924    1413 CO                    1
 2  2011     1         30         7     925    1410 CO                    1
 3  2011     1         29         6    1045    1445 CO                    1
 4  2011     1         28         5    1516    1916 CO                    1
 5  2011     1         27         4     950    1344 CO                    1
 6  2011     1         26         3     944    1350 CO                    1
 7  2011     1         25         2     924    1337 CO                    1
 8  2011     1         24         1    1144    1605 CO                    1
 9  2011     1         23         7     926    1335 CO                    1
10  2011     1         22         6     942    1340 CO                    1
# ... with 517 more rows, and 14 more variables: TailNum <chr>,
#   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
#   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
#   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
#   Diverted <int>, Carrier <chr>
# All flights flown by JetBlue, Southwest, or Delta
filter(hflights, UniqueCarrier %in% c("JetBlue", "Southwest", "Delta"))
# A tibble: 0 x 22
# ... with 22 variables: Year <int>, Month <int>, DayofMonth <int>,
#   DayOfWeek <int>, DepTime <int>, ArrTime <int>, UniqueCarrier <chr>,
#   FlightNum <int>, TailNum <chr>, ActualElapsedTime <int>,
#   AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
#   Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
#   Cancelled <int>, CancellationCode <chr>, Diverted <int>, Carrier <chr>
# All flights where taxiing took longer than flying
filter(hflights, TaxiIn + TaxiOut > AirTime)
# A tibble: 1,389 x 22
    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
 1  2011     1         24         1     731     904 AA                  460
 2  2011     1         30         7    1959    2132 AA                  533
 3  2011     1         24         1    1621    1749 AA                 1121
 4  2011     1         10         1     941    1113 AA                 1436
 5  2011     1         31         1    1301    1356 CO                  241
 6  2011     1         31         1    2113    2215 CO                 1533
 7  2011     1         31         1    1434    1539 CO                 1541
 8  2011     1         31         1     900    1006 CO                 1583
 9  2011     1         30         7    1304    1408 CO                  241
10  2011     1         30         7    2004    2128 CO                  423
# ... with 1,379 more rows, and 14 more variables: TailNum <chr>,
#   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
#   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
#   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
#   Diverted <int>, Carrier <chr>

Combining tests using boolean operators

R also comes with a set of boolean operators that you can use to combine multiple logical tests into a single test. These include & (and), | (or), and ! (not). Instead of using the & operator, you can also pass several logical tests to filter(), separated by commas. The following two calls are completely equivalent:

filter(df, a > 0 & b > 0)
filter(df, a > 0, b > 0)

Next, is.na() will also come in handy. This example keeps the observations in df for which the variable x is not NA:

filter(df, !is.na(x))
# hflights is at your service as a tbl!

# All flights that departed before 5am or arrived after 10pm
filter(hflights, DepTime < 500 | ArrTime > 2200)
# A tibble: 27,799 x 22
    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
 1  2011     1          4         2    2100    2207 AA                  533
 2  2011     1         14         5    2119    2229 AA                  533
 3  2011     1         10         1    1934    2235 AA                 1294
 4  2011     1         26         3    1905    2211 AA                 1294
 5  2011     1         30         7    1856    2209 AA                 1294
 6  2011     1          9         7    1938    2228 AS                  731
 7  2011     1         31         1    1919    2231 CO                  190
 8  2011     1         31         1    2116    2344 CO                  209
 9  2011     1         31         1    1850    2211 CO                  250
10  2011     1         31         1    2102    2216 CO                  299
# ... with 27,789 more rows, and 14 more variables: TailNum <chr>,
#   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
#   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
#   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
#   Diverted <int>, Carrier <chr>
# All flights that departed late but arrived ahead of schedule
filter(hflights, DepDelay > 0, ArrDelay < 0)
# A tibble: 27,712 x 22
    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
 1  2011     1          2         7    1401    1501 AA                  428
 2  2011     1          5         3    1405    1507 AA                  428
 3  2011     1         18         2    1408    1508 AA                  428
 4  2011     1         18         2     721     827 AA                  460
 5  2011     1         12         3    2015    2113 AA                  533
 6  2011     1         13         4    2020    2116 AA                  533
 7  2011     1         26         3    2009    2103 AA                  533
 8  2011     1          1         6    1631    1736 AA                 1121
 9  2011     1         10         1    1639    1740 AA                 1121
10  2011     1         12         3    1631    1739 AA                 1121
# ... with 27,702 more rows, and 14 more variables: TailNum <chr>,
#   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
#   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
#   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
#   Diverted <int>, Carrier <chr>
# All flights that were cancelled after being delayed
filter(hflights, Cancelled == 1, DepDelay > 0)
# A tibble: 40 x 22
    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
 1  2011     1         26         3    1926      NA CO                  310
 2  2011     1         11         2    1100      NA US                  944
 3  2011     1         19         3    1811      NA XE                 2376
 4  2011     1          7         5    2028      NA XE                 3050
 5  2011     2          4         5    1638      NA AA                 1121
 6  2011     2          8         2    1057      NA CO                  408
 7  2011     2          2         3     802      NA XE                 2189
 8  2011     2          9         3     904      NA XE                 2605
 9  2011     2          1         2    1508      NA OO                 5812
10  2011     3         31         4    1016      NA CO                  586
# ... with 30 more rows, and 14 more variables: TailNum <chr>,
#   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
#   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
#   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
#   Diverted <int>, Carrier <chr>

General exercise

So far, you have learned three data manipulation functions in the dplyr package. Time for a summarizing exercise. You will generate a new dataset from the hflights dataset that contains some useful information on flights that had JFK airport as their destination. You will need select(), mutate() and filter().

# hflights is already available in the workspace

# Select the flights that had JFK as their destination: c1
c1 <- filter(hflights, Dest == "JFK")

# Combine the Year, Month and DayofMonth variables to create a Date column: c2
c2 <- mutate(c1, Date = paste(Year, Month, DayofMonth, sep = "-"))

# Print out a selection of columns of c2
select(c2, Date, DepTime, ArrTime, TailNum)
# A tibble: 695 x 4
   Date     DepTime ArrTime TailNum
   <chr>      <int>   <int> <chr>  
 1 2011-1-1     654    1124 N324JB 
 2 2011-1-1    1639    2110 N324JB 
 3 2011-1-2     703    1113 N324JB 
 4 2011-1-2    1604    2040 N324JB 
 5 2011-1-3     659    1100 N229JB 
 6 2011-1-3    1801    2200 N206JB 
 7 2011-1-4     654    1103 N267JB 
 8 2011-1-4    1608    2034 N267JB 
 9 2011-1-5     700    1103 N708JB 
10 2011-1-5    1544    1954 N644JB 
# ... with 685 more rows

Arrange

arrange() can be used to rearrange rows according to any type of data. If you pass arrange() a character variable, for example, R will rearrange the rows in alphabetical order according to values of the variable. If you pass a factor variable, R will rearrange the rows according to the order of the levels in your factor (running levels() on the variable reveals this order).

dtc has already been defined on the right. It’s up to you to write some arrange() expressions to display its contents appropriately!

# dplyr and the hflights tbl are available

# Definition of dtc
dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay))

# Arrange dtc by departure delays
arrange(dtc, DepDelay)
# A tibble: 68 x 22
    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
 1  2011     7         23         6     605      NA F9                  225
 2  2011     1         17         1     916      NA XE                 3068
 3  2011    12          1         4     541      NA US                  282
 4  2011    10         12         3    2022      NA MQ                 3724
 5  2011     7         29         5    1424      NA CO                 1079
 6  2011     9         29         4    1639      NA OO                 2062
 7  2011     2          9         3     555      NA MQ                 3265
 8  2011     5          9         1     715      NA OO                 1177
 9  2011     1         20         4    1413      NA UA                  552
10  2011     1         17         1     831      NA WN                    1
# ... with 58 more rows, and 14 more variables: TailNum <chr>,
#   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
#   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
#   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
#   Diverted <int>, Carrier <chr>
# Arrange dtc so that cancellation reasons are grouped
arrange(dtc, CancellationCode)
# A tibble: 68 x 22
    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
 1  2011     1         20         4    1413      NA UA                  552
 2  2011     1          7         5    2028      NA XE                 3050
 3  2011     2          4         5    1638      NA AA                 1121
 4  2011     2          8         2    1057      NA CO                  408
 5  2011     2          1         2    1508      NA OO                 5812
 6  2011     2         21         1    2257      NA OO                 1111
 7  2011     2          9         3     555      NA MQ                 3265
 8  2011     3         18         5     727      NA UA                  109
 9  2011     4          4         1    1632      NA DL                    8
10  2011     4          8         5    1608      NA WN                    4
# ... with 58 more rows, and 14 more variables: TailNum <chr>,
#   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
#   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
#   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
#   Diverted <int>, Carrier <chr>
# Arrange dtc according to carrier and departure delays
arrange(dtc, UniqueCarrier, DepDelay)
# A tibble: 68 x 22
    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
 1  2011     8         18         4    1808      NA AA                 1294
 2  2011     2          4         5    1638      NA AA                 1121
 3  2011     7         29         5    1424      NA CO                 1079
 4  2011     1         26         3    1703      NA CO                  410
 5  2011     8         11         4    1320      NA CO                 1669
 6  2011     7         25         1    1654      NA CO                 1422
 7  2011     1         26         3    1926      NA CO                  310
 8  2011     3         31         4    1016      NA CO                  586
 9  2011     2          8         2    1057      NA CO                  408
10  2011     4          4         1    1632      NA DL                    8
# ... with 58 more rows, and 14 more variables: TailNum <chr>,
#   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
#   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
#   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
#   Diverted <int>, Carrier <chr>

By default, arrange() arranges the rows from smallest to largest. Rows with the smallest value of the variable will appear at the top of the data set. You can reverse this behavior with the desc() function. arrange() will reorder the rows from largest to smallest values of a variable if you wrap the variable name in desc() before passing it to arrange().

# dplyr and the hflights tbl are available

# Arrange according to carrier and decreasing departure delays
arrange(hflights, UniqueCarrier, desc(DepDelay))
# A tibble: 227,496 x 22
    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
 1  2011    12         12         1     650     808 AA                 1740
 2  2011    11         19         6    1752    1910 AA                 1903
 3  2011    12         22         4    1728    1848 AA                 1903
 4  2011    10         23         7    2305       2 AA                  742
 5  2011     9         27         2    1206    1300 AA                 1948
 6  2011     3         17         4    1647    1747 AA                 1505
 7  2011     6         21         2     955    1315 AA                  466
 8  2011     5         20         5    2359     130 AA                  426
 9  2011     4         19         2    2023    2142 AA                 1925
10  2011     5         12         4    2133      53 AA                 1294
# ... with 227,486 more rows, and 14 more variables: TailNum <chr>,
#   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
#   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
#   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
#   Diverted <int>, Carrier <chr>
# Arrange flights by total delay (normal order).
arrange(hflights, DepDelay + ArrDelay)
# A tibble: 227,496 x 22
    Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
   <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
 1  2011     7          3         7    1914    2039 XE                 2804
 2  2011     8         31         3     934    1039 OO                 2040
 3  2011     8         21         7     935    1039 OO                 2001
 4  2011     8         28         7    2059    2206 OO                 2003
 5  2011     8         29         1     935    1041 OO                 2040
 6  2011    12         25         7     741     926 OO                 4591
 7  2011     1         30         7     620     812 OO                 4461
 8  2011     8          3         3    1741    1810 XE                 2603
 9  2011     8          4         4     930    1041 OO                 1171
10  2011     8         18         4     939    1043 OO                 2001
# ... with 227,486 more rows, and 14 more variables: TailNum <chr>,
#   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
#   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
#   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
#   Diverted <int>, Carrier <chr>

Summarize

summarize(), the last of the 5 verbs, follows the same syntax as mutate(), but the resulting dataset consists of a single row instead of an entire new column in the case of mutate().

In contrast to the four other data manipulation functions, summarize() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.

# hflights and dplyr are loaded in the workspace

# Print out a summary with variables min_dist and max_dist
summarize(hflights, min_dist = min(Distance), max_dist = max(Distance))
# A tibble: 1 x 2
  min_dist max_dist
     <dbl>    <dbl>
1       79     3904
# Print out a summary with variable max_div
summarize(filter(hflights, Diverted == 1), max_div = max(Distance))
# A tibble: 1 x 1
  max_div
    <dbl>
1    3904

You can use any function you like in summarize() so long as the function can take a vector of data and return a single number. R contains many aggregating functions, as dplyr calls them:

  • min(x) - minimum value of vector x.
  • max(x) - maximum value of vector x.
  • mean(x) - mean value of vector x.
  • median(x) - median value of vector x.
  • quantile(x, p) - pth quantile of vector x.
  • sd(x) - standard deviation of vector x.
  • var(x) - variance of vector x.
  • IQR(x) - Inter Quartile Range (IQR) of vector x.
  • diff(range(x)) - total range of vector x.
# hflights is available

# Remove rows that have NA ArrDelay: temp1
temp1 <- filter(hflights, !is.na(ArrDelay))

# Generate summary about ArrDelay column of temp1
summarize(temp1, 
          earliest = min(ArrDelay), 
          average = mean(ArrDelay), 
          latest = max(ArrDelay), 
          sd = sd(ArrDelay))
# A tibble: 1 x 4
  earliest average latest    sd
     <dbl>   <dbl>  <dbl> <dbl>
1      -70    7.09    978  30.7
# Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2
temp2 <- filter(hflights, !is.na(TaxiIn), !is.na(TaxiOut))

# Print the maximum taxiing difference of temp2 with summarize()
summarize(temp2, max_taxi_diff = max(abs(TaxiIn - TaxiOut)))
# A tibble: 1 x 1
  max_taxi_diff
          <int>
1           160

dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:

  • first(x) - The first element of vector x.
  • last(x) - The last element of vector x.
  • nth(x, n) - The nth element of vector x.
  • n() - The number of rows in the data.frame or group of observations that summarize() describes.
  • n_distinct(x) - The number of unique values in vector x. Next to these dplyr-specific functions, you can also turn a logical test into an aggregating function with sum() or mean(). A logical test returns a vector of TRUE’s and FALSE’s. When you apply sum() or mean() to such a vector, R coerces each TRUE to a 1 and each FALSE to a 0. sum() then represents the total number of observations that passed the test; mean() represents the proportion.
# hflights is available with full names for the carriers

# Generate summarizing statistics for hflights
summarize(hflights, 
          n_obs = n(), 
          n_carrier = n_distinct(UniqueCarrier), 
          n_dest = n_distinct(Dest))
# A tibble: 1 x 3
   n_obs n_carrier n_dest
   <int>     <int>  <int>
1 227496        15    116
# All American Airline flights
aa <- filter(hflights, UniqueCarrier == "American")

# Generate summarizing statistics for aa 
summarize(aa, 
          n_flights = n(), 
          n_canc = sum(Cancelled == 1),
          avg_delay = mean(ArrDelay, na.rm = TRUE))
# A tibble: 1 x 3
  n_flights n_canc avg_delay
      <int>  <int>     <dbl>
1         0      0       NaN

Pipe

As another example of the %>%, have a look at the following two commands that are completely equivalent:

mean(c(1, 2, 3, NA), na.rm = TRUE)
[1] 2
c(1, 2, 3, NA) %>% mean(na.rm = TRUE)
[1] 2

The %>% operator allows you to extract the first argument of a function from the arguments list and put it in front of it, thus solving the Dagwood sandwich problem.

# hflights and dplyr are both loaded and ready to serve you

# Write the 'piped' version of the English sentences.
hflights %>%
  mutate(diff = TaxiOut - TaxiIn) %>%
  filter(!is.na(diff)) %>%
  summarize(avg = mean(diff))
# A tibble: 1 x 1
    avg
  <dbl>
1  8.99

You can answer sophisticated questions by combining the verbs of dplyr. Over the next few exercises you will examine whether it sometimes makes sense to drive instead of fly. You will begin by making a data set that contains relevant variables. Then, you will find flights whose equivalent average velocity is lower than the velocity when traveling by car.

In the following instructions, you have to carry out a series of dplyr verbs on the hflights dataset. Make sure to use the %>% operator to chain them all together.

# Chain together mutate(), filter() and summarize()
hflights %>%
  mutate(
    RealTime = ActualElapsedTime + 100, 
    mph = 60 * Distance / RealTime
  ) %>%
  filter(!is.na(mph), mph < 70) %>%
  summarize(
    n_less = n(), 
    n_dest = n_distinct(Dest), 
    min_dist = min(Distance), 
    max_dist = max(Distance)
  )
# A tibble: 1 x 4
  n_less n_dest min_dist max_dist
   <int>  <int>    <dbl>    <dbl>
1   6726     13       79      305

The previous exercise suggested that some flights might be less efficient than driving in terms of speed. But is speed all that matters? Flying imposes burdens on a traveler that driving does not. For example, airplane tickets are very expensive. Air travelers also need to limit what they bring on their trip and arrange for a pick up or a drop off. Given these burdens we might demand that a flight provide a large speed advantage over driving.

Let’s define preferable flights as flights that are at least 50% faster than driving, i.e. that travel 105 mph or greater in real time. Also, assume that cancelled or diverted flights are less preferable than driving.

The mutate() call from the previous exercise is already coded up; can you add more pipes and verbs to the command?

# Finish the command with a filter() and summarize() call
hflights %>%
  mutate(
    RealTime = ActualElapsedTime + 100, 
    mph = 60 * Distance / RealTime
  ) %>%
  filter(mph < 105 | Cancelled == 1 | Diverted == 1) %>%
  summarize(
    n_non = n(), 
    n_dest = n_distinct(Dest), 
    min_dist = min(Distance), 
    max_dist = max(Distance)
  )
# A tibble: 1 x 4
  n_non n_dest min_dist max_dist
  <int>  <int>    <dbl>    <dbl>
1 42400    113       79     3904

Let’s use hflights to answer another question: How many flights were overnight flights?

# hflights and dplyr are loaded

# Count the number of overnight flights
hflights %>%
  filter(!is.na(DepTime), !is.na(ArrTime), DepTime > ArrTime) %>%
  summarize(num = n())
# A tibble: 1 x 1
    num
  <int>
1  2718

Group_by

As Garrett explained, group_by() lets you define groups within your data set. Its influence becomes clear when calling summarize() on a grouped dataset: summarizing statistics are calculated for the different groups separately.

In this exercise, you are going to create an ordered per-carrier summary of hflights by combining group_by(), summarize() and arrange()

# hflights is in the workspace as a tbl, with translated carrier names

# Make an ordered per-carrier summary of hflights
hflights %>%
   group_by(UniqueCarrier) %>%
   summarize(
     p_canc = 100 * mean(Cancelled == 1), 
     avg_delay = mean(ArrDelay, na.rm = TRUE)
   ) %>%
   arrange(avg_delay, p_canc)
# A tibble: 15 x 3
   UniqueCarrier p_canc avg_delay
   <chr>          <dbl>     <dbl>
 1 US             1.13     -0.631
 2 AA             1.85      0.892
 3 FL             0.982     1.85 
 4 AS             0         3.19 
 5 YV             1.27      4.01 
 6 DL             1.59      6.08 
 7 CO             0.678     6.10 
 8 MQ             2.90      7.15 
 9 EV             3.45      7.26 
10 WN             1.55      7.59 
11 F9             0.716     7.67 
12 XE             1.55      8.19 
13 OO             1.39      8.69 
14 B6             2.59      9.86 
15 UA             1.64     10.5  

Combine group_by with mutate

You can also combine group_by() with mutate(). When you mutate grouped data, mutate() will calculate the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, that calculates within-group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.

rank(c(21, 22, 24, 23))

has output

[1] 1 2 4 3

As with arrange(), rank() ranks values from the smallest to the largest.

# dplyr is loaded, hflights is loaded with translated carrier names

# Ordered overview of average arrival delays per carrier
hflights %>%
  filter(!is.na(ArrDelay), ArrDelay > 0) %>%
  group_by(UniqueCarrier) %>%
  summarize(avg = mean(ArrDelay)) %>%
  mutate(rank = rank(avg)) %>%
  arrange(rank)
# A tibble: 15 x 3
   UniqueCarrier   avg  rank
   <chr>         <dbl> <dbl>
 1 YV             18.7     1
 2 F9             18.7     2
 3 US             20.7     3
 4 CO             22.1     4
 5 AS             22.9     5
 6 OO             24.1     6
 7 XE             24.2     7
 8 WN             25.3     8
 9 FL             27.9     9
10 AA             28.5    10
11 DL             32.1    11
12 UA             32.5    12
13 MQ             38.8    13
14 EV             40.2    14
15 B6             45.5    15

Advanced group_by exercises

By now you’ve learned the fundamentals of dplyr: the five data manipulation verbs and the additional group_by() function to discover interesting group-wise statistics. The next challenges are an all-encompassing review of the concepts you have learned about. We already provided you with a template of the piped call that can solve the exercises. Up to you to finish all dplyr calls! For simplicity, you can include cancelled flights in your answers, so you shouldn’t filter based on the Cancelled column.

# dplyr and hflights (with translated carrier names) are pre-loaded

# How many airplanes only flew to one destination?
hflights %>%
  group_by(TailNum) %>%
  summarize(ndest = n_distinct(Dest)) %>%
  filter(ndest == 1) %>%
  summarize(nplanes = n())
# A tibble: 1 x 1
  nplanes
    <int>
1    1526
# Find the most visited destination for each carrier
hflights %>% 
  group_by(UniqueCarrier, Dest) %>%
  summarize(n = n()) %>%
  mutate(rank = rank(desc(n))) %>%
  filter(rank == 1)
# A tibble: 15 x 4
# Groups:   UniqueCarrier [15]
   UniqueCarrier Dest      n  rank
   <chr>         <chr> <int> <dbl>
 1 AA            DFW    2105     1
 2 AS            SEA     365     1
 3 B6            JFK     695     1
 4 CO            EWR    3924     1
 5 DL            ATL    2396     1
 6 EV            DTW     851     1
 7 F9            DEN     837     1
 8 FL            ATL    2029     1
 9 MQ            DFW    2424     1
10 OO            COS    1335     1
11 UA            SFO     643     1
12 US            CLT    2212     1
13 WN            DAL    8243     1
14 XE            CRP    3175     1
15 YV            CLT      71     1

Dplyr and databases

dplyr deals with different types

hflights2 is a copy of hflights that is saved as a data table. hflights2 was made available in the background using the following code:

library(data.table)
hflights2 <- as.data.table(hflights)

hflights2 contains all of the same information as hflights, but the information is stored in a different data structure. You can see this structure by typing hflights2 at the command line.

Even though hflights2 is a different data structure, you can use the same dplyr functions to manipulate hflights2 as you used to manipulate hflights.

# hflights2 is pre-loaded as a data.table

# Use summarize to calculate n_carrier
hflights2 %>% summarize(n_carrier = n_distinct(UniqueCarrier))
  n_carrier
1        15

dplyr and mySQL databases

DataCamp hosts a mySQL database with data about flights that departed from New York City in 2013. The data is similar to the data in hflights, but it does not contain information about cancellations or diversions. With the tbl() function, we already created a reference to a table in this information.

Although nycflights is a reference to data that lives outside of R, you can use the dplyr commands on them as usual. Behind the scenes, dplyr will convert the commands to the database’s native language (in this case, SQL), and return the results. This allows you to pull data that is too large to fit in R: only the fraction of the data that you need will actually be downloaded into R, which will usually fit into R without memory issues.

library(RMySQL)
# Set up a connection to the mysql database
my_db <- src_mysql(dbname = "dplyr", 
                   host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                   port = 3306, 
                   user = "student",
                   password = "datacamp")

# Reference a table within that source: nycflights
nycflights <- tbl(my_db, "dplyr")

# glimpse at nycflights
glimpse(nycflights)
Observations: ??
Variables: 17
Database: mysql 5.6.34-log [student@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
$ id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
$ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013...
$ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ dep_time  <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55...
$ dep_delay <int> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,...
$ arr_time  <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8...
$ arr_delay <int> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,...
$ carrier   <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"...
$ tailnum   <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N...
$ flight    <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301...
$ origin    <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG...
$ dest      <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA...
$ air_time  <int> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149...
$ distance  <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73...
$ hour      <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6...
$ minute    <int> 17, 33, 42, 44, 54, 54, 55, 57, 57, 58, 58, 58, 58, ...