I am following this document, Introduction to dplyr

When working with data you must,

Data hflights

Let’s explore this dataset,

# read the fucking manual
# ?hflights

# just figured this dataset was available as a standalone package
# install.packages("hflights")
library(hflights)

# now read the fucking manual
# ?hflights
head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0
dim(hflights)
## [1] 227496     21

Moving on with the document. tble_df() comes in handy as a wrapper around a data frame that won’t accidentally print a lot of stuff.

hflights_df <- tbl_df(hflights)
hflights_df
## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## *  <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          2         7    1401    1501            AA
## 3   2011     1          3         1    1352    1502            AA
## 4   2011     1          4         2    1403    1513            AA
## 5   2011     1          5         3    1405    1507            AA
## 6   2011     1          6         4    1359    1503            AA
## 7   2011     1          7         5    1359    1509            AA
## 8   2011     1          8         6    1355    1454            AA
## 9   2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

dplyr verbs

filter()

Works the same way as subset() but more intuitive to use. To filter all the flights on January 1st,

filter(hflights_df, Month == 1, DayofMonth == 1)
## # A tibble: 552 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          1         6     728     840            AA
## 3   2011     1          1         6    1631    1736            AA
## 4   2011     1          1         6    1756    2112            AA
## 5   2011     1          1         6    1012    1347            AA
## 6   2011     1          1         6    1211    1325            AA
## 7   2011     1          1         6     557     906            AA
## 8   2011     1          1         6    1824    2106            AS
## 9   2011     1          1         6     654    1124            B6
## 10  2011     1          1         6    1639    2110            B6
## # ... with 542 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

arrange()

Used to reorder rows.

arrange(hflights_df, DayofMonth, Month, Year)
## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          1         6     728     840            AA
## 3   2011     1          1         6    1631    1736            AA
## 4   2011     1          1         6    1756    2112            AA
## 5   2011     1          1         6    1012    1347            AA
## 6   2011     1          1         6    1211    1325            AA
## 7   2011     1          1         6     557     906            AA
## 8   2011     1          1         6    1824    2106            AS
## 9   2011     1          1         6     654    1124            B6
## 10  2011     1          1         6    1639    2110            B6
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

Use desc() for descending order

arrange(hflights_df, desc(ArrDelay))
## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011    12         12         1     650     808            AA
## 2   2011     8          1         1     156     452            CO
## 3   2011    11          8         2     721     948            MQ
## 4   2011     6         21         2    2334     124            UA
## 5   2011     5         20         5     858    1027            MQ
## 6   2011     6          9         4    2029    2243            MQ
## 7   2011     1         20         4     635     807            CO
## 8   2011     6         22         3     908    1040            CO
## 9   2011    12         13         2     706     824            MQ
## 10  2011    10         25         2    2310     149            DL
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

select()

Select colummns with select()

select(hflights_df, Year, Month, DayOfWeek)
## # A tibble: 227,496 x 3
##     Year Month DayOfWeek
## *  <int> <int>     <int>
## 1   2011     1         6
## 2   2011     1         7
## 3   2011     1         1
## 4   2011     1         2
## 5   2011     1         3
## 6   2011     1         4
## 7   2011     1         5
## 8   2011     1         6
## 9   2011     1         7
## 10  2011     1         1
## # ... with 227,486 more rows
# select columsn from Year to DayOfWeek
select(hflights_df, Year:DayOfWeek)
## # A tibble: 227,496 x 4
##     Year Month DayofMonth DayOfWeek
## *  <int> <int>      <int>     <int>
## 1   2011     1          1         6
## 2   2011     1          2         7
## 3   2011     1          3         1
## 4   2011     1          4         2
## 5   2011     1          5         3
## 6   2011     1          6         4
## 7   2011     1          7         5
## 8   2011     1          8         6
## 9   2011     1          9         7
## 10  2011     1         10         1
## # ... with 227,486 more rows
# select all columns exccept Year to DayOfWeek
select(hflights_df, -(Year:DayOfWeek))
## # A tibble: 227,496 x 17
##    DepTime ArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime
## *    <int>   <int>         <chr>     <int>   <chr>             <int>
## 1     1400    1500            AA       428  N576AA                60
## 2     1401    1501            AA       428  N557AA                60
## 3     1352    1502            AA       428  N541AA                70
## 4     1403    1513            AA       428  N403AA                70
## 5     1405    1507            AA       428  N492AA                62
## 6     1359    1503            AA       428  N262AA                64
## 7     1359    1509            AA       428  N493AA                70
## 8     1355    1454            AA       428  N477AA                59
## 9     1443    1554            AA       428  N476AA                71
## 10    1443    1553            AA       428  N504AA                70
## # ... with 227,486 more rows, and 11 more variables: AirTime <int>,
## #   ArrDelay <int>, DepDelay <int>, Origin <chr>, Dest <chr>,
## #   Distance <int>, TaxiIn <int>, TaxiOut <int>, Cancelled <int>,
## #   CancellationCode <chr>, Diverted <int>

mutate()

Add new columns which are functions of existing columns

mutate(hflights_df,
       gain = ArrDelay - DepDelay,
       speed = Distance / AirTime * 69)
## # A tibble: 227,496 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          2         7    1401    1501            AA
## 3   2011     1          3         1    1352    1502            AA
## 4   2011     1          4         2    1403    1513            AA
## 5   2011     1          5         3    1405    1507            AA
## 6   2011     1          6         4    1359    1503            AA
## 7   2011     1          7         5    1359    1509            AA
## 8   2011     1          8         6    1355    1454            AA
## 9   2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## # ... with 227,486 more rows, and 16 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, gain <int>, speed <dbl>

summarise()

Collapses data frame to a single row

summarise(hflights_df,
          delay = mean(DepDelay, na.rm = TRUE))
## # A tibble: 1 x 1
##      delay
##      <dbl>
## 1 9.444951

Grouped Operations

The true power of these functions lies when they are linked together, take this code snippet

planes <- group_by(hflights_df, TailNum)
delay <- summarise(planes,
                   count = n(),
                   dist = mean(Distance, na.rm = TRUE),
                   delay = mean(ArrDelay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

# plot them all!
ggplot(data = delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

Another example would be to group by destinations to see the number of planes and flights that go to each possible destination.

destinations <- group_by(hflights_df, Dest)
summarise(destinations,
          planes = length(unique(TailNum)),
          flights = n())
## # A tibble: 116 x 3
##     Dest planes flights
##    <chr>  <int>   <int>
## 1    ABQ    716    2812
## 2    AEX    215     724
## 3    AGS      1       1
## 4    AMA    158    1297
## 5    ANC     38     125
## 6    ASE     60     125
## 7    ATL    983    7886
## 8    AUS   1015    5022
## 9    AVL    142     350
## 10   BFL     70     504
## # ... with 106 more rows

I don’t completely understand how groups are made. I except the rows are bunched together because it doesn’t do anything to the size of data frame.

# man the fuck up
# ?group_by

The group_by function takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”.

That clarifies a lot. Operations are performed for each group instead of each row.

Mutliple levels of grouping

I can also group by multiple variables, each summary then peels off one level.

daily <- group_by(hflights_df, Year, Month, DayofMonth)
(per_day <- summarise(daily, flights = n()))
## Source: local data frame [365 x 4]
## Groups: Year, Month [?]
## 
##     Year Month DayofMonth flights
##    <int> <int>      <int>   <int>
## 1   2011     1          1     552
## 2   2011     1          2     678
## 3   2011     1          3     702
## 4   2011     1          4     583
## 5   2011     1          5     590
## 6   2011     1          6     660
## 7   2011     1          7     661
## 8   2011     1          8     500
## 9   2011     1          9     602
## 10  2011     1         10     659
## # ... with 355 more rows
per_month <- summarise(per_day, flights = sum(flights))
per_year <- summarise(per_month, flights = sum(flights))

That’s about it!