Introduction to dplyr

I am following this document, Introduction to dplyr

When working with data you must,

Figure out what you want to do
Precise describe what you want in the form of a computer program
Execute the code

Data `hflights`

Let’s explore this dataset,

# read the fucking manual
# ?hflights

# just figured this dataset was available as a standalone package
# install.packages("hflights")
library(hflights)

# now read the fucking manual
# ?hflights
head(hflights)

##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0

dim(hflights)

## [1] 227496     21

Moving on with the document. tble_df() comes in handy as a wrapper around a data frame that won’t accidentally print a lot of stuff.

hflights_df <- tbl_df(hflights)
hflights_df

## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## *  <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          2         7    1401    1501            AA
## 3   2011     1          3         1    1352    1502            AA
## 4   2011     1          4         2    1403    1513            AA
## 5   2011     1          5         3    1405    1507            AA
## 6   2011     1          6         4    1359    1503            AA
## 7   2011     1          7         5    1359    1509            AA
## 8   2011     1          8         6    1355    1454            AA
## 9   2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

dplyr verbs

`filter()`

Works the same way as subset() but more intuitive to use. To filter all the flights on January 1st,

filter(hflights_df, Month == 1, DayofMonth == 1)

## # A tibble: 552 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          1         6     728     840            AA
## 3   2011     1          1         6    1631    1736            AA
## 4   2011     1          1         6    1756    2112            AA
## 5   2011     1          1         6    1012    1347            AA
## 6   2011     1          1         6    1211    1325            AA
## 7   2011     1          1         6     557     906            AA
## 8   2011     1          1         6    1824    2106            AS
## 9   2011     1          1         6     654    1124            B6
## 10  2011     1          1         6    1639    2110            B6
## # ... with 542 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

`arrange()`

Used to reorder rows.

arrange(hflights_df, DayofMonth, Month, Year)

## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          1         6     728     840            AA
## 3   2011     1          1         6    1631    1736            AA
## 4   2011     1          1         6    1756    2112            AA
## 5   2011     1          1         6    1012    1347            AA
## 6   2011     1          1         6    1211    1325            AA
## 7   2011     1          1         6     557     906            AA
## 8   2011     1          1         6    1824    2106            AS
## 9   2011     1          1         6     654    1124            B6
## 10  2011     1          1         6    1639    2110            B6
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

Use desc() for descending order

arrange(hflights_df, desc(ArrDelay))

## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011    12         12         1     650     808            AA
## 2   2011     8          1         1     156     452            CO
## 3   2011    11          8         2     721     948            MQ
## 4   2011     6         21         2    2334     124            UA
## 5   2011     5         20         5     858    1027            MQ
## 6   2011     6          9         4    2029    2243            MQ
## 7   2011     1         20         4     635     807            CO
## 8   2011     6         22         3     908    1040            CO
## 9   2011    12         13         2     706     824            MQ
## 10  2011    10         25         2    2310     149            DL
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

`select()`

Select colummns with select()

select(hflights_df, Year, Month, DayOfWeek)

## # A tibble: 227,496 x 3
##     Year Month DayOfWeek
## *  <int> <int>     <int>
## 1   2011     1         6
## 2   2011     1         7
## 3   2011     1         1
## 4   2011     1         2
## 5   2011     1         3
## 6   2011     1         4
## 7   2011     1         5
## 8   2011     1         6
## 9   2011     1         7
## 10  2011     1         1
## # ... with 227,486 more rows

# select columsn from Year to DayOfWeek
select(hflights_df, Year:DayOfWeek)

## # A tibble: 227,496 x 4
##     Year Month DayofMonth DayOfWeek
## *  <int> <int>      <int>     <int>
## 1   2011     1          1         6
## 2   2011     1          2         7
## 3   2011     1          3         1
## 4   2011     1          4         2
## 5   2011     1          5         3
## 6   2011     1          6         4
## 7   2011     1          7         5
## 8   2011     1          8         6
## 9   2011     1          9         7
## 10  2011     1         10         1
## # ... with 227,486 more rows

# select all columns exccept Year to DayOfWeek
select(hflights_df, -(Year:DayOfWeek))

## # A tibble: 227,496 x 17
##    DepTime ArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime
## *    <int>   <int>         <chr>     <int>   <chr>             <int>
## 1     1400    1500            AA       428  N576AA                60
## 2     1401    1501            AA       428  N557AA                60
## 3     1352    1502            AA       428  N541AA                70
## 4     1403    1513            AA       428  N403AA                70
## 5     1405    1507            AA       428  N492AA                62
## 6     1359    1503            AA       428  N262AA                64
## 7     1359    1509            AA       428  N493AA                70
## 8     1355    1454            AA       428  N477AA                59
## 9     1443    1554            AA       428  N476AA                71
## 10    1443    1553            AA       428  N504AA                70
## # ... with 227,486 more rows, and 11 more variables: AirTime <int>,
## #   ArrDelay <int>, DepDelay <int>, Origin <chr>, Dest <chr>,
## #   Distance <int>, TaxiIn <int>, TaxiOut <int>, Cancelled <int>,
## #   CancellationCode <chr>, Diverted <int>

`mutate()`

Add new columns which are functions of existing columns

mutate(hflights_df,
       gain = ArrDelay - DepDelay,
       speed = Distance / AirTime * 69)

## # A tibble: 227,496 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          2         7    1401    1501            AA
## 3   2011     1          3         1    1352    1502            AA
## 4   2011     1          4         2    1403    1513            AA
## 5   2011     1          5         3    1405    1507            AA
## 6   2011     1          6         4    1359    1503            AA
## 7   2011     1          7         5    1359    1509            AA
## 8   2011     1          8         6    1355    1454            AA
## 9   2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## # ... with 227,486 more rows, and 16 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, gain <int>, speed <dbl>

`summarise()`

Collapses data frame to a single row

summarise(hflights_df,
          delay = mean(DepDelay, na.rm = TRUE))

## # A tibble: 1 x 1
##      delay
##      <dbl>
## 1 9.444951

Grouped Operations

The true power of these functions lies when they are linked together, take this code snippet

planes <- group_by(hflights_df, TailNum)
delay <- summarise(planes,
                   count = n(),
                   dist = mean(Distance, na.rm = TRUE),
                   delay = mean(ArrDelay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

# plot them all!
ggplot(data = delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).

Another example would be to group by destinations to see the number of planes and flights that go to each possible destination.

destinations <- group_by(hflights_df, Dest)
summarise(destinations,
          planes = length(unique(TailNum)),
          flights = n())

## # A tibble: 116 x 3
##     Dest planes flights
##    <chr>  <int>   <int>
## 1    ABQ    716    2812
## 2    AEX    215     724
## 3    AGS      1       1
## 4    AMA    158    1297
## 5    ANC     38     125
## 6    ASE     60     125
## 7    ATL    983    7886
## 8    AUS   1015    5022
## 9    AVL    142     350
## 10   BFL     70     504
## # ... with 106 more rows

I don’t completely understand how groups are made. I except the rows are bunched together because it doesn’t do anything to the size of data frame.

# man the fuck up
# ?group_by

The group_by function takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”.

That clarifies a lot. Operations are performed for each group instead of each row.

Mutliple levels of grouping

I can also group by multiple variables, each summary then peels off one level.

daily <- group_by(hflights_df, Year, Month, DayofMonth)
(per_day <- summarise(daily, flights = n()))

## Source: local data frame [365 x 4]
## Groups: Year, Month [?]
## 
##     Year Month DayofMonth flights
##    <int> <int>      <int>   <int>
## 1   2011     1          1     552
## 2   2011     1          2     678
## 3   2011     1          3     702
## 4   2011     1          4     583
## 5   2011     1          5     590
## 6   2011     1          6     660
## 7   2011     1          7     661
## 8   2011     1          8     500
## 9   2011     1          9     602
## 10  2011     1         10     659
## # ... with 355 more rows

per_month <- summarise(per_day, flights = sum(flights))
per_year <- summarise(per_month, flights = sum(flights))

That’s about it!

Introduction to dplyr

Shailendra

10 August 2016

When working with data you must,

Data `hflights`

dplyr verbs

`filter()`

`arrange()`

`select()`

`mutate()`

`summarise()`

Grouped Operations

Mutliple levels of grouping

Introduction to dplyr

Shailendra

10 August 2016

When working with data you must,

Data hflights

dplyr verbs

filter()

arrange()

select()

mutate()

summarise()

Grouped Operations

Mutliple levels of grouping

Data `hflights`

`filter()`

`arrange()`

`select()`

`mutate()`

`summarise()`