I am following this document, Introduction to dplyr
hflightsLet’s explore this dataset,
# read the fucking manual
# ?hflights
# just figured this dataset was available as a standalone package
# install.packages("hflights")
library(hflights)
# now read the fucking manual
# ?hflights
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
dim(hflights)
## [1] 227496 21
Moving on with the document. tble_df() comes in handy as a wrapper around a data frame that won’t accidentally print a lot of stuff.
hflights_df <- tbl_df(hflights)
hflights_df
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## * <int> <int> <int> <int> <int> <int> <chr>
## 1 2011 1 1 6 1400 1500 AA
## 2 2011 1 2 7 1401 1501 AA
## 3 2011 1 3 1 1352 1502 AA
## 4 2011 1 4 2 1403 1513 AA
## 5 2011 1 5 3 1405 1507 AA
## 6 2011 1 6 4 1359 1503 AA
## 7 2011 1 7 5 1359 1509 AA
## 8 2011 1 8 6 1355 1454 AA
## 9 2011 1 9 7 1443 1554 AA
## 10 2011 1 10 1 1443 1553 AA
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## # TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
filter()Works the same way as subset() but more intuitive to use. To filter all the flights on January 1st,
filter(hflights_df, Month == 1, DayofMonth == 1)
## # A tibble: 552 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## <int> <int> <int> <int> <int> <int> <chr>
## 1 2011 1 1 6 1400 1500 AA
## 2 2011 1 1 6 728 840 AA
## 3 2011 1 1 6 1631 1736 AA
## 4 2011 1 1 6 1756 2112 AA
## 5 2011 1 1 6 1012 1347 AA
## 6 2011 1 1 6 1211 1325 AA
## 7 2011 1 1 6 557 906 AA
## 8 2011 1 1 6 1824 2106 AS
## 9 2011 1 1 6 654 1124 B6
## 10 2011 1 1 6 1639 2110 B6
## # ... with 542 more rows, and 14 more variables: FlightNum <int>,
## # TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
arrange()Used to reorder rows.
arrange(hflights_df, DayofMonth, Month, Year)
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## <int> <int> <int> <int> <int> <int> <chr>
## 1 2011 1 1 6 1400 1500 AA
## 2 2011 1 1 6 728 840 AA
## 3 2011 1 1 6 1631 1736 AA
## 4 2011 1 1 6 1756 2112 AA
## 5 2011 1 1 6 1012 1347 AA
## 6 2011 1 1 6 1211 1325 AA
## 7 2011 1 1 6 557 906 AA
## 8 2011 1 1 6 1824 2106 AS
## 9 2011 1 1 6 654 1124 B6
## 10 2011 1 1 6 1639 2110 B6
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## # TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
Use desc() for descending order
arrange(hflights_df, desc(ArrDelay))
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## <int> <int> <int> <int> <int> <int> <chr>
## 1 2011 12 12 1 650 808 AA
## 2 2011 8 1 1 156 452 CO
## 3 2011 11 8 2 721 948 MQ
## 4 2011 6 21 2 2334 124 UA
## 5 2011 5 20 5 858 1027 MQ
## 6 2011 6 9 4 2029 2243 MQ
## 7 2011 1 20 4 635 807 CO
## 8 2011 6 22 3 908 1040 CO
## 9 2011 12 13 2 706 824 MQ
## 10 2011 10 25 2 2310 149 DL
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## # TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
select()Select colummns with select()
select(hflights_df, Year, Month, DayOfWeek)
## # A tibble: 227,496 x 3
## Year Month DayOfWeek
## * <int> <int> <int>
## 1 2011 1 6
## 2 2011 1 7
## 3 2011 1 1
## 4 2011 1 2
## 5 2011 1 3
## 6 2011 1 4
## 7 2011 1 5
## 8 2011 1 6
## 9 2011 1 7
## 10 2011 1 1
## # ... with 227,486 more rows
# select columsn from Year to DayOfWeek
select(hflights_df, Year:DayOfWeek)
## # A tibble: 227,496 x 4
## Year Month DayofMonth DayOfWeek
## * <int> <int> <int> <int>
## 1 2011 1 1 6
## 2 2011 1 2 7
## 3 2011 1 3 1
## 4 2011 1 4 2
## 5 2011 1 5 3
## 6 2011 1 6 4
## 7 2011 1 7 5
## 8 2011 1 8 6
## 9 2011 1 9 7
## 10 2011 1 10 1
## # ... with 227,486 more rows
# select all columns exccept Year to DayOfWeek
select(hflights_df, -(Year:DayOfWeek))
## # A tibble: 227,496 x 17
## DepTime ArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime
## * <int> <int> <chr> <int> <chr> <int>
## 1 1400 1500 AA 428 N576AA 60
## 2 1401 1501 AA 428 N557AA 60
## 3 1352 1502 AA 428 N541AA 70
## 4 1403 1513 AA 428 N403AA 70
## 5 1405 1507 AA 428 N492AA 62
## 6 1359 1503 AA 428 N262AA 64
## 7 1359 1509 AA 428 N493AA 70
## 8 1355 1454 AA 428 N477AA 59
## 9 1443 1554 AA 428 N476AA 71
## 10 1443 1553 AA 428 N504AA 70
## # ... with 227,486 more rows, and 11 more variables: AirTime <int>,
## # ArrDelay <int>, DepDelay <int>, Origin <chr>, Dest <chr>,
## # Distance <int>, TaxiIn <int>, TaxiOut <int>, Cancelled <int>,
## # CancellationCode <chr>, Diverted <int>
mutate()Add new columns which are functions of existing columns
mutate(hflights_df,
gain = ArrDelay - DepDelay,
speed = Distance / AirTime * 69)
## # A tibble: 227,496 x 23
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## <int> <int> <int> <int> <int> <int> <chr>
## 1 2011 1 1 6 1400 1500 AA
## 2 2011 1 2 7 1401 1501 AA
## 3 2011 1 3 1 1352 1502 AA
## 4 2011 1 4 2 1403 1513 AA
## 5 2011 1 5 3 1405 1507 AA
## 6 2011 1 6 4 1359 1503 AA
## 7 2011 1 7 5 1359 1509 AA
## 8 2011 1 8 6 1355 1454 AA
## 9 2011 1 9 7 1443 1554 AA
## 10 2011 1 10 1 1443 1553 AA
## # ... with 227,486 more rows, and 16 more variables: FlightNum <int>,
## # TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, gain <int>, speed <dbl>
summarise()Collapses data frame to a single row
summarise(hflights_df,
delay = mean(DepDelay, na.rm = TRUE))
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 9.444951
The true power of these functions lies when they are linked together, take this code snippet
planes <- group_by(hflights_df, TailNum)
delay <- summarise(planes,
count = n(),
dist = mean(Distance, na.rm = TRUE),
delay = mean(ArrDelay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)
# plot them all!
ggplot(data = delay, aes(dist, delay)) +
geom_point(aes(size = count), alpha = 1/2) +
geom_smooth() +
scale_size_area()
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Another example would be to group by destinations to see the number of planes and flights that go to each possible destination.
destinations <- group_by(hflights_df, Dest)
summarise(destinations,
planes = length(unique(TailNum)),
flights = n())
## # A tibble: 116 x 3
## Dest planes flights
## <chr> <int> <int>
## 1 ABQ 716 2812
## 2 AEX 215 724
## 3 AGS 1 1
## 4 AMA 158 1297
## 5 ANC 38 125
## 6 ASE 60 125
## 7 ATL 983 7886
## 8 AUS 1015 5022
## 9 AVL 142 350
## 10 BFL 70 504
## # ... with 106 more rows
I don’t completely understand how groups are made. I except the rows are bunched together because it doesn’t do anything to the size of data frame.
# man the fuck up
# ?group_by
The group_by function takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”.
That clarifies a lot. Operations are performed for each group instead of each row.
I can also group by multiple variables, each summary then peels off one level.
daily <- group_by(hflights_df, Year, Month, DayofMonth)
(per_day <- summarise(daily, flights = n()))
## Source: local data frame [365 x 4]
## Groups: Year, Month [?]
##
## Year Month DayofMonth flights
## <int> <int> <int> <int>
## 1 2011 1 1 552
## 2 2011 1 2 678
## 3 2011 1 3 702
## 4 2011 1 4 583
## 5 2011 1 5 590
## 6 2011 1 6 660
## 7 2011 1 7 661
## 8 2011 1 8 500
## 9 2011 1 9 602
## 10 2011 1 10 659
## # ... with 355 more rows
per_month <- summarise(per_day, flights = sum(flights))
per_year <- summarise(per_month, flights = sum(flights))
That’s about it!