PURPOSE: we explore the flights dataset available in nycflights13 using dplyr and lubridate.
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Not all flights actually occurred. We assume that if both dep_time and arr_time are NA, the flight was cancelled. These flights are not of interest we use filter() to remove them and overwrite flights in our workspace.
flights <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time))
## Warning: package 'bindrcpp' was built under R version 3.4.4
flights
## # A tibble: 328,063 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 328,053 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
My last flight was to Montreal. I thought it would be ok to arrive just in time because I had a purchased seat reservation and had confirmed the flight. So when I arrived 5 minutes before takeover, I was surprised to find the staff were “in a fap”. Apparently there had been numerous annoucements calling for passenger McLeod to go to the departure gate immediately. After quickly going through security, I boarded almost full plane and noticed the amused look on many faces!
An early departure is indicated by dep_delay < 0. We could use filter to select all flights early flights and use nrow() to determine the proportion.
nrow(filter(flights, dep_delay<0))/nrow(flights)
## [1] 0.5591731
Amazing about 56% leave early!
Let’s find out how many destinations there are and their relative frequency - that is we need to sort in descending order of frequency.
We see there are 104 destinations. The top 3 destinations are ALT, BOS and ORD with over 9500 flights in 2013.
DEST <- flights %>%
filter(dep_delay<0) %>%
select(dest) %>%
group_by(dest) %>%
summarize(count=n()) %>%
arrange(desc(count))
DEST
## # A tibble: 104 x 2
## dest count
## <chr> <int>
## 1 ATL 9871
## 2 BOS 9687
## 3 ORD 9556
## 4 CLT 8826
## 5 LAX 8478
## 6 MCO 7746
## 7 MIA 6874
## 8 SFO 6566
## 9 FLL 6244
## 10 DCA 5841
## # ... with 94 more rows
We see there was only 1 flight in 2013 to ANC and LEX.
arrange(DEST, count)
## # A tibble: 104 x 2
## dest count
## <chr> <int>
## 1 ANC 1
## 2 LEX 1
## 3 MTJ 3
## 4 JAC 4
## 5 SBN 4
## 6 HDN 5
## 7 EYW 8
## 8 PSP 13
## 9 BZN 15
## 10 CHO 27
## # ... with 94 more rows
There are 104 destinations. A summary of the number of flights to each is provided by the Tukey 5 number summary available in base R, stats::fivenum(). Note the use of dot notation with braces, {. }.
flights %>%
filter(dep_delay<0) %>%
select(dest) %>%
group_by(dest) %>%
summarize(count=n()) %>%
arrange(desc(count)) %>%
{fivenum(.$count)}
## [1] 1.0 173.0 784.5 2131.5 9871.0
For each destination we divide the number of early departures by the total number of flights. Note that we construct two tibbles and merge them use the dplyr::inner_join(). This is one of the important functions discussed in Ch 10 *Relational Data with dplyr**.
allByDest <- flights %>%
select(dest) %>%
group_by(dest) %>%
summarize(Total=n()) %>%
arrange(desc(Total))
earlyByDest <- flights %>%
filter(dep_delay<0) %>%
select(dest) %>%
group_by(dest) %>%
summarize(Early=n()) %>%
arrange(desc(Early))
inner_join(allByDest, earlyByDest) %>%
mutate(probEarly=Early/Total) %>%
ggplot(aes(x=probEarly)) +
geom_histogram(bins=10)
## Joining, by = "dest"
There is one destination with 100% probability of early departure! Let’s drill down and find out which.
inner_join(allByDest, earlyByDest) %>%
mutate(probEarly=Early/Total) %>%
arrange(desc(probEarly))
## Joining, by = "dest"
## # A tibble: 104 x 4
## dest Total Early probEarly
## <chr> <int> <int> <dbl>
## 1 LEX 1 1 1
## 2 PSP 18 13 0.722
## 3 XNA 1005 721 0.717
## 4 SRQ 1202 792 0.659
## 5 AVL 263 172 0.654
## 6 MVY 211 137 0.649
## 7 ACK 265 172 0.649
## 8 DFW 8453 5453 0.645
## 9 CLT 13684 8826 0.645
## 10 BOS 15028 9687 0.645
## # ... with 94 more rows
Well no surprise! There was only one flight to LEX and it left early.
We see that there is a group of flights with delays of 4 minutes or less. Also that not many flights have delays more than \(2^7\) minutes = 2.1 hours.
flights %>%
filter(dep_delay>0) %>%
ggplot(aes(x=log(dep_delay, base=2))) +
geom_histogram(binwidth=0.25) +
scale_x_continuous(breaks=1:10) +
ggtitle("Departure delays in log_2(minutes)")
We construct two tibbles one for delayed flights, which a delay of more than 4 minutes, and the other for all flights. Then use inner_join() and compute the proportion of delayed flights.
nD <- flights %>%
filter(dep_delay>4) %>%
group_by(dest) %>%
summarize(numDelayed=n()) %>%
arrange(desc(numDelayed))
#
nF <- flights %>%
group_by(dest) %>%
summarize(numFlights=n()) %>%
arrange(desc(numFlights))
#
nFD <- inner_join(nF, nD) %>%
mutate(probDelay=numDelayed/numFlights) %>%
arrange(desc(probDelay))
## Joining, by = "dest"
The Tukey five-number summary 8, 358, 1439, 4096.5, 16873 indicates that the distribution is right-skewed.
From the table below, the most popular flights seem to be delayed the most frequently.
arrange(nFD, desc(numFlights))
## # A tibble: 103 x 4
## dest numFlights numDelayed probDelay
## <chr> <int> <int> <dbl>
## 1 ATL 16873 4878 0.289
## 2 ORD 16607 5153 0.310
## 3 LAX 16058 4714 0.294
## 4 BOS 15028 3897 0.259
## 5 MCO 13979 4283 0.306
## 6 CLT 13684 3499 0.256
## 7 SFO 13212 4310 0.326
## 8 FLL 11930 4060 0.340
## 9 MIA 11625 3064 0.264
## 10 DCA 9125 2551 0.280
## # ... with 93 more rows
We construct a daily time series showing the number of minutes of delay for all destinations with more than 14000 flights in 2013 for which the departure delay exceeded 4 minutes. We use lubridate::make_date() to construct the Date and then use ggplot().
arrange(nFD, desc(numFlights)) %>%
filter(numFlights > 14000) %>%
select(dest) %>%
inner_join(., filter(flights, dep_delay>4)) %>%
group_by(month, day) %>%
summarize(mdelay=median(dep_delay, na.rm=TRUE), numdelay=n()) %>%
mutate(date=make_date(year=2013, month=month, day=day)) %>%
ggplot(aes(x=date, y=mdelay)) +
geom_line() +
geom_smooth(se=FALSE) +
ggtitle("Median non-negligible delay for most frequent flights") +
labs(y="delay in minutes")
## Joining, by = "dest"
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
To plot using ggplot(), we want to replace dep_time which is in seconds with an R Date. In lubridate we can use make_datetime(),
args(make_datetime)
## function (year = 1970L, month = 1L, day = 1L, hour = 0L, min = 0L,
## sec = 0, tz = "UTC")
## NULL
We see make_datetime() requires args hour and minute but the tibble flights$dep_time is the time in seconds from midnight. Using the modulo functions %% and %/% we can convert this to the required formats.
flights_dt <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate(
dep_time = make_datetime(year, month, day,
hour = dep_time%/%100, min = dep_time%%100)
) %>%
select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt
## # A tibble: 328,063 x 9
## origin dest dep_delay arr_delay dep_time sched_dep_time
## <chr> <chr> <dbl> <dbl> <dttm> <int>
## 1 EWR IAH 2 11 2013-01-01 05:17:00 515
## 2 LGA IAH 4 20 2013-01-01 05:33:00 529
## 3 JFK MIA 2 33 2013-01-01 05:42:00 540
## 4 JFK BQN -1 -18 2013-01-01 05:44:00 545
## 5 LGA ATL -6 -25 2013-01-01 05:54:00 600
## 6 EWR ORD -4 12 2013-01-01 05:54:00 558
## 7 EWR FLL -5 19 2013-01-01 05:55:00 600
## 8 LGA IAD -3 -14 2013-01-01 05:57:00 600
## 9 JFK MCO -3 -8 2013-01-01 05:57:00 600
## 10 LGA ORD -2 8 2013-01-01 05:58:00 600
## # ... with 328,053 more rows, and 3 more variables: arr_time <int>,
## # sched_arr_time <int>, air_time <dbl>
We can use geom_freqpoly() to visualise the distribution of departure times across the 2013 year:
flights_dt %>%
ggplot(aes(dep_time)) +
geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day
Or within a single day, for example, in the U.S. Thanksgiving holiday was on Thursday November 28. Let’s see the pattern of flights on the day before.
flights_dt %>%
filter(dep_time < make_datetime(2013, 11, 28, 0, 0),
dep_time >= make_datetime(2013, 11, 27, 0, 0)
) %>%
ggplot(aes(dep_time)) +
geom_freqpoly(binwidth = 3*600) + # 600 s = 10 minutes
ggtitle("NYC Airport. Total departures every 30 minutes.",
subtitle="November 27, 2013")
We see there are 336,766 flights but this includes cancelled flights. The major international airport in Hawii is in Honolulu with code HNL. Eliminating all cancelled flights and selecting only flights to Hawaii.
hflights <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time), dest=="HNL") %>%
select(-dest)
hflights
## # A tibble: 705 x 18
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 857 900 -3 1516
## 2 2013 1 1 1344 1344 0 2005
## 3 2013 1 2 909 900 9 1525
## 4 2013 1 2 1344 1344 0 1940
## 5 2013 1 3 914 900 14 1504
## 6 2013 1 3 1418 1341 37 2006
## 7 2013 1 4 900 900 0 1516
## 8 2013 1 4 1343 1341 2 1932
## 9 2013 1 5 858 900 -2 1519
## 10 2013 1 5 1329 1335 -6 1850
## # ... with 695 more rows, and 11 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
So there are 705 all together.
Which airlines fly there from NYC?
hflights %>%
{ unique(.$carrier) }
## [1] "HA" "UA"
Remark: First attempt, left out the braces - using google I found help here
https://stackoverflow.com/questions/42385010/using-the-pipe-and-dot-notation
For brevity the variable carrier uses an abbreviated code to denote the airline. The tibble airlines gives the name, so we see the two airlines are Hawaiian Airlines Inc., United Air Lines Inc.
The package lubridate is included in the recommended tidyverse but it needs to attached. This package provides very simplified and convenient functions for working with dates. The flights tibble contains the variable time_hour which is is recognized in date-time format,
Let’s look at barcharts for flights by day-of-week. We find there are about 100 flights each day – a little less on Tuesdays and Thursdays.
hflights %>%
mutate(dayOfWeek=wday(hflights$time_hour, label=TRUE)) %>%
ggplot(aes(dayOfWeek)) +
geom_bar()
Similarly the barchart for months. There are about 60 flights per month so about 2 every day assuming flights are evenly distributed over the month.
hflights %>%
mutate(monthName = month.abb[month]) %>%
ggplot(aes(monthName)) +
geom_bar()
We can solve this problem by converting the year, month and day into Date format using lubridate::make_date(). Then we can construct the set of all Dates in 2013 and see if all flight days are included. We find that they are!
start <- make_date(year=2013, month=1, day=1)
days2013 <- start+days(0:364)
hdays <- with(hflights, make_date(year=2013, month=month, day=day))
all(days2013 %in% hdays)
## [1] TRUE
Using base::table() we need to use the braces with .-notation.
hflights %>%
mutate(hdays = make_date(year=2013, month=month, day=day)) %>%
group_by(hdays) %>%
summarize(n=n()) %>%
{table(.$n)}
##
## 1 2
## 25 340
But using stats::table() we don’t need to use the braces.
hflights %>%
mutate(hdays = make_date(year=2013, month=month, day=day)) %>%
group_by(hdays) %>%
summarize(n=n()) %>%
xtabs(~n, .)
## n
## 1 2
## 25 340
hflights %>%
mutate(hdays = make_date(year=2013, month=month, day=day)) %>%
group_by(hdays) %>%
summarize(avFlyTime = mean(air_time, na.rm=TRUE)/60) %>%
filter(!is.nan(avFlyTime)) %>% #there is one of these!
ggplot(aes(x=hdays, y=avFlyTime)) +
geom_line() +
geom_smooth(method="loess", color=alpha("blue", 0.5), size=2,
se=FALSE, method.args=list(degree=2)) +
ggtitle("Average Flying Time from NYC to Honolulu") +
labs(x="date", y="time in hours")
hflights %>%
mutate(hdays = make_date(year=2013, month=month, day=day)) %>%
group_by(hdays) %>%
summarize(avFlyTime = mean(air_time, na.rm=TRUE)/60) %>%
filter(!is.nan(avFlyTime)) %>% #there is one of these!
ggplot(aes(x=avFlyTime, y=..density..)) +
geom_histogram(binwidth=0.2, fill="lightblue") +
geom_density(size=2, bw = "SJ-dpi") +
ggtitle("Average Flying Time from NYC to Honolulu") +
labs(x="time in hours")