Explore the context of the data first

This vignette showcases a brief exploratory analysis of the 2013 New York City flights data using R’s dplyr and ggplot2 packages.

Prerequisites

You need the following packages installed:

#load libraries
library(tidyr)
library(dplyr)
library(ggplot2)
library(nycflights13)

Start exploring

Let’s get a feel for the NYC flights dataset by looking at the data itself and at a statistical summary of its variables:

#flights that departed NYC in 2013 in tibble format
nycflights13::flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

#statistical summaries of the 19 variables
summary(flights)

##       year          month             day           dep_time   
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400  
##                                                  NA's   :8255  
##  sched_dep_time   dep_delay          arr_time    sched_arr_time
##  Min.   : 106   Min.   : -43.00   Min.   :   1   Min.   :   1  
##  1st Qu.: 906   1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124  
##  Median :1359   Median :  -2.00   Median :1535   Median :1556  
##  Mean   :1344   Mean   :  12.64   Mean   :1502   Mean   :1536  
##  3rd Qu.:1729   3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945  
##  Max.   :2359   Max.   :1301.00   Max.   :2400   Max.   :2359  
##                 NA's   :8255      NA's   :8713                 
##    arr_delay          carrier              flight       tailnum         
##  Min.   : -86.000   Length:336776      Min.   :   1   Length:336776     
##  1st Qu.: -17.000   Class :character   1st Qu.: 553   Class :character  
##  Median :  -5.000   Mode  :character   Median :1496   Mode  :character  
##  Mean   :   6.895                      Mean   :1972                     
##  3rd Qu.:  14.000                      3rd Qu.:3465                     
##  Max.   :1272.000                      Max.   :8500                     
##  NA's   :9430                                                           
##     origin              dest              air_time        distance   
##  Length:336776      Length:336776      Min.   : 20.0   Min.   :  17  
##  Class :character   Class :character   1st Qu.: 82.0   1st Qu.: 502  
##  Mode  :character   Mode  :character   Median :129.0   Median : 872  
##                                        Mean   :150.7   Mean   :1040  
##                                        3rd Qu.:192.0   3rd Qu.:1389  
##                                        Max.   :695.0   Max.   :4983  
##                                        NA's   :9430                  
##       hour           minute        time_hour                  
##  Min.   : 1.00   Min.   : 0.00   Min.   :2013-01-01 05:00:00  
##  1st Qu.: 9.00   1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00  
##  Median :13.00   Median :29.00   Median :2013-07-03 10:00:00  
##  Mean   :13.18   Mean   :26.23   Mean   :2013-07-03 05:22:54  
##  3rd Qu.:17.00   3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00  
##  Max.   :23.00   Max.   :59.00   Max.   :2013-12-31 23:00:00  
##
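The NA counts scattered through the summary can also be pulled out on their own. The following is a minimal sketch using base R's is.na() and colSums(); the name na_counts is just an illustrative choice, not part of the original analysis.

```r
library(nycflights13)

# Count missing values in every column of the flights tibble
na_counts <- colSums(is.na(flights))

# Show only the columns that actually contain missing values
na_counts[na_counts > 0]
```

Seeing the gaps side by side makes it easier to remember that any later filter or average on the delay columns is working with fewer than the full 336,776 flights.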

Data manipulation: dive in

Looking closely at the data often means manipulating it to reveal more information. In this case, the arrival delay times include both negative and positive numbers: a negative arr_delay means the flight arrived early and a positive one means it arrived late.

dplyr is a package for wrangling and sorting your data, and it supplies a grammar for data manipulation. That grammar is built around a small set of core verbs such as filter(), arrange(), select(), mutate() and summarise(). We will look at filter() and mutate() in this vignette.

First, let’s use filter() to find how many flights arrived ten or more minutes late:

#Find flights arriving 10 or more minutes late
filter(flights, arr_delay >= 10)

## # A tibble: 94,994 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      554            558        -4      740
##  5  2013     1     1      555            600        -5      913
##  6  2013     1     1      559            600        -1      941
##  7  2013     1     1      600            600         0      837
##  8  2013     1     1      602            605        -3      821
##  9  2013     1     1      608            600         8      807
## 10  2013     1     1      611            600        11      945
## # ... with 94,984 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

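OK - so around 95,000 of the 336,776 flights arrived 10 or more minutes late. How does that compare to early arrivals? The filter below keeps every flight with an arr_delay of -10 or higher, so the flights it drops (apart from those with a missing arr_delay) are the ones that arrived more than 10 minutes early.

#Keep flights with an arrival delay of -10 minutes or more
#(drops flights more than 10 minutes early and rows where arr_delay is NA)
filter(flights, arr_delay >= -10)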

## # A tibble: 201,989 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      554            558        -4      740
##  5  2013     1     1      555            600        -5      913
##  6  2013     1     1      557            600        -3      838
##  7  2013     1     1      558            600        -2      753
##  8  2013     1     1      558            600        -2      849
##  9  2013     1     1      558            600        -2      853
## 10  2013     1     1      558            600        -2      924
## # ... with 201,979 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

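This result needs careful reading. The 201,989 rows are the flights with an arr_delay of -10 or more, so they are the flights that did not arrive more than 10 minutes early. filter() also drops the 9,430 flights with no recorded arr_delay. That leaves 336,776 - 9,430 - 201,989 = 125,357 flights that arrived more than 10 minutes early. Contrary to what you might think about lousy delays at airports, more flights arrived over 10 minutes early than arrived 10 or more minutes late (about 95,000), and the median arr_delay of -5 in the summary shows that at least half of the flights with a recorded arrival landed ahead of schedule.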
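The late and early counts discussed above can also be computed directly in one step. This is a minimal sketch using dplyr's filter() and summarise(); the column names late_10_plus, early_10_plus and early_any are illustrative choices, and dropping rows with a missing arr_delay is an assumption about how to treat flights with no recorded arrival.

```r
library(dplyr)
library(nycflights13)

arrival_counts <- flights %>%
  # keep only flights with a recorded arrival delay
  filter(!is.na(arr_delay)) %>%
  summarise(
    late_10_plus  = sum(arr_delay >= 10),   # 10 or more minutes late
    early_10_plus = sum(arr_delay <= -10),  # 10 or more minutes early
    early_any     = sum(arr_delay < 0)      # any amount early
  )

arrival_counts
```

Counting with sum() over a logical condition avoids printing the full filtered tibble each time and keeps the comparisons next to each other.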

Visualising the data

ggplot2 can visualise the emerging story in the data with a raft of plots, colours and displays. For a first quick look, though, base R’s hist() function is enough.

A histogram shows how often each range of delay, in minutes, occurs for flight arrivals and departures.

#Create 2 histograms showing arr_delay and dep_delay across 2013 flights
par(mfrow=c(1,2))
hist(flights$arr_delay, main = "Arrival Time Delays")
hist(flights$dep_delay, main = "Departure Time Delays")
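The same pair of histograms can be drawn with ggplot2 instead of base R. The sketch below is one possible approach, not the method used for the plots in this vignette: it reshapes the two delay columns into long format with tidyr's pivot_longer() (this assumes tidyr 1.0 or later) and uses facet_wrap() for the side-by-side panels. The 15-minute bin width and the names delays_long, delay_type and minutes are illustrative assumptions.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(nycflights13)

# Reshape to one row per (flight, delay type) and drop missing delays
delays_long <- flights %>%
  select(arr_delay, dep_delay) %>%
  pivot_longer(everything(), names_to = "delay_type", values_to = "minutes") %>%
  filter(!is.na(minutes))

# One histogram panel per delay type
delay_hist <- ggplot(delays_long, aes(x = minutes)) +
  geom_histogram(binwidth = 15) +
  facet_wrap(~ delay_type) +
  labs(x = "Delay (minutes)", y = "Number of flights")

delay_hist
```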

Hello, the majority of flights arrive early?

A story is starting to emerge around the high frequency of ‘early’ flights clustered to the left of the zero, which marks an ‘on time’ arrival. The two histograms also have a very similar shape, which hints that the arr_delay and dep_delay variables may be correlated, although side-by-side histograms cannot confirm that on their own.
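One way to check the relationship between the two delay variables is to compute their correlation. This is a small sketch using base R's cor(); the use = "complete.obs" argument restricts the calculation to flights where both delays are recorded, and delay_cor is just an illustrative name.

```r
library(nycflights13)

# Pearson correlation between arrival and departure delay,
# using only flights where both values are recorded
delay_cor <- cor(flights$arr_delay, flights$dep_delay, use = "complete.obs")

delay_cor
```

A value close to 1 would support the idea that flights leaving late also tend to arrive late.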

Let’s see if there’s any variation to this over the year that the data was captured. For that, let’s use a scatterplot.

Using ggplot for scatterplots

data = flights  
# Plot the time_hour variable against the arrival delay time over 2013
ggplot(data, aes(x=time_hour, y= arr_delay)) +
      geom_point()

The scatterplot above is overplotted with too much detail to clearly see the patterns.

We can use the pipe operator %>% and the sample_frac() function to take a random 1% sample of the flights data to get a better visualisation.

Let’s also plot a smoothing line to see the trend around how many flights arrive late or early during the calendar year.

# take a 1% random sample of the flights data to make the plot readable
data = flights %>% sample_frac(.01)
# plot the sample and add a trend line
# (span only affects loess smoothing; with 1,000 or more points
#  geom_smooth() defaults to a GAM unless method = "loess" is set)
ggplot(data, aes(x=time_hour, y= arr_delay)) + 
        geom_point() +
        geom_smooth(span = 1)

The smoothing line suggests that flights tend to suffer longer arrival delays around July and August. Because the sample is random, the exact curve will vary a little from run to run.
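The seasonal pattern can be cross-checked on the full dataset rather than a 1% sample by averaging the arrival delay per month. The sketch below uses dplyr's group_by() and summarise(); the names monthly_delay, mean_arr_delay and flights_counted are illustrative, and ignoring missing arr_delay values with na.rm = TRUE is an assumption about how to treat flights with no recorded arrival.

```r
library(dplyr)
library(nycflights13)

monthly_delay <- flights %>%
  group_by(month) %>%
  summarise(
    mean_arr_delay  = mean(arr_delay, na.rm = TRUE),  # average delay in minutes
    flights_counted = sum(!is.na(arr_delay))          # flights behind each average
  )

monthly_delay
```

If the summer months show the highest averages here, that backs up what the smoothing line hints at.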

A trusty old bar plot

The scatterplots give a good indication of the trend for late vs early flights over the year, but it’s hard to see how the trend varies month by month.

Let’s classify flights as either on time or delayed by treating any flight with a departure delay of 5 minutes or more as delayed. Let’s look at this by month.

Using dplyr’s mutate() function, we can classify flights as on time or delayed, which will let us plot a bar chart showing delayed and on-time departures each month.

# create a new variable to classify each departure as on time or delayed
# (flights with a missing dep_delay get NA)
flights <- flights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
# plot a bar chart by month (ggplot() is used here because qplot() is deprecated in recent ggplot2 releases)
ggplot(flights, aes(x = month, fill = dep_type)) +
  geom_bar() +
  ggtitle("Frequency of Delayed vs On Time Departures by Month")
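Because the number of flights differs from month to month, raw counts can make the months hard to compare. A possible follow-up, sketched here under the same 5-minute cut-off, is to plot the share of delayed departures per month instead. The names monthly_share and share_delayed are illustrative, and dropping flights with a missing dep_delay is an assumption.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Share of departures delayed by 5 or more minutes, per month
monthly_share <- flights %>%
  filter(!is.na(dep_delay)) %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed")) %>%
  group_by(month) %>%
  summarise(share_delayed = mean(dep_type == "delayed"))

# Bar chart of the monthly shares
share_plot <- ggplot(monthly_share, aes(x = factor(month), y = share_delayed)) +
  geom_col() +
  labs(x = "Month", y = "Share of departures delayed 5+ minutes")

share_plot
```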

A final word

The story of the New York flights data is now starting to emerge, and a deeper exploratory data analysis can build on it. Richer questions to explore include how weather, airline choice or holiday rushes affect flight times. A more statistically challenging next step would be to fit a multivariate regression to this dataset. The basic exploration, manipulation and visualisation steps outlined in this vignette using dplyr and ggplot2 are a repeatable process that someone with basic R skills can use to get started examining data.

How to manipulate and plot flight delays data

Alex Brooks

17 August 2018