This vignette showcases a brief exploratory analysis of the New York City flights data from 2013 using R’s dplyr and ggplot2 packages.
You need the following packages installed:
#load libraries
library(tidyr)
library(dplyr)
library(ggplot2)
library(nycflights13)
Let’s get some context for the NYC flights dataset by looking at the data itself and a statistical summary of its variables:
#flights that departed NYC in 2013 in tibble format
nycflights13::flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
#statistical summaries of the 19 variables
summary(flights)
## year month day dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907
## Median :2013 Median : 7.000 Median :16.00 Median :1401
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400
## NA's :8255
## sched_dep_time dep_delay arr_time sched_arr_time
## Min. : 106 Min. : -43.00 Min. : 1 Min. : 1
## 1st Qu.: 906 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124
## Median :1359 Median : -2.00 Median :1535 Median :1556
## Mean :1344 Mean : 12.64 Mean :1502 Mean :1536
## 3rd Qu.:1729 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945
## Max. :2359 Max. :1301.00 Max. :2400 Max. :2359
## NA's :8255 NA's :8713
## arr_delay carrier flight tailnum
## Min. : -86.000 Length:336776 Min. : 1 Length:336776
## 1st Qu.: -17.000 Class :character 1st Qu.: 553 Class :character
## Median : -5.000 Mode :character Median :1496 Mode :character
## Mean : 6.895 Mean :1972
## 3rd Qu.: 14.000 3rd Qu.:3465
## Max. :1272.000 Max. :8500
## NA's :9430
## origin dest air_time distance
## Length:336776 Length:336776 Min. : 20.0 Min. : 17
## Class :character Class :character 1st Qu.: 82.0 1st Qu.: 502
## Mode :character Mode :character Median :129.0 Median : 872
## Mean :150.7 Mean :1040
## 3rd Qu.:192.0 3rd Qu.:1389
## Max. :695.0 Max. :4983
## NA's :9430
## hour minute time_hour
## Min. : 1.00 Min. : 0.00 Min. :2013-01-01 05:00:00
## 1st Qu.: 9.00 1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00
## Median :13.00 Median :29.00 Median :2013-07-03 10:00:00
## Mean :13.18 Mean :26.23 Mean :2013-07-03 05:22:54
## 3rd Qu.:17.00 3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00
## Max. :23.00 Max. :59.00 Max. :2013-12-31 23:00:00
##
Looking closely at the data often means you want to manipulate it to see more information. In this case, looking closely at the arrival delay times shows negative and positive numbers, implying some flights are late and others are early.
dplyr is a package for wrangling and sorting your data; it supplies a grammar of data manipulation built around five core verbs: filter(), arrange(), select(), mutate() and summarise(). We will look at filter() and mutate() in this vignette.
First, let’s use filter() to find how many flights arrived at least ten minutes late:
#Find flights arriving at least 10 minutes late
filter(flights, arr_delay >= 10)
## # A tibble: 94,994 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 554 558 -4 740
## 5 2013 1 1 555 600 -5 913
## 6 2013 1 1 559 600 -1 941
## 7 2013 1 1 600 600 0 837
## 8 2013 1 1 602 605 -3 821
## 9 2013 1 1 608 600 8 807
## 10 2013 1 1 611 600 11 945
## # ... with 94,984 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
OK - so around 95,000 of the 336,776 flights arrived at least 10 minutes late. How might that compare to flights that arrived more than 10 minutes early?
#Keep flights with arr_delay >= -10, i.e. drop flights that arrived more than 10 minutes early
filter(flights, arr_delay >= -10)
## # A tibble: 201,989 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 554 558 -4 740
## 5 2013 1 1 555 600 -5 913
## 6 2013 1 1 557 600 -3 838
## 7 2013 1 1 558 600 -2 753
## 8 2013 1 1 558 600 -2 849
## 9 2013 1 1 558 600 -2 853
## 10 2013 1 1 558 600 -2 924
## # ... with 201,979 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
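Note that arr_delay >= -10 keeps every flight that was not more than 10 minutes early, including all the late ones, so the 201,989 rows are not a count of early flights. Of the 327,346 flights with a recorded arrival delay (336,776 minus 9,430 missing values), that leaves 125,357 that arrived more than 10 minutes early. Contrary to what you might think about lousy delays at airports, more flights arrived over 10 minutes early in New York in 2013 than arrived at least 10 minutes late.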
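Printing a filtered tibble is a roundabout way to get a count. As a minimal sketch, both counts can be computed directly with summarise(). The column names late_10_plus, early_10_plus and with_arr_delay are illustrative choices, not part of the dataset, and flights with a missing arr_delay are dropped first.

```r
library(dplyr)
library(nycflights13)

# Count late and early arrivals in one pass.
# Assumption: "late" means arr_delay >= 10 and "early" means arr_delay <= -10.
delay_counts <- flights %>%
  filter(!is.na(arr_delay)) %>%              # ignore flights with no recorded arrival delay
  summarise(
    late_10_plus   = sum(arr_delay >= 10),   # at least 10 minutes late
    early_10_plus  = sum(arr_delay <= -10),  # at least 10 minutes early
    with_arr_delay = n()                     # flights with a recorded arrival delay
  )

delay_counts
```

Note the direction of the inequality: early arrivals have negative delays, so flights at least 10 minutes early satisfy arr_delay <= -10.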
ggplot2 can visualise the emerging story in the data with a raft of plots, colours and displays.
A histogram helps us see how often each length of delay, in minutes, occurs for arrivals and departures. The quick version below uses base R’s hist() rather than ggplot2.
#Create 2 histograms showing arr_delay and dep_delay across 2013 flights
par(mfrow=c(1,2))
hist(flights$arr_delay, main = "Arrival Time Delays")
hist(flights$dep_delay, main = "Departure Time Delays")
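The histograms above use base R graphics. A rough ggplot2 equivalent is sketched below. It reshapes the two delay columns with tidyr and draws one panel per delay type. The 15-minute bin width and the names delays_long, delay_type and minutes are illustrative assumptions.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(nycflights13)

# Stack arr_delay and dep_delay into one column so ggplot2 can facet on the type
delays_long <- flights %>%
  select(arr_delay, dep_delay) %>%
  pivot_longer(everything(), names_to = "delay_type", values_to = "minutes") %>%
  filter(!is.na(minutes))   # drop missing delays

# One histogram panel per delay type; binwidth = 15 is an illustrative choice
p <- ggplot(delays_long, aes(x = minutes)) +
  geom_histogram(binwidth = 15) +
  facet_wrap(~ delay_type) +
  labs(x = "Delay (minutes)", y = "Number of flights",
       title = "Arrival and Departure Delays")

p
```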
A story is starting to emerge around the high frequency of ‘early’ flights clustered to the left of zero, which marks an ‘on time’ arrival. The arr_delay and dep_delay distributions also have a very similar shape, which suggests the two variables may be highly correlated, although histograms alone cannot confirm that.
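Histograms show each variable on its own, so they cannot confirm a correlation. One quick check, sketched here with base R’s cor() on the flights where both delays are recorded, might look like this (delay_cor is just an illustrative name):

```r
library(nycflights13)

# Pearson correlation between departure and arrival delay,
# using only flights where both values are present
delay_cor <- cor(flights$dep_delay, flights$arr_delay, use = "complete.obs")

delay_cor
```

A value close to 1 would support the idea that flights which leave late also tend to arrive late.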
Let’s see if there’s any variation to this over the year that the data was captured. For that, let’s use a scatterplot.
data = flights
# Plot the time_hour variable against the arrival delay time over 2013
ggplot(data, aes(x=time_hour, y= arr_delay)) +
geom_point()
The scatterplot above is overplotted with too much detail to clearly see the patterns.
We can use the pipe operator %>% and sample_frac commands to take a random 1% sample of the flights data to get a better visualisation.
Let’s also plot a smoothing line to see the trend around how many flights arrive late or early during the calendar year.
# take a 1% random sample of the flights data to make the plot readable
data = flights %>% sample_frac(.01)
# plot the sample and add a trend line with geom_smooth()
ggplot(data, aes(x=time_hour, y= arr_delay)) +
geom_point() +
geom_smooth(span = 1)
The smoothing line suggests that flights tend to suffer longer delays around July and August.
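Because the 1% sample is random, the plot will differ slightly on every run. A sketch of a reproducible version follows. The seed value 2013 and the names flights_sample and p_sample are arbitrary, and slice_sample() is the newer dplyr (1.0.0 and later) alternative to sample_frac(), which still works.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

set.seed(2013)  # arbitrary seed so the same rows are drawn on every run
flights_sample <- flights %>% slice_sample(prop = 0.01)

p_sample <- ggplot(flights_sample, aes(x = time_hour, y = arr_delay)) +
  geom_point(alpha = 0.3) +   # semi-transparent points reduce overplotting further
  geom_smooth()               # default smoother, chosen automatically by ggplot2

p_sample
```

Fixing the seed makes it easier to tell whether a pattern such as the summer rise in delays comes from the data or from one particular random draw; trying a few different seeds is a cheap sanity check.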
The scatterplots give a good indication of the trend for late vs early flights over the year, but it’s hard to see how the trend varies month by month.
Let’s classify flights as either on time or delayed by treating any flight with a departure delay of five minutes or more as delayed, and then look at this by month.
Using dplyr’s mutate() function, we can classify flights as on time or delayed, which will let us plot a bar chart showing delayed and on-time departures each month.
flights <- flights %>%
# create a new variable classifying each flight as on time (under 5 minutes) or delayed;
# flights with a missing dep_delay get NA
mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
# plot a bar chart of departure types by month
qplot(x = month, fill = dep_type, data = flights, geom = "bar", main = "Frequency of Delayed vs On Time Departures by Month")
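A stacked bar chart of counts can make months hard to compare, because the number of flights also differs from month to month. As a sketch, the share of delayed departures per month can be tabulated with group_by() and summarise(). The names monthly_delays, n_flights and prop_delayed are illustrative, and the five-minute threshold is the same assumption used above.

```r
library(dplyr)
library(nycflights13)

monthly_delays <- nycflights13::flights %>%
  filter(!is.na(dep_delay)) %>%                  # drop flights with no recorded departure delay
  mutate(dep_type = if_else(dep_delay < 5, "on time", "delayed")) %>%
  group_by(month) %>%
  summarise(
    n_flights    = n(),                          # flights with a recorded departure delay
    prop_delayed = mean(dep_type == "delayed")   # share delayed by 5 minutes or more
  )

monthly_delays
```

Each row is one month, so the prop_delayed column can be read straight down to see which months fare worst.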
The story of the New York flights data is now starting to emerge, and a deeper exploratory data analysis can build on it. Some richer ideas to explore would be how weather, airline choice or holiday rushes affect delays. A more statistically challenging next step would be multivariate regression on this dataset. The basic exploration, manipulation and visualisation steps outlined in this vignette using dplyr and ggplot2 are a repeatable process that someone with basic R skills can use to get started examining data.