Introduction

The dataset we are analyzing today is the NYC Flights dataset, which is part of the Tidyverse package. Firstly, we load the dataset, inspect its overall structure and look through the first six rows of the data.

library(nycflights13)
require(nycflights13)
str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
head(flights)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

This is a very big dataset with 32,735 rows and 16 columns. It contains the details of both inbound and outbound flights to and out of three major New York City airports, which are JFK, La Guardia, as well as and Newark Liberty in New Jersey, for the entirety of the year 2013.

Departure delays: What is the distribution of delayed flights from JFK airport?

We are able to obtain many interesting insights from the dataset, one in particular is related to departure and arrival delays at the airport. Let’s look at the the distribution of departure delays out of NYC’s largest airport. We can start by subsetting the data for flights that are only flying out of JFK.

jfk_origin <- flights[flights$origin %in% c("JFK"), ]
dim(jfk_origin)
## [1] 111279     19

We find that about 1/3 of the dataset originate from JFK. Next, lets look at the disribution of departure delays in our new subsetted dataset. We are interested in the shape of the distribution and basic descriptive statistics on the dataset, for which purpose we will also exclude empty values from the calculations.

library(ggplot2)
ggplot(data=na.omit(jfk_origin), aes(x=dep_delay))+
  geom_histogram(bins=50)+
  labs(title = "Histogram of JFK airport departure delays",
       subtitle = "(all units in Mins)",
       x = "Mins",
       y = "Frequency")

missing_values<- is.na(jfk_origin$dep_delay)
sum(missing_values)
## [1] 1863
mode_jfk <- sort(table(jfk_origin$dep_delay), decreasing=TRUE)[1]

cat("Range:", range(jfk_origin$dep_delay, na.rm=TRUE), "\n")
## Range: -43 1301
cat("Mean:", mean(jfk_origin$dep_delay, na.rm=TRUE), "\n")
## Mean: 12.11216
cat("Median:", median(jfk_origin$dep_delay, na.rm=TRUE), "\n")
## Median: -1
cat("Mode value:", names(mode_jfk), "freq: ", mode_jfk)
## Mode value: -3 freq:  9042
mode <- sort(table(jfk_origin$dep_delay), decreasing=TRUE)[1]

From the histogram of the departure delays data, we are also able to gather the following insights:

Which airlines should we choose?

Let’s look at which carrier tend to be the worst performer for punctuality

library(dplyr)  # Load the dplyr package
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
jfk_carrier<-jfk_origin %>%
  group_by(carrier) %>%
  summarise(
    mean_delay = mean(dep_delay, na.rm = TRUE),
    std_dev_delay = sd(dep_delay, na.rm = TRUE),
    median_delay = median(dep_delay, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_delay))

jfk_flight_freq <- as.data.frame(table(jfk_origin$carrier))
jfk_flight_freq <- jfk_flight_freq[order(jfk_flight_freq$Freq), ]
names(jfk_flight_freq) <- c("carrier", "freq")
merged_df<- merge(jfk_carrier, jfk_flight_freq, by.x = 'carrier', all.x = TRUE)
merged_df[order(-merged_df$mean_delay), ]
##    carrier mean_delay std_dev_delay median_delay  freq
## 1       9E  19.001517      48.08512           -1 14651
## 5       EV  18.520362      56.04845           -2  1408
## 10      VX  13.279441      44.86116            0  3596
## 7       MQ  13.199971      43.87308           -2  7193
## 3       B6  12.757453      36.12452           -1 42076
## 2       AA  10.302155      37.98175           -2 13783
## 4       DL   8.333188      35.16894           -2 20701
## 8       UA   7.900000      32.61058           -2  4534
## 9       US   5.866959      29.30178           -2  2995
## 6       HA   4.900585      74.10990           -4   342

When we rank the carriers by mean of their delayed departures, 9E (Endeavor Air) comes up on top with a mean of 19 minutes departure delays.

Meanwhile on the bottom of the chart we have HA (Hawaiian Airlines) which have both the lowest mean and median of departure delays in the dataset. However, HA also has the lowest amount of flights (342 flights during the year, which average to slightly below one flight everyday), which makes it mean susceptible to outliers. This is also supported by the fact that this airline also has the highest level of spread measure (standard deviation) in the dataset.

As we have established earlier that the shape of the distribution is right skewed, it is adviseable to use median as a measure of centredness in order to keep the data robust from outliers. Note that this problem is not only isolated to the mean, as standard deviation as measure of dispersion is also not robust from outliers. As such, IQR (interquartile range) would probably be a better measure of dispersion compared to standard deviation.

Which carrier is the most reliable and is subject to the least amount of outlier flights? The answer appears to be (UA) United Airlines, US (US Airways) which have among the lowest variability, the lowest mean and median lower than the population median as well.

When is the best time to fly out of JFK?

Is there any particular month that worried NYC travelers should avoid?

jfk_origin %>% 
  group_by(month) %>% 
  summarise(mean_delay=mean(dep_delay, na.rm=TRUE), std_dev_delay=sd(dep_delay, na.rm=TRUE), median_delay=median(dep_delay, na.rm=TRUE), iqr=IQR(dep_delay, na.rm=TRUE), frequency= n()) %>% 
  arrange(desc(mean_delay))
## # A tibble: 12 × 6
##    month mean_delay std_dev_delay median_delay   iqr frequency
##    <int>      <dbl>         <dbl>        <dbl> <dbl>     <int>
##  1     7      23.8           53.3            1    30     10023
##  2     6      20.5           50.2            0    25      9472
##  3    12      14.8           39.1            0    19      9146
##  4     8      12.9           36.3           -1    16      9983
##  5     5      12.5           38.5           -1    15      9397
##  6     4      12.2           41.2           -2    14      9218
##  7     2      11.8           37.4           -1    16      8421
##  8     3      10.7           35.3           -2    14      9697
##  9     1       8.62          36.0           -2    10      9161
## 10     9       6.64          32.5           -3     8      8908
## 11    11       4.68          27.1           -3     7      8710
## 12    10       4.59          25.2           -3     7      9143

Perhaps not very surprising result, but the summer holidays (June-July) seem to be the months with highest mean and median for delay, among the highest variability of delays (which is also explained by among the highest interquartile range as well). This is followed by December, another popular holiday month. In terms of frequency of flights throughout the month, there are also more flights happening during the top 3 months (June, July, December) compared to the rest of the year i.e. 44% of all flights during the year happens during these top 3 months.

Let’s look at the breakdown of departure delay data based on the months:

ggplot(jfk_origin, aes(x=factor(month), y=dep_delay))+
  geom_boxplot()
## Warning: Removed 1863 rows containing non-finite values (`stat_boxplot()`).

If we look at the boxplots by month, it is clear that based on the position of the box, the median is higher for the holiday months June and July, as well as December, and the variability is also the highest for the holiday months. Meanwhile only in January and June do we see significant outliers of departure delays, although the number of flights are not significantly more during these months.

Which airport has the best performance in terms of punctuality?

Let’s get back to the full dataset now and try to answer this question. We start by creating a new column we will call dep_type. This column will iterate through each row and seek to categorize departures into two types i.e. ‘Delayed’ or ‘On Time’. We will define any flights that still manage to depart within 5 minutes of the scheduled time to be ‘On Time’, otherwise it will be ‘Delayed’

flights<- flights %>% 
  mutate(dep_type = ifelse(dep_delay<5, "On Time", "Delayed"))

head(flights)
## # A tibble: 6 × 20
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, dep_type <chr>

We will now look at which airports have the highest and lowest On Time percentage among the ones presented in our dataset. This is defined as a percentage calculated by ‘On Time’ / (‘On Time’ + ‘Delayed’))

flights %>% 
  na.omit() %>% 
  group_by(origin) %>% 
  summarise(ot_dep_rate = sum(dep_type == "On Time") / n()) %>% 
  arrange(desc(ot_dep_rate))
## # A tibble: 3 × 2
##   origin ot_dep_rate
##   <chr>        <dbl>
## 1 LGA          0.729
## 2 JFK          0.691
## 3 EWR          0.640

From data above, La Guardia Airport in New York comes up at the top for the highest On Time departure rate, at 73% of all total flights in 2013. Meanwhile, New Jersey’s Newark airport is ranked the lowest at 64%.

Let’s look at some visualization for airports and departure types.

ggplot(data=flights, aes(x=origin, fill=dep_type))+
  geom_bar()

From the chart above, we can see that EWR does have the most amount of flights, and it also has the largest proportion of delayed flights out of the three US airports.

flights %>%
  na.omit() %>% 
  group_by(origin) %>% 
  summarise(no_of_flights=n()) %>% 
  arrange(desc(no_of_flights))
## # A tibble: 3 × 2
##   origin no_of_flights
##   <chr>          <int>
## 1 EWR           117127
## 2 JFK           109079
## 3 LGA           101140

That’s on the departure side of things. How about from the arrival side? An airport’s efficiency should be judged based on both the average departure delay as well as arrival delay times. We create a new column called arr_type, in which any flights with arrival delay of 0 and lower is ‘On Time’, and is otherwise ‘Delayed’.

flights<- flights %>% 
  mutate(arr_type = ifelse(arr_delay<=0, "On Time", "Delayed"))

flights %>% 
  na.omit() %>% 
  group_by(origin) %>% 
  summarise(arr_rate = sum(arr_type == "On Time") / n()) %>% 
  arrange(desc(arr_rate))
## # A tibble: 3 × 2
##   origin arr_rate
##   <chr>     <dbl>
## 1 JFK       0.607
## 2 LGA       0.604
## 3 EWR       0.572

The result is the same as the departure delay analysis. The airport with the highest On Time percentage on arrival times is still La Guardia airport, followed very closely by JFK. New Jersey’s Newark’s airport comes at more than 300bps margin at third place.

The conclusion remains that worried travelers should prioritize flying out of La Guardia to minimize any risk of delay.

What carrier travels the fastest?

Another insight we can gather from the dataset is an insight into the average speed for each flight. We can start by creating a new column called average_speed, and this column will divide distance by air time.

Since this is an American data, we understand that distance is stated in miles. The first step would be to convert the air time (which is stated in minutes), to hour, and then we can work out average_speed in miles per hour unit.

#Convert air time to hours by creating a new column called air_time_hour
flights$air_time_hour <- round(flights$air_time/60,1)

#Create a new column called average_speed by dividing distance in miles by air_time_hour
flights <- flights %>%
  na.omit() %>% 
  mutate(average_speed = distance / air_time_hour)

#summarise the information by carrier
flights %>% 
  arrange(desc(average_speed)) %>% 
  select(average_speed, tailnum, carrier)
## # A tibble: 327,346 × 3
##    average_speed tailnum carrier
##            <dbl> <chr>   <chr>  
##  1          693. N666DN  DL     
##  2          660  N14568  EV     
##  3          630  N17196  EV     
##  4          623. N12567  EV     
##  5          613. N947UW  US     
##  6          575  N956DL  DL     
##  7          571. N3768   DL     
##  8          569  N11544  EV     
##  9          560. N5FFAA  AA     
## 10          551. N3762Y  DL     
## # ℹ 327,336 more rows

From the table above, we see that tailnumber N666DN which belongs to Delta Airlines travels at the highest average speed of 693 miles per hour. A lot of factors could influence an aircraft’s average speed, among which is distance as well as altitude. As we do have the distance data, let’s visualize the relationship between average speed as well as distance traveled:

plot(flights$average_speed, flights$distance)

From the above visualization we could see a positive relationship between distance as well as average speed with a seemingly natural maximum at around 500-600 miles per hour (although we do have an outlier at 700 miles per hour). In general, the higher the distance traveled, the higher is the average speed. This may be explained by different types of carrier being used for long haul flights (e.g. type of aircraft used may be larger and they may travel faster), altitude as well as air traffic and airports (taxiing, air traffic congestions at respective airports, airports operational efficiency etc).

It is also worth noting that our physical inspection of the scatterplot indicates that the relationship between distance and average speed is not linear. Data may need to be transformed if we were to explain the relationship using linear models.

Which carrier handles the most long haul flights?

According to United Airlines, any distance above 1000 miles can already be considered as long haul flights. This will eliminate some airlines who only specialize in shorter distance destinations, for example, Skywest that mainly facilitates flights from New York to Cleveland.

flights %>% 
  na.omit() %>% 
  filter(distance > 1000) %>% 
  group_by(carrier) %>% 
  summarise(freq = n(), mean_distance=mean(distance)) %>% 
  arrange(desc(freq))
## # A tibble: 14 × 3
##    carrier  freq mean_distance
##    <chr>   <int>         <dbl>
##  1 UA      40608         1900.
##  2 B6      29719         1532.
##  3 DL      27850         1615.
##  4 AA      23084         1599.
##  5 EV       5886         1097.
##  6 VX       5116         2499.
##  7 WN       3765         1556.
##  8 9E       2560         1121.
##  9 US       2240         2143.
## 10 MQ       2188         1059.
## 11 AS        709         2402 
## 12 F9        681         1620 
## 13 HA        342         4983 
## 14 OO          4         1008

United Airlines do the most long-haul flights with mean distance of 1,900 miles (note that this is not UA’s average distance flights, this is just the average distance for flights above 1000 miles for UA). United Airlines operates more than 40,600 long haul flights in 2013, more than any other airlines on the list.