The dataset we are analyzing today is the NYC Flights dataset, which is part of the Tidyverse package. Firstly, we load the dataset, inspect its overall structure and look through the first six rows of the data.
library(nycflights13)
require(nycflights13)
str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
This is a very big dataset with 32,735 rows and 16 columns. It contains the details of both inbound and outbound flights to and out of three major New York City airports, which are JFK, La Guardia, as well as and Newark Liberty in New Jersey, for the entirety of the year 2013.
We are able to obtain many interesting insights from the dataset, one in particular is related to departure and arrival delays at the airport. Let’s look at the the distribution of departure delays out of NYC’s largest airport. We can start by subsetting the data for flights that are only flying out of JFK.
jfk_origin <- flights[flights$origin %in% c("JFK"), ]
dim(jfk_origin)
## [1] 111279 19
We find that about 1/3 of the dataset originate from JFK. Next, lets look at the disribution of departure delays in our new subsetted dataset. We are interested in the shape of the distribution and basic descriptive statistics on the dataset, for which purpose we will also exclude empty values from the calculations.
library(ggplot2)
ggplot(data=na.omit(jfk_origin), aes(x=dep_delay))+
geom_histogram(bins=50)+
labs(title = "Histogram of JFK airport departure delays",
subtitle = "(all units in Mins)",
x = "Mins",
y = "Frequency")
missing_values<- is.na(jfk_origin$dep_delay)
sum(missing_values)
## [1] 1863
mode_jfk <- sort(table(jfk_origin$dep_delay), decreasing=TRUE)[1]
cat("Range:", range(jfk_origin$dep_delay, na.rm=TRUE), "\n")
## Range: -43 1301
cat("Mean:", mean(jfk_origin$dep_delay, na.rm=TRUE), "\n")
## Mean: 12.11216
cat("Median:", median(jfk_origin$dep_delay, na.rm=TRUE), "\n")
## Median: -1
cat("Mode value:", names(mode_jfk), "freq: ", mode_jfk)
## Mode value: -3 freq: 9042
mode <- sort(table(jfk_origin$dep_delay), decreasing=TRUE)[1]
From the histogram of the departure delays data, we are also able to gather the following insights:
The distribution is skewed to the right, indicating the presence of outliers on the right side of the distribution that potentially skew the mean. However, most flights tend to center around the median, which is approximately -1. This suggests that in this distribution, the mean is greater than the median, which is in turn greater than the mode.
Departure delays are not the only observation in the data; there are also departures that leave ahead of the scheduled time. This is reflected in the negative data values. For example, the absolute minimum departure delay out of JFK is only 17 minutes early, and the data’s median is actually -1. On the other hand, the absolute maximum departure delay is just over 1,300 minutes, which is almost 22 hours!
Let’s look at which carrier tend to be the worst performer for punctuality
library(dplyr) # Load the dplyr package
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
jfk_carrier<-jfk_origin %>%
group_by(carrier) %>%
summarise(
mean_delay = mean(dep_delay, na.rm = TRUE),
std_dev_delay = sd(dep_delay, na.rm = TRUE),
median_delay = median(dep_delay, na.rm = TRUE)
) %>%
arrange(desc(mean_delay))
jfk_flight_freq <- as.data.frame(table(jfk_origin$carrier))
jfk_flight_freq <- jfk_flight_freq[order(jfk_flight_freq$Freq), ]
names(jfk_flight_freq) <- c("carrier", "freq")
merged_df<- merge(jfk_carrier, jfk_flight_freq, by.x = 'carrier', all.x = TRUE)
merged_df[order(-merged_df$mean_delay), ]
## carrier mean_delay std_dev_delay median_delay freq
## 1 9E 19.001517 48.08512 -1 14651
## 5 EV 18.520362 56.04845 -2 1408
## 10 VX 13.279441 44.86116 0 3596
## 7 MQ 13.199971 43.87308 -2 7193
## 3 B6 12.757453 36.12452 -1 42076
## 2 AA 10.302155 37.98175 -2 13783
## 4 DL 8.333188 35.16894 -2 20701
## 8 UA 7.900000 32.61058 -2 4534
## 9 US 5.866959 29.30178 -2 2995
## 6 HA 4.900585 74.10990 -4 342
When we rank the carriers by mean of their delayed departures, 9E (Endeavor Air) comes up on top with a mean of 19 minutes departure delays.
Meanwhile on the bottom of the chart we have HA (Hawaiian Airlines) which have both the lowest mean and median of departure delays in the dataset. However, HA also has the lowest amount of flights (342 flights during the year, which average to slightly below one flight everyday), which makes it mean susceptible to outliers. This is also supported by the fact that this airline also has the highest level of spread measure (standard deviation) in the dataset.
As we have established earlier that the shape of the distribution is right skewed, it is adviseable to use median as a measure of centredness in order to keep the data robust from outliers. Note that this problem is not only isolated to the mean, as standard deviation as measure of dispersion is also not robust from outliers. As such, IQR (interquartile range) would probably be a better measure of dispersion compared to standard deviation.
Which carrier is the most reliable and is subject to the least amount of outlier flights? The answer appears to be (UA) United Airlines, US (US Airways) which have among the lowest variability, the lowest mean and median lower than the population median as well.
Is there any particular month that worried NYC travelers should avoid?
jfk_origin %>%
group_by(month) %>%
summarise(mean_delay=mean(dep_delay, na.rm=TRUE), std_dev_delay=sd(dep_delay, na.rm=TRUE), median_delay=median(dep_delay, na.rm=TRUE), iqr=IQR(dep_delay, na.rm=TRUE), frequency= n()) %>%
arrange(desc(mean_delay))
## # A tibble: 12 × 6
## month mean_delay std_dev_delay median_delay iqr frequency
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 7 23.8 53.3 1 30 10023
## 2 6 20.5 50.2 0 25 9472
## 3 12 14.8 39.1 0 19 9146
## 4 8 12.9 36.3 -1 16 9983
## 5 5 12.5 38.5 -1 15 9397
## 6 4 12.2 41.2 -2 14 9218
## 7 2 11.8 37.4 -1 16 8421
## 8 3 10.7 35.3 -2 14 9697
## 9 1 8.62 36.0 -2 10 9161
## 10 9 6.64 32.5 -3 8 8908
## 11 11 4.68 27.1 -3 7 8710
## 12 10 4.59 25.2 -3 7 9143
Perhaps not very surprising result, but the summer holidays (June-July) seem to be the months with highest mean and median for delay, among the highest variability of delays (which is also explained by among the highest interquartile range as well). This is followed by December, another popular holiday month. In terms of frequency of flights throughout the month, there are also more flights happening during the top 3 months (June, July, December) compared to the rest of the year i.e. 44% of all flights during the year happens during these top 3 months.
Let’s look at the breakdown of departure delay data based on the months:
ggplot(jfk_origin, aes(x=factor(month), y=dep_delay))+
geom_boxplot()
## Warning: Removed 1863 rows containing non-finite values (`stat_boxplot()`).
If we look at the boxplots by month, it is clear that based on the
position of the box, the median is higher for the holiday months June
and July, as well as December, and the variability is also the highest
for the holiday months. Meanwhile only in January and June do we see
significant outliers of departure delays, although the number of flights
are not significantly more during these months.
Let’s get back to the full dataset now and try to answer this question. We start by creating a new column we will call dep_type. This column will iterate through each row and seek to categorize departures into two types i.e. ‘Delayed’ or ‘On Time’. We will define any flights that still manage to depart within 5 minutes of the scheduled time to be ‘On Time’, otherwise it will be ‘Delayed’
flights<- flights %>%
mutate(dep_type = ifelse(dep_delay<5, "On Time", "Delayed"))
head(flights)
## # A tibble: 6 × 20
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, dep_type <chr>
We will now look at which airports have the highest and lowest On Time percentage among the ones presented in our dataset. This is defined as a percentage calculated by ‘On Time’ / (‘On Time’ + ‘Delayed’))
flights %>%
na.omit() %>%
group_by(origin) %>%
summarise(ot_dep_rate = sum(dep_type == "On Time") / n()) %>%
arrange(desc(ot_dep_rate))
## # A tibble: 3 × 2
## origin ot_dep_rate
## <chr> <dbl>
## 1 LGA 0.729
## 2 JFK 0.691
## 3 EWR 0.640
From data above, La Guardia Airport in New York comes up at the top for the highest On Time departure rate, at 73% of all total flights in 2013. Meanwhile, New Jersey’s Newark airport is ranked the lowest at 64%.
Let’s look at some visualization for airports and departure types.
ggplot(data=flights, aes(x=origin, fill=dep_type))+
geom_bar()
From the chart above, we can see that EWR does have the most amount of
flights, and it also has the largest proportion of delayed flights out
of the three US airports.
flights %>%
na.omit() %>%
group_by(origin) %>%
summarise(no_of_flights=n()) %>%
arrange(desc(no_of_flights))
## # A tibble: 3 × 2
## origin no_of_flights
## <chr> <int>
## 1 EWR 117127
## 2 JFK 109079
## 3 LGA 101140
That’s on the departure side of things. How about from the arrival side? An airport’s efficiency should be judged based on both the average departure delay as well as arrival delay times. We create a new column called arr_type, in which any flights with arrival delay of 0 and lower is ‘On Time’, and is otherwise ‘Delayed’.
flights<- flights %>%
mutate(arr_type = ifelse(arr_delay<=0, "On Time", "Delayed"))
flights %>%
na.omit() %>%
group_by(origin) %>%
summarise(arr_rate = sum(arr_type == "On Time") / n()) %>%
arrange(desc(arr_rate))
## # A tibble: 3 × 2
## origin arr_rate
## <chr> <dbl>
## 1 JFK 0.607
## 2 LGA 0.604
## 3 EWR 0.572
The result is the same as the departure delay analysis. The airport with the highest On Time percentage on arrival times is still La Guardia airport, followed very closely by JFK. New Jersey’s Newark’s airport comes at more than 300bps margin at third place.
The conclusion remains that worried travelers should prioritize flying out of La Guardia to minimize any risk of delay.
Another insight we can gather from the dataset is an insight into the average speed for each flight. We can start by creating a new column called average_speed, and this column will divide distance by air time.
Since this is an American data, we understand that distance is stated in miles. The first step would be to convert the air time (which is stated in minutes), to hour, and then we can work out average_speed in miles per hour unit.
#Convert air time to hours by creating a new column called air_time_hour
flights$air_time_hour <- round(flights$air_time/60,1)
#Create a new column called average_speed by dividing distance in miles by air_time_hour
flights <- flights %>%
na.omit() %>%
mutate(average_speed = distance / air_time_hour)
#summarise the information by carrier
flights %>%
arrange(desc(average_speed)) %>%
select(average_speed, tailnum, carrier)
## # A tibble: 327,346 × 3
## average_speed tailnum carrier
## <dbl> <chr> <chr>
## 1 693. N666DN DL
## 2 660 N14568 EV
## 3 630 N17196 EV
## 4 623. N12567 EV
## 5 613. N947UW US
## 6 575 N956DL DL
## 7 571. N3768 DL
## 8 569 N11544 EV
## 9 560. N5FFAA AA
## 10 551. N3762Y DL
## # ℹ 327,336 more rows
From the table above, we see that tailnumber N666DN which belongs to Delta Airlines travels at the highest average speed of 693 miles per hour. A lot of factors could influence an aircraft’s average speed, among which is distance as well as altitude. As we do have the distance data, let’s visualize the relationship between average speed as well as distance traveled:
plot(flights$average_speed, flights$distance)
From the above visualization we could see a positive relationship
between distance as well as average speed with a seemingly natural
maximum at around 500-600 miles per hour (although we do have an outlier
at 700 miles per hour). In general, the higher the distance traveled,
the higher is the average speed. This may be explained by different
types of carrier being used for long haul flights (e.g. type of aircraft
used may be larger and they may travel faster), altitude as well as air
traffic and airports (taxiing, air traffic congestions at respective
airports, airports operational efficiency etc).
It is also worth noting that our physical inspection of the scatterplot indicates that the relationship between distance and average speed is not linear. Data may need to be transformed if we were to explain the relationship using linear models.
According to United Airlines, any distance above 1000 miles can already be considered as long haul flights. This will eliminate some airlines who only specialize in shorter distance destinations, for example, Skywest that mainly facilitates flights from New York to Cleveland.
flights %>%
na.omit() %>%
filter(distance > 1000) %>%
group_by(carrier) %>%
summarise(freq = n(), mean_distance=mean(distance)) %>%
arrange(desc(freq))
## # A tibble: 14 × 3
## carrier freq mean_distance
## <chr> <int> <dbl>
## 1 UA 40608 1900.
## 2 B6 29719 1532.
## 3 DL 27850 1615.
## 4 AA 23084 1599.
## 5 EV 5886 1097.
## 6 VX 5116 2499.
## 7 WN 3765 1556.
## 8 9E 2560 1121.
## 9 US 2240 2143.
## 10 MQ 2188 1059.
## 11 AS 709 2402
## 12 F9 681 1620
## 13 HA 342 4983
## 14 OO 4 1008
United Airlines do the most long-haul flights with mean distance of 1,900 miles (note that this is not UA’s average distance flights, this is just the average distance for flights above 1000 miles for UA). United Airlines operates more than 40,600 long haul flights in 2013, more than any other airlines on the list.
The most popular destination (origin and destination) from each origin for each of the 3 airports are given follows:
popular_destinations <- flights %>%
na.omit() %>%
group_by(origin, dest) %>%
summarise(n_flights = n()) %>%
arrange(desc(n_flights)) %>%
slice_head(n=10)
## `summarise()` has grouped output by 'origin'. You can override using the
## `.groups` argument.
popular_destinations
## # A tibble: 30 × 3
## # Groups: origin [3]
## origin dest n_flights
## <chr> <chr> <int>
## 1 EWR ORD 5828
## 2 EWR BOS 5247
## 3 EWR SFO 5064
## 4 EWR CLT 4893
## 5 EWR MCO 4893
## 6 EWR ATL 4876
## 7 EWR LAX 4867
## 8 EWR IAH 3923
## 9 EWR FLL 3740
## 10 EWR DFW 3014
## # ℹ 20 more rows
Summary:
Out of Newark airport, flights from New Jersey to Chicago (ORD) is the most popular at ~5,800, followed by flights from New Jersey to Boston at ~5,200
Out of JFK airport, flights from New York to LA is the most popular with more than 11,000 flights in 2013 alone. This route also has the highest amount of flights during the year among the three United States airport. In fact, the LA - New York route is almost double the flights from New Jersey to Chicago. Second most popular route from JFK is flight from New York to San Fransisco (more than 8,100 flights in 2013)
Out of La Guardia, flights from LA to Atlanta is the most popular at more than 10,000 flights in 2013, followed by flight from New York to Chicago at 8,507 flights
If we were to aggregate the data for the three airports, what would be the most popular destinations with New York as origin?
flights %>%
na.omit() %>%
group_by(dest) %>%
summarise(n_flights=n()) %>%
arrange(desc(n_flights))
## # A tibble: 104 × 2
## dest n_flights
## <chr> <int>
## 1 ATL 16837
## 2 ORD 16566
## 3 LAX 16026
## 4 BOS 15022
## 5 MCO 13967
## 6 CLT 13674
## 7 SFO 13173
## 8 FLL 11897
## 9 MIA 11593
## 10 DCA 9111
## # ℹ 94 more rows
Atlanta and Chicago are two most popular flight destinations from all three Greater New York airports with combined more than 33,000 flights in 2013.