The nycflights13 data set compiles information on the airlines operating out of several New York City airports in 2013, along with the individual flights, aircraft, airports, and weather conditions for that year. The data is organized into five tables: airlines, airports, flights, planes, and weather. Because the data is split this way, we can join tables to combine many features for more complex analysis, or work with one part of the larger data set on its own.
# Load standard libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
This data set contains all 336,776 flights that departed from New York City in 2013.
A quick review of the data shows that the flights data set consists of 336,776 rows and 19 variables. Looking at the head of the data set, we can see that some flights arrive and depart on the same day, while others only arrive or depart on a given day.
tail(flights)
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 9 30 NA 1842 NA NA 2019
2 2013 9 30 NA 1455 NA NA 1634
3 2013 9 30 NA 2200 NA NA 2312
4 2013 9 30 NA 1210 NA NA 1330
5 2013 9 30 NA 1159 NA NA 1344
6 2013 9 30 NA 840 NA NA 1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
The tail of the data set shows flights from September 30 rather than from the end of December, so the data is not sorted by date. I have therefore sorted the data set by month and day.
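The sorting step is described but its code is not shown. A minimal sketch of how it might look, assuming dplyr's arrange(); the name flights_sorted and the tie-break on sched_dep_time are my own assumptions, not part of the original analysis:

```r
library(dplyr)
library(nycflights13)

# Sort by calendar date; within a day, order by scheduled departure time
# (the tie-break column is an assumption made for this sketch).
flights_sorted <- flights |>
  arrange(month, day, sched_dep_time)

# The last rows should now fall at the end of December.
tail(flights_sorted)
```

Note that arrange() returns a new tibble and leaves flights itself unchanged, so later steps that use flights still see the original row order.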
summary(flights)
year month day dep_time sched_dep_time
Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 106
1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 906
Median :2013 Median : 7.000 Median :16.00 Median :1401 Median :1359
Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349 Mean :1344
3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
NA's :8255
dep_delay arr_time sched_arr_time arr_delay
Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
Median : -2.00 Median :1535 Median :1556 Median : -5.000
Mean : 12.64 Mean :1502 Mean :1536 Mean : 6.895
3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
NA's :8255 NA's :8713 NA's :9430
carrier flight tailnum origin
Length:336776 Min. : 1 Length:336776 Length:336776
Class :character 1st Qu.: 553 Class :character Class :character
Mode :character Median :1496 Mode :character Mode :character
Mean :1972
3rd Qu.:3465
Max. :8500
dest air_time distance hour
Length:336776 Min. : 20.0 Min. : 17 Min. : 1.00
Class :character 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00
Mode :character Median :129.0 Median : 872 Median :13.00
Mean :150.7 Mean :1040 Mean :13.18
3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00
Max. :695.0 Max. :4983 Max. :23.00
NA's :9430
minute time_hour
Min. : 0.00 Min. :2013-01-01 05:00:00.00
1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00.00
Median :29.00 Median :2013-07-03 10:00:00.00
Mean :26.23 Mean :2013-07-03 05:22:54.64
3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00.00
Max. :59.00 Max. :2013-12-31 23:00:00.00
We can find all flights that departed more than 120 minutes (two hours) late:
deflights <- flights |>
  filter(dep_delay > 120)

ggplot(deflights, aes(x = carrier, y = dep_delay, color = carrier)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Boxplot of Departure Delays by Carrier in 2013",
    x = "Carrier",
    y = "Departure Delay By Minutes"
  )
The box plot shows that the medians and interquartile ranges are comparable for most of the airlines. The dots are outliers. One thing that stands out is that American Eagle Airlines (MQ) and American Airlines (AA) are among the few carriers with delays of more than 1,000 minutes. Another is that most of the outliers, for every carrier, sit below 1,000 minutes. Overall, the graph gives a clear picture of how long the departure delays run for each carrier.
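The points beyond 1,000 minutes can also be listed directly to see which carriers they belong to. A small sketch, assuming dplyr; the name extreme_delays and the choice of columns are mine:

```r
library(dplyr)
library(nycflights13)

# Keep only departures delayed by more than 1000 minutes and show
# the columns needed to identify each flight, largest delay first.
extreme_delays <- flights |>
  filter(dep_delay > 1000) |>
  select(month, day, carrier, flight, origin, dest, dep_delay) |>
  arrange(desc(dep_delay))

extreme_delays

# How many such flights each carrier has
count(extreme_delays, carrier)
```

Reading the carrier column of this table is a more reliable check than reading dots off a plot, where points from neighboring carriers are easy to confuse.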
# Remove duplicate rows, if any
flights |> distinct()

# Number of departures getting cancelled
sum(is.na(flights$dep_time))
[1] 8255
Treating a missing departure time (NA in dep_time) as a cancellation, 8,255 flight departures were canceled.
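The other timing columns have missing values too, as the summary output reports (8,255 for dep_time and dep_delay, 8,713 for arr_time, 9,430 for arr_delay). A short sketch that counts them in one step, assuming dplyr's across(); the name na_counts is mine:

```r
library(dplyr)
library(nycflights13)

# Count the missing values in each timing column in a single pass.
na_counts <- flights |>
  summarize(across(
    c(dep_time, dep_delay, arr_time, arr_delay),
    \(x) sum(is.na(x))
  ))

na_counts
```

More flights are missing an arrival delay than a departure time, so the cancellation count depends on which column is used to define it.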
# Filter out rows with missing departure times
Total_data <- filter(flights, !is.na(dep_time))
# Group the filtered data by carrier
bycarrier_total <- group_by(Total_data, carrier)
# Calculate the total count of flights for each carrier
sumCarrier_total <- summarize(bycarrier_total, count = n())

# Filter rows where both departure delay and arrival delay are greater than 0
Carrier_data <- filter(flights, dep_delay > 0 & arr_delay > 0)
# Group the filtered data by carrier
bycarrier <- group_by(Carrier_data, carrier)
# Calculate the count of flights meeting the criteria for each carrier
sumCarrier <- summarize(bycarrier, count = n())

# Set column names for the TotalCount data frame
colnames(sumCarrier_total) <- c("carrier", "TotalCount")
# Merge the sumCarrier and sumCarrier_total data frames by the "carrier" column
sumCarrier_final <- merge(x = sumCarrier, y = sumCarrier_total, by = "carrier", all = TRUE)
# Calculate the percentage of delay and add it as a new column
sumCarrier_final$percent_delay <- with(sumCarrier_final, (count / TotalCount) * 100)

# Set up the plot parameters
par(mfrow = c(1, 1))
# Assign unique colors to each carrier
carrier_colors <- rainbow(length(sumCarrier_final$carrier))
# Create a bar plot with colors based on the carrier
barplot(sumCarrier_final$percent_delay,
        main = "Percent Delay by Carrier through 2013",
        xlab = "Carrier",
        ylab = "Percent Delay",
        names.arg = sumCarrier_final$carrier,
        border = "red",
        col = carrier_colors,
        cex.names = 0.7)
# A tibble: 16 × 2
carrier mean
<chr> <dbl>
1 9E 60.6
2 AA 53.1
3 AS 45.4
4 B6 51.6
5 DL 53.1
6 EV 58.5
7 F9 63.6
8 FL 51.9
9 HA 72.4
10 MQ 54.6
11 OO 74.6
12 UA 44.4
13 US 45.0
14 VX 57.4
15 WN 47.9
16 YV 62.9
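The code that builds meanCarrier (the table above, which the bar plot that follows also uses) is not shown. One plausible reconstruction, assuming it averages arr_delay over the same delayed flights as Carrier_data; this is my guess, and the values it produces may not match the table exactly:

```r
library(dplyr)
library(nycflights13)

# Assumption: the mean is taken over flights that were late both
# departing and arriving, mirroring the Carrier_data filter.
meanCarrier <- flights |>
  filter(dep_delay > 0, arr_delay > 0) |>
  group_by(carrier) |>
  summarize(mean = mean(arr_delay))

meanCarrier
```

If the original used a different filter, for example only arr_delay > 0, the averages would shift, so the table should be treated as the reference and this sketch only as a way to reproduce its shape.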
# Create a vector of unique colors for each carrier
carrier_colors <- rainbow(length(meanCarrier$carrier))
# Create the bar plot with colors based on the carrier
barplot(meanCarrier$mean,
        main = "Average Arrival Delay for each Carrier",
        xlab = "Carrier",
        ylab = "Avg. Arrival Delay",
        names.arg = meanCarrier$carrier,
        border = "blue",
        col = carrier_colors,
        cex.names = 0.7)
A carrier's performance can be evaluated by (1) the proportion of its flights that have both a departure and an arrival delay, and (2) its average arrival delay over the course of 2013.
• First, the plot Percent Delay by Carrier through 2013 shows that carrier FL performs worst, with the highest delay percentage, while carrier HA performs best.
• Second, the plot Average Arrival Delay for each Carrier shows that OO and HA have longer arrival delays than the other carriers (74.6 and 72.4 minutes in the table above). UA, US, and AS perform best from this angle, with the lowest averages. I have thought about