NYC Flights Homework

The nycflights13 package compiles data on flights that departed New York City airports in 2013. The data is organized into five related tables: airlines, airports, flights, planes (aircraft), and weather. Because the tables share key columns, we can join them to perform more complex analyses, or work with a single table on its own.
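
As a minimal sketch of how the tables can be combined, the snippet below attaches each carrier's full name to the flights table through the shared `carrier` column. The object names `tables` and `flights_named` are illustrative choices, not part of the original analysis.

```r
library(dplyr)
library(nycflights13)

# The five tables shipped with the package, with their row counts
tables <- list(airlines = airlines, airports = airports, flights = flights,
               planes = planes, weather = weather)
sapply(tables, nrow)

# Attach the carrier's full name to every flight via the shared `carrier` key
flights_named <- flights |>
  left_join(airlines, by = "carrier") |>
  select(year, month, day, carrier, name, origin, dest)

head(flights_named)
```

A left join keeps every row of `flights`, so the joined table has the same number of rows as the original.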

# Load standard libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)

This data set contains all 336,776 flights that departed from New York City in 2013.

nycflights13::flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

After a quick review of the data, it is clear that the flights data set consists of 336,776 rows and 19 variables. Looking at the head of the data set, we can see that each row records the departure date (year, month, day), that departure and arrival times are stored as integers in HHMM form (517 means 5:17), and that dep_delay is negative when a flight leaves early.
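
The time columns in the printout (for example 517 and 1004) read as clock times in HHMM form, so they cannot be subtracted directly. A small sketch of one way to convert them to minutes since midnight follows; `hhmm_to_minutes` is a hypothetical helper name introduced here for illustration.

```r
library(dplyr)
library(nycflights13)

# Convert an HHMM integer (e.g. 517 = 5:17) to minutes since midnight.
# hhmm_to_minutes is a hypothetical helper, not part of any package.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

hhmm_to_minutes(c(517, 1004))
# 317 604

# Apply it to the actual departure times
flights |>
  select(dep_time) |>
  mutate(dep_minutes = hhmm_to_minutes(dep_time)) |>
  head()
```

With this formula a value of 2400 maps to 1440, so midnight may need separate handling depending on the analysis.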

tail(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     9    30       NA           1842        NA       NA           2019
2  2013     9    30       NA           1455        NA       NA           1634
3  2013     9    30       NA           2200        NA       NA           2312
4  2013     9    30       NA           1210        NA       NA           1330
5  2013     9    30       NA           1159        NA       NA           1344
6  2013     9    30       NA            840        NA       NA           1020
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

The tail shows flights from September 30 rather than December 31, so the data set is not in calendar order; I have therefore sorted it by month and day. The last rows also have NA for the departure and arrival times.
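
The sorting step is not shown in the output, so here is a minimal sketch of how it could be done with `arrange()`. The name `flights_sorted` and the use of `sched_dep_time` as a tie-breaker within a day are assumptions made for this example.

```r
library(dplyr)
library(nycflights13)

# Sort chronologically; sched_dep_time orders flights within the same day
flights_sorted <- flights |>
  arrange(month, day, sched_dep_time)

# The last rows should now belong to December 31
tail(flights_sorted)
```

Note that `arrange()` returns a new tibble, so the result has to be assigned for later steps to use the sorted order.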

summary(flights)
      year          month             day           dep_time    sched_dep_time
 Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
 1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
 Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
 Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
 3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
 Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
                                                 NA's   :8255                 
   dep_delay          arr_time    sched_arr_time   arr_delay       
 Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
 1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
 Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
 Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
 3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
 Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
 NA's   :8255      NA's   :8713                  NA's   :9430      
   carrier              flight       tailnum             origin         
 Length:336776      Min.   :   1   Length:336776      Length:336776     
 Class :character   1st Qu.: 553   Class :character   Class :character  
 Mode  :character   Median :1496   Mode  :character   Mode  :character  
                    Mean   :1972                                        
                    3rd Qu.:3465                                        
                    Max.   :8500                                        
                                                                        
     dest              air_time        distance         hour      
 Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
 Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
 Mode  :character   Median :129.0   Median : 872   Median :13.00  
                    Mean   :150.7   Mean   :1040   Mean   :13.18  
                    3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
                    Max.   :695.0   Max.   :4983   Max.   :23.00  
                    NA's   :9430                                  
     minute        time_hour                     
 Min.   : 0.00   Min.   :2013-01-01 05:00:00.00  
 1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00.00  
 Median :29.00   Median :2013-07-03 10:00:00.00  
 Mean   :26.23   Mean   :2013-07-03 05:22:54.64  
 3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00.00  
 Max.   :59.00   Max.   :2013-12-31 23:00:00.00  
                                                 

We can find all flights that departed more than 120 minutes (two hours) late and compare their delays by carrier:

# Keep only flights that left more than two hours late
deflights <- flights |>
  filter(dep_delay > 120)

# One horizontal box per carrier
ggplot(deflights, aes(x = carrier, y = dep_delay, color = carrier)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Boxplot of Departure Delays by Carrier in 2013",
       x = "Carrier",
       y = "Departure Delay (minutes)")

Because the data was filtered first, each box describes only flights that left more than two hours late. The box plot shows that the medians and interquartile ranges are comparable for most carriers; the dots are outliers. American Eagle Airlines (MQ) and American Airlines (AA) stand out with delays of more than 1000 minutes, and the summary above lists a maximum departure delay of 1301 minutes. Most outliers for every carrier fall below 1000 minutes. Overall, the graph gives a compact view of how long departure delays are distributed across carriers.
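
Reading exact values off a box plot is difficult, so a numeric summary of the same filtered data can help confirm what the plot suggests. The sketch below is one possible way to do this; the name `long_delays` and its column names are chosen for illustration.

```r
library(dplyr)
library(nycflights13)

# Per-carrier summary of flights that left more than two hours late
long_delays <- flights |>
  filter(dep_delay > 120) |>
  group_by(carrier) |>
  summarize(
    n_flights    = n(),
    median_delay = median(dep_delay),
    max_delay    = max(dep_delay)
  ) |>
  arrange(desc(max_delay))

long_delays
```

Sorting by `max_delay` puts the carriers with the most extreme outliers at the top of the table.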

# Check for duplicate rows: distinct() drops exact duplicates, and the result
# still has 336,776 rows, so flights contains none
flights |> 
  distinct()
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# Count flights with no recorded departure time (treated here as cancelled)
sum(is.na(flights$dep_time))
[1] 8255

dep_time is NA for flights that never departed, so 8,255 flights were cancelled in 2013.
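
The summary output earlier reports NA counts for several other columns as well. As a sketch, the counts for the time and delay columns can be collected in one row with `across()`; the name `na_counts` is illustrative.

```r
library(dplyr)
library(nycflights13)

# Count missing values in each time-related column
na_counts <- flights |>
  summarize(across(c(dep_time, dep_delay, arr_time, arr_delay, air_time),
                   ~ sum(is.na(.x))))

na_counts
```

The arrival columns have more missing values than the departure columns, so some flights have a departure time but no recorded arrival delay.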

# Filter out rows with missing departure times
Total_data <- filter(flights, !is.na(dep_time))

# Group the filtered data by carrier
bycarrier_total <- group_by(Total_data, carrier)

# Calculate the total count of flights for each carrier
sumCarrier_total <- summarize(bycarrier_total, count = n())
# Filter rows where both departure delay and arrival delay are greater than 0
Carrier_data <- filter(flights, dep_delay > 0 & arr_delay > 0)

# Group the filtered data by carrier
bycarrier <- group_by(Carrier_data, carrier)

# Calculate the count of flights meeting the criteria for each carrier
sumCarrier <- summarize(bycarrier, count = n())
# Set column names for the TotalCount data frame
colnames(sumCarrier_total) <- c("carrier", "TotalCount")

# Merge the sumCarrier and sumCarrier_total data frames by the "carrier" column
sumCarrier_final <- merge(x = sumCarrier, y = sumCarrier_total, by = "carrier", all = TRUE)

# Calculate the percentage of delay and add it as a new column
sumCarrier_final$percent_delay <- with(sumCarrier_final, (count / TotalCount) * 100)

# Set up the plot parameters
par(mfrow = c(1, 1))

# Assign unique colors to each carrier
carrier_colors <- rainbow(length(sumCarrier_final$carrier))

# Create a bar plot with colors based on the carrier
barplot(sumCarrier_final$percent_delay,
        main = "Percent Delay by Carrier through 2013",
        xlab = "Carrier",
        ylab = "Percent Delay",
        names.arg = sumCarrier_final$carrier,
        border = "red",
        col = carrier_colors, 
        cex.names = 0.7)
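
The same percentages can also be computed in a single dplyr pipeline, without the intermediate objects and the `merge()` step. This is an alternative sketch rather than the code used for the plot above; the column names mirror the ones used there, and the object name `percent_delay` is an assumption.

```r
library(dplyr)
library(nycflights13)

percent_delay <- flights |>
  filter(!is.na(dep_time)) |>          # departed flights only
  group_by(carrier) |>
  summarize(
    TotalCount    = n(),
    # Flights late at both departure and arrival; NA comparisons are dropped
    count         = sum(dep_delay > 0 & arr_delay > 0, na.rm = TRUE),
    percent_delay = 100 * count / TotalCount
  ) |>
  arrange(desc(percent_delay))

percent_delay
```

Using `na.rm = TRUE` inside `sum()` matches the behaviour of `filter()`, which silently drops rows where the condition evaluates to NA.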

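# Mean arrival delay per carrier, computed only over flights that were
# delayed at both departure and arrival (bycarrier)
(meanCarrier <- summarize(bycarrier, mean = mean(arr_delay)))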
# A tibble: 16 × 2
   carrier  mean
   <chr>   <dbl>
 1 9E       60.6
 2 AA       53.1
 3 AS       45.4
 4 B6       51.6
 5 DL       53.1
 6 EV       58.5
 7 F9       63.6
 8 FL       51.9
 9 HA       72.4
10 MQ       54.6
11 OO       74.6
12 UA       44.4
13 US       45.0
14 VX       57.4
15 WN       47.9
16 YV       62.9
# Create a vector of unique colors for each carrier
carrier_colors <- rainbow(length(meanCarrier$carrier))

# Create the bar plot with colors based on the carrier
barplot(meanCarrier$mean,
        main = "Average Arrival Delay for each Carrier",
        xlab = "Carrier",
        ylab = "Avg. Arrival Delay",
        names.arg = meanCarrier$carrier,
        border = "blue",
        col = carrier_colors,  
        cex.names = 0.7)
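
The averages plotted above are computed only over flights that were late at both departure and arrival, which is why every value is well above zero. As a hedged sketch, the snippet below places that conditional mean next to the mean over all flights with a recorded arrival delay; the names `delay_means`, `mean_all`, and `mean_delayed` are illustrative.

```r
library(dplyr)
library(nycflights13)

delay_means <- flights |>
  group_by(carrier) |>
  summarize(
    # Mean over every flight with a recorded arrival delay
    # (early arrivals count as negative values)
    mean_all     = mean(arr_delay, na.rm = TRUE),
    # Mean over flights late at both ends, matching meanCarrier
    mean_delayed = mean(arr_delay[which(dep_delay > 0 & arr_delay > 0)])
  ) |>
  arrange(desc(mean_delayed))

delay_means
```

Wrapping the condition in `which()` drops rows where either delay is NA, so the conditional mean needs no `na.rm` argument.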

A carrier's efficiency can be evaluated by (1) the percentage of its departed flights that were delayed at both departure and arrival, and (2) the average arrival delay, in minutes, of those delayed flights over the course of 2013.

• First, the graphic (Percent Delay by Carrier through 2013) shows that carrier FL has the highest delay percentage and therefore the worst performance, while carrier HA has the lowest delay percentage and performs best.

• Second, the graphic (Average Arrival Delay for each Carrier) shows that OO and HA have the longest average arrival delays (74.6 and 72.4 minutes). Carriers UA and US have the shortest (44.4 and 45.0 minutes) and perform best from this angle. I have thought about