NYC Flights Project - Emilio Sanchez San Martin

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(nycflights13)
library(RColorBrewer)
data(flights)

After exploring which variables to work with (see below), I decided to look at which flight carriers stick to their scheduled departure and arrival times the most. For each carrier, I take the flights that departed and arrived on time or early and calculate the percentage that departed and arrived exactly on schedule (a delay of 0 minutes). There could be a relationship between how well schedules are kept and which carrier operates the flight. I also calculated the typical (median) duration of the flights because I find it interesting, and the interquartile range of the duration to compare the spread between carriers (for fun). I tried to plot the durations too, but nothing seemed to work, and I liked my other graph more.

First, I wanted to find out which variables contain NAs. I used the code below to figure it out, which I found with the help of ChatGPT.

colSums(is.na(flights))

          year          month            day       dep_time sched_dep_time 
             0              0              0           8255              0 
     dep_delay       arr_time sched_arr_time      arr_delay        carrier 
          8255           8713              0           9430              0 
        flight        tailnum         origin           dest       air_time 
             0           2512              0              0           9430 
      distance           hour         minute      time_hour 
             0              0              0              0

Then I’m going to remove the rows that have NAs in the time and delay columns I need. I’m going to use !is.na() inside filter() to do this.

flights_nona <- flights |>
  filter(!is.na(dep_time), !is.na(arr_time), !is.na(dep_delay), !is.na(arr_delay), !is.na(air_time))
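
As a side note, the same kind of filter can be written more compactly with `drop_na()` from tidyr. This is only a sketch on a made-up three-row tibble (`toy` and its values are hypothetical, not taken from `flights`):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for a few flights columns
toy <- tibble(
  dep_time = c(517, NA, 542),
  arr_time = c(830, NA, 923),
  air_time = c(227, NA, NA)
)

# Style used above: one !is.na() test per column
by_filter <- toy |>
  filter(!is.na(dep_time), !is.na(arr_time), !is.na(air_time))

# Alternative: drop rows that have an NA in any of the listed columns
by_drop_na <- toy |>
  drop_na(dep_time, arr_time, air_time)

# Both versions keep only the first row
by_filter
by_drop_na
```

Listing the columns inside `drop_na()` matters: called with no columns, it drops rows with an NA in any column at all.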

To get an idea of which variables I should work with and what I personally find most interesting, I made a flights2 data set. I won’t use it later; it is just for me to decide what to do next.

flights2 <- flights_nona |>
  select(dep_time, arr_time, dep_delay, arr_delay, air_time, distance, carrier, flight, tailnum, origin, dest)

I’m going to mutate the data set to add a new variable that shows the flight duration in hours. I had to use ChatGPT for this because I had no idea what air_time stood for. It is the time the plane spent in the air, in minutes, so to get hours I had to divide by 60. I tried to understand air_time on my own everywhere and couldn’t, haha.

flights_nona <- flights_nona |>
  mutate(duration = air_time / 60)
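
To make the conversion concrete, here is a small sketch in base R. The three `air_time` values are example minutes chosen for illustration, and the `hours`/`minutes` split is an extra I added, not part of the analysis:

```r
# Example air times in minutes (illustrative values)
air_time <- c(183, 116, 53)

# Same conversion as above: minutes divided by 60 gives decimal hours
duration <- air_time / 60
round(duration, 2)   # 3.05 1.93 0.88

# Optional: split into whole hours and leftover minutes
hours   <- air_time %/% 60   # integer division: 3 1 0
minutes <- air_time %% 60    # remainder:        3 56 53
paste0(hours, " h ", minutes, " min")
```

So a duration of 3.05 means 3 hours and 3 minutes, not 3 hours and 5 minutes.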

I want to find, for each carrier, the flights that followed their schedule with no delays: departing on time (or early) and arriving on time (or early). I also want to see how long each of those flights took, in hours. This might not be needed, but I am curious.

ontime_flights <- flights_nona |>
  select(dep_delay, arr_delay, carrier, duration) |>
  filter(dep_delay <= 0 & arr_delay <= 0)
head(ontime_flights)

# A tibble: 6 × 4
  dep_delay arr_delay carrier duration
      <dbl>     <dbl> <chr>      <dbl>
1        -1       -18 B6         3.05 
2        -6       -25 DL         1.93 
3        -3       -14 EV         0.883
4        -3        -8 B6         2.33 
5        -2        -2 B6         2.48 
6        -2        -3 B6         2.63

Now that I have my ontime_flights data set, I’m going to find, for each carrier, the percentage of these flights that arrived and departed exactly on schedule (a delay of 0 minutes). I’m going to use the summarize function to do this. I’m also going to find the interquartile range (IQR) for the duration of the flights. Since there is a lot of data and maybe some outliers, the IQR will show me the range of the middle 50% of the data. I calculate it as Q3 minus Q1, using the quantile function.
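
As a quick check of the IQR idea, here is a sketch on a made-up vector of durations (the values are hypothetical). It compares Q3 minus Q1 from `quantile()` with base R's built-in `IQR()`, which uses the same default quantile method:

```r
# Hypothetical flight durations in hours
duration <- c(0.9, 1.8, 2.0, 2.3, 3.1, 5.5)

# Quartiles via quantile(); names = FALSE drops the "25%" / "75%" labels
q1 <- quantile(duration, 0.25, names = FALSE)
q3 <- quantile(duration, 0.75, names = FALSE)

iqr_by_hand  <- q3 - q1          # width of the middle 50% of the data
iqr_built_in <- IQR(duration)    # base R shortcut for the same quantity

all.equal(iqr_by_hand, iqr_built_in)
```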

ontime_flights_average <- ontime_flights |>
  group_by(carrier) |>
  summarize(
    total_flights = n(), # Number of on-time (or early) flights the carrier had
    ontime_arr_percent = mean((arr_delay == 0) * 100), # % of these flights that arrived exactly on schedule
    ontime_depart_percent = mean((dep_delay == 0) * 100), # % of these flights that departed exactly on schedule
    Q1 = quantile(duration, 0.25),
    median = median(duration),
    Q3 = quantile(duration, 0.75),
    IQR = Q3 - Q1) |>
  arrange(desc(ontime_depart_percent), desc(ontime_arr_percent))
head(ontime_flights_average)

# A tibble: 6 × 8
  carrier total_flights ontime_arr_percent ontime_depart_percent    Q1 median
  <chr>           <int>              <dbl>                 <dbl> <dbl>  <dbl>
1 WN               4498               2.67                 17.2  1.82    1.98
2 VX               2371               2.61                 12.9  5.33    5.57
3 F9                205               4.88                 11.7  3.67    3.8 
4 UA              25041               1.72                  9.87 2.15    3.13
5 DL              26185               2.05                  8.06 1.87    2.4 
6 B6              25504               2.31                  7.18 0.917   2.27
# ℹ 2 more variables: Q3 <dbl>, IQR <dbl>
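
To show what the two percentage columns measure, here is a sketch in base R on a made-up vector of arrival delays (the values are hypothetical). Taking the mean of a TRUE/FALSE test gives the share of TRUE values, so `== 0` counts flights exactly on schedule, while the share of early flights and the average minutes early are different quantities:

```r
# Hypothetical arrival delays in minutes for five on-time-or-early flights
arr_delay <- c(-18, -25, 0, -8, 0)

# Same calculation as in summarize(): % arriving exactly on schedule
exact_percent <- mean((arr_delay == 0) * 100)       # 2 of 5 flights -> 40

# Different question: % arriving early
early_percent <- mean(arr_delay < 0) * 100          # 3 of 5 flights -> 60

# Different question again: average minutes early among the early flights
avg_minutes_early <- mean(-arr_delay[arr_delay < 0])  # (18 + 25 + 8) / 3 -> 17
```

This is why the percentages in the table are small: most flights in ontime_flights were early rather than exactly on time.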

The code below is just for me to see the total number of flights for each carrier. I was curious and wanted to compare the flights that arrived and departed on time with the flights that didn’t. That would be a whole other data set and graph, so I’m not going to do that. My main goal is to look at the flights that were on time (or early) for both departure and arrival, and see what percentage of them departed and arrived exactly on schedule. That way, we can see which carriers stick closest to their schedule.

Testing <- flights_nona |>
  select(carrier) |>
  group_by(carrier) |>
  summarize(
    total_flights = n())
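
For anyone curious about the comparison I skipped, here is a rough sketch of how it could look, using a made-up tibble with two invented carriers ("AA" and "BB" and their delays are hypothetical, not real results). It counts all flights per carrier and the share that departed and arrived on time or early:

```r
library(dplyr)

# Hypothetical toy data: delays in minutes for two made-up carriers
toy <- tibble(
  carrier   = c("AA", "AA", "AA", "BB", "BB"),
  dep_delay = c(-2, 5, 0, -1, 30),
  arr_delay = c(-10, 12, -3, 4, 25)
)

ontime_by_carrier <- toy |>
  group_by(carrier) |>
  summarize(
    total_flights  = n(),                                    # all flights
    ontime_flights = sum(dep_delay <= 0 & arr_delay <= 0),   # on time or early
    ontime_share   = 100 * mean(dep_delay <= 0 & arr_delay <= 0)
  )

ontime_by_carrier
```

If only the totals are needed, `count(carrier)` does the same job as the group_by and summarize pair above.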

I’m going to round the percentages and the duration statistics (Q1, median, Q3, IQR) to 2 decimal places.

final_flights_average <- ontime_flights_average |>
  mutate(
    ontime_arr_percent = round(ontime_arr_percent, 2),
    ontime_depart_percent = round(ontime_depart_percent, 2),
    Q1 = round(Q1, 2),
    median = round(median, 2),
    Q3 = round(Q3, 2),
    IQR = round(IQR, 2)) |>
  arrange(desc(ontime_depart_percent))

# Order the carrier levels from highest to lowest ontime_depart_percent so the
# bars are plotted in that order. I used ChatGPT for this because I couldn't
# figure out the code myself.
final_flights_average$carrier <- factor(
  final_flights_average$carrier,
  levels = final_flights_average$carrier[
    order(final_flights_average$ontime_depart_percent, decreasing = TRUE)
  ]
)

head(final_flights_average)

# A tibble: 6 × 8
  carrier total_flights ontime_arr_percent ontime_depart_percent    Q1 median
  <fct>           <int>              <dbl>                 <dbl> <dbl>  <dbl>
1 WN               4498               2.67                 17.2   1.82   1.98
2 VX               2371               2.61                 12.9   5.33   5.57
3 F9                205               4.88                 11.7   3.67   3.8 
4 UA              25041               1.72                  9.87  2.15   3.13
5 DL              26185               2.05                  8.06  1.87   2.4 
6 B6              25504               2.31                  7.18  0.92   2.27
# ℹ 2 more variables: Q3 <dbl>, IQR <dbl>

Now it’s time to plot!

final_graph <- final_flights_average |>
  ggplot(aes(x = carrier)) +
  geom_bar(aes(y = ontime_arr_percent, fill = "On-time Arrival"), stat = "identity", color = "black", alpha = 0.9, position = "dodge") +
  geom_bar(aes(y = ontime_depart_percent, fill = "On-time Departure"), stat = "identity", color = "black", alpha = 0.3, position = "dodge") +
  labs(title = "On-Time Arrival and Departure Percentage by Carrier",
       x = "Carrier",
       y = "Percentage (%)",
       caption = "Source: nycflights13 R package (flights departing NYC in 2013)",
       fill = "Flights exactly\non schedule") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(breaks = seq(0, 20, by = 2)) # Used ChatGPT for this part

final_graph
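
In the graph above the two sets of bars are drawn on top of each other, because each geom_bar layer is dodged separately. If side-by-side bars are wanted instead, one option is to reshape the data to long format first. This is only a sketch: it uses three rows copied from the summary table above, and the names `toy`, `toy_long`, `measure`, `percent`, and `dodged_graph` are my own placeholders:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Three rows copied from the summary table above
toy <- tibble(
  carrier = c("WN", "VX", "F9"),
  ontime_arr_percent = c(2.67, 2.61, 4.88),
  ontime_depart_percent = c(17.2, 12.9, 11.7)
)

# Long format: one row per carrier and measure
toy_long <- toy |>
  pivot_longer(
    cols = c(ontime_arr_percent, ontime_depart_percent),
    names_to = "measure",
    values_to = "percent"
  )

# A single layer with fill mapped to the measure lets "dodge" place bars side by side
dodged_graph <- ggplot(toy_long, aes(x = carrier, y = percent, fill = measure)) +
  geom_col(position = "dodge", color = "black") +
  labs(x = "Carrier", y = "Percentage (%)", fill = "Measure") +
  theme_minimal()

dodged_graph
```

geom_col is the same as geom_bar with stat = "identity", so the bar heights are read straight from the percent column.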

Paragraph Below

There were many steps to completing this graph. I started by exploring the NYC Flights 2013 data set until I knew which variables I wanted to use and which question I wanted to answer. I figured that if I wanted to make my project unique, I had to think of something that most people wouldn’t really think about or know how to do. My interest in how closely each carrier sticks to its scheduled departure and arrival times told me which variables I needed, and that’s when I started manipulating the data: removing the NAs, keeping the flights that departed and arrived on time or early, and summarizing them by carrier.

I then used the ggplot function to create a bar graph that shows, for each carrier, the percentage of those flights that arrived and departed exactly on schedule. I had to use the geom_bar code twice in order to plot both the arrival AND the departure percentages, and the graph ended up showing that the on-time departure percentage is higher than the on-time arrival percentage for each carrier! I used labs to title each part of my graph, made the bars partly transparent, and gave them black outlines. I also set the x-axis text at a 45 degree angle for visual appeal. Overall, I am pleased with what my graph demonstrates and love that I was able to find a unique way to show a data set that I made myself! I hope you enjoy my graph as much as I do! Can’t wait to present it.