NYCflights23

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)
Warning: package 'nycflights23' was built under R version 4.4.3
data(flights)
data(airlines)
flights
# A tibble: 435,352 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2023     1     1        1           2038       203      328              3
 2  2023     1     1       18           2300        78      228            135
 3  2023     1     1       31           2344        47      500            426
 4  2023     1     1       33           2140       173      238           2352
 5  2023     1     1       36           2048       228      223           2252
 6  2023     1     1      503            500         3      808            815
 7  2023     1     1      520            510        10      948            949
 8  2023     1     1      524            530        -6      645            710
 9  2023     1     1      537            520        17      926            818
10  2023     1     1      547            545         2      845            852
# ℹ 435,342 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
flights2 <- left_join(flights, airlines, by = "carrier")
# Drop the "Inc." / "Co." suffixes, then trim the trailing space they leave behind
flights2$name <- trimws(gsub("Inc\\.|Co\\.", "", flights2$name))
newark <- flights2 |>
  filter(origin == "EWR")
# Unused: United flights from Newark to its other hubs
#united_hubs <- newark |>
  #filter(carrier == "UA",
  #       dest %in% c("ORD", "DEN", "IAH", "LAX", "SFO", "IAD")) |>
  #group_by(dest)
ggplot(newark, aes(x = name, y = distance, fill = name)) +
  geom_boxplot() +
  labs(x = "Airlines",
       y = "Distance (Miles)",
       fill = "Airline",
       title = "Route Distance Distribution by Airline at Newark",
       caption = "Source: Bureau of Transportation Statistics, via nycflights23") +
  scale_fill_brewer(palette = "Paired") +
  coord_flip() +
  theme_minimal()

Essay

I decided to create a fairly basic visualization because I believed that, even though it is simple, it would reveal a lot. My graph is a series of boxplots showing the distribution of flight distances by airline, for flights departing from Newark only. I chose Newark because I like it more than the other airports in the data, and because I needed to use dplyr commands, which filtering down to one airport let me do. I find this graph interesting because you can use it to learn what kind of operation each of these airlines runs at Newark.

SkyWest, Republic, Envoy and Endeavor are all regional airlines that contract with larger airlines to operate their shorter, lower-demand routes. You can see this because their ranges of distances are narrower and their median distances are shorter than those of the other airlines.

Next are some of the other major airlines: Allegiant, American, Delta and JetBlue. They don’t have hubs at Newark, but since it serves the New York City metro area, the country’s biggest market, they still have to offer flights there to stay competitive and give their customers what they want. Because of this, they typically fly only a few routes from Newark, mostly to their own hubs elsewhere in the country, which are farther away, so their distributions contain few short flights. In Allegiant’s case, the minimum, first quartile and median are extremely close together.

Then there is Alaska Airlines, whose Newark service differs from that of the other major US airlines: it is primarily long-haul transcontinental flights. That is why its median is so high and its range so small: it serves only a few destinations, all similarly far away (Seattle, San Francisco, etc.).

Finally, we get to the two airlines that use Newark as a major hub, Spirit and United. That is why their ranges are much larger than those of the other airlines: they serve a much wider variety of destinations. In United’s case, you can also see that it is the only airline at Newark serving Alaska and Hawaii, which appear as the three outliers that lie farther out than any other airline’s destinations.
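
The medians, quartiles and ranges described above are read off the boxplots by eye, so it may help to compute them directly. The following is a minimal sketch, not part of the original analysis: it rebuilds the Newark subset so it can run on its own, and the names `distance_summary` and `longest_routes` are introduced here purely for illustration.

```r
library(dplyr)
library(nycflights23)

# Rebuild the Newark subset so this block is self-contained
newark <- flights |>
  left_join(airlines, by = "carrier") |>
  filter(origin == "EWR")

# Per-airline summary of route distance: the same quantities a boxplot draws,
# plus the number of flights and distinct destinations behind each box
distance_summary <- newark |>
  group_by(name) |>
  summarise(
    n_flights   = n(),
    n_dests     = n_distinct(dest),
    min_dist    = min(distance, na.rm = TRUE),
    q1_dist     = quantile(distance, 0.25, na.rm = TRUE),
    median_dist = median(distance, na.rm = TRUE),
    q3_dist     = quantile(distance, 0.75, na.rm = TRUE),
    max_dist    = max(distance, na.rm = TRUE),
    .groups     = "drop"
  ) |>
  arrange(median_dist)

distance_summary

# The longest routes out of Newark and which airline flies them,
# to check which carrier is responsible for the far outliers
longest_routes <- newark |>
  distinct(name, dest, distance) |>
  slice_max(distance, n = 5)

longest_routes
```

Sorting by `median_dist` should put the regional carriers near the top of the table if the reading of the plot is right, and `n_dests` gives a rough check on which airlines serve only a handful of destinations.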