NYC Flights Homework

Author

Joyce Liang

Data and packages we are using

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)
data(flights)

Cleaning

flights_nona <- flights |>
  filter(!is.na(distance) & !is.na(arr_delay) & !is.na(air_time))
flights_nona 
# A tibble: 422,818 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2023     1     1        1           2038       203      328              3
 2  2023     1     1       18           2300        78      228            135
 3  2023     1     1       31           2344        47      500            426
 4  2023     1     1       33           2140       173      238           2352
 5  2023     1     1       36           2048       228      223           2252
 6  2023     1     1      503            500         3      808            815
 7  2023     1     1      520            510        10      948            949
 8  2023     1     1      524            530        -6      645            710
 9  2023     1     1      537            520        17      926            818
10  2023     1     1      547            545         2      845            852
# ℹ 422,808 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
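The cleaning step above keeps only rows where distance, arr_delay, and air_time are all present. As a minimal sketch, the same idea can be written with tidyr's drop_na(), and the missing values can be counted first to see how much each column contributes. The toy tibble below is made up for illustration and only stands in for flights.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for `flights`; these values are invented for illustration.
toy <- tibble(
  distance  = c(200, NA, 760, 1400),
  arr_delay = c(5, 12, NA, -3),
  air_time  = c(40, 55, 110, NA)
)

# Count the missing values in each column before filtering
na_counts <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Drop rows with a missing value in any of the three columns,
# which should match the chained !is.na() filter used above
toy_nona <- toy |>
  drop_na(distance, arr_delay, air_time)
```

Swapping toy for flights should give the same rows as flights_nona, assuming the three columns are spelled as shown in the printout.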

Naming all abbreviations

flights_carrier_replaced <- flights_nona %>%
  mutate(carrier = recode(carrier,
                           "DL" = "Delta",
                           "AA" = "American",
                           "B6" = "JetBlue",
                           "AS" = "Alaska",
                           "UA" = "United",
                           "NK" = "Spirit",
                           "WN" = "Southwest",
                           "YX" = "Republic",
                           "9E" = "Endeavor",
                           "MQ" = "Envoy",
                           "G4" = "Allegiant",
                           "OO" = "Skywest",
                           "F9" = "Frontier",
                           "HA" = "Hawaii"))
# found this code here >>> https://www.geeksforgeeks.org/how-to-replace-multiple-values-in-data-frame-using-dplyr/

flights_origin_replaced <- flights_carrier_replaced %>%
  mutate(origin = recode(origin, 
                         "EWR" = "Newark",
                         "JFK" = "John F. Kennedy",
                         "SMF" = "Sacramento",
                         "ATL" = "Hartsfield-Jackson Atlanta",
                         "BQN" = "Rafael Hernandez",
                         "CHS" = "Charleston",
                         "DTW" = "Detroit Metropolitan Wayne County",
                         "MIA" = "Miami",
                         "ORD" = "Chicago O'Hare",
                         "IAH" = "George Bush Intercontinental",
                         "FLL" = "Fort Lauderdale-Hollywood",
                         "LGA" = "LaGuardia",
                         "DEN" = "Denver",
                         "MSP" = "Minneapolis–Saint Paul"))

flights_airport_replaced <- flights_origin_replaced %>%
  mutate(dest = recode(dest,
                       "EWR" = "Newark",
                       "JFK" = "John F. Kennedy",
                       "SMF" = "Sacramento",
                       "ATL" = "Hartsfield-Jackson Atlanta",
                       "BQN" = "Rafael Hernandez",
                       "CHS" = "Charleston",
                       "DTW" = "Detroit Metropolitan Wayne County",
                       "MIA" = "Miami",
                       "ORD" = "Chicago O'Hare",
                       "IAH" = "George Bush Intercontinental",
                       "FLL" = "Fort Lauderdale-Hollywood",
                       "LGA" = "LaGuardia",
                       "DEN" = "Denver",
                       "MSP" = "Minneapolis–Saint Paul",
                       "PBI" = "Palm Beach",
                       "BNA" = "Nashville",
                       "MCO" = "Orlando",
                       "MYR" = "Myrtle Beach",
                       "MDW" = "Chicago Midway",
                       "SEA" = "Seattle–Tacoma",
                       "CLE" = "Cleveland Hopkins",
                       "SJU" = "Luis Munoz Marin",
                       "PHX" = "Phoenix Sky Harbor",
                       "STT" = "Cyril E. King",
                       "TPA" = "Tampa",
                       "PIT" = "Pittsburgh",
                       "CMH" = "John Glenn Columbus",
                       "DAL" = "Dallas Love Field",
                       "RSW" = "Southwest Florida",
                       "SLC" = "Salt Lake City",
                       "STL" = "St. Louis Lambert",
                       "BZN" = "Bozeman Yellowstone",
                       "SFO" = "San Francisco",
                       "DFW" = "Dallas Fort Worth",
                       "DCA" = "Ronald Reagan Washington",
                       "SNA" = "John Wayne", 
                       "BOS" = "Boston Logan",
                       "SAN" = "San Diego",
                       "SAV" = "Savannah/Hilton Head",
                       "SDF" = "Louisville",
                       "CLT" = "Charlotte Douglas",
                       "PWM" = "Portland (ME)",
                       "RDU" = "Raleigh-Durham",
                       "SRQ" = "Sarasota Bradenton",
                       "SYR" = "Syracuse Hancock",
                       "MSY" = "Louis Armstrong New Orleans",
                       "AUS" = "Austin-Bergstrom",
                       "BUR" = "Hollywood Burbank",
                       "JAX" = "Jacksonville",
                       "LAS" = "Harry Reid",
                       "EGE" = "Eagle County Regional",
                       "PHL" = "Philadelphia",
                       "CVG" = "Cincinnati/Northern Kentucky",
                       "AVL" = "Asheville Regional",
                       "PSP" = "Palm Springs", 
                       "PDX" = "Portland (OR)"
                     ))
# Did not get to all airports
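Because the hand-written recode() lists above do not cover every airport, one alternative is to join against lookup tables instead of typing each code. The sketch below uses toy tables with invented labels. It assumes that the package's airlines and airports metadata tables use carrier/name and faa/name columns, as in nycflights13, and that should be checked before relying on it.

```r
library(dplyr)

# Toy flights plus toy lookup tables; labels are invented for illustration.
# In the real data these would be `flights`, `airlines`, and `airports`
# (column names carrier/name and faa/name are an assumption).
toy_flights <- tibble(
  carrier = c("DL", "B6", "ZZ"),
  dest    = c("ATL", "BOS", "XXX")
)
toy_airlines <- tibble(
  carrier = c("DL", "B6"),
  name    = c("Delta", "JetBlue")
)
toy_airports <- tibble(
  faa  = c("ATL", "BOS"),
  name = c("Atlanta", "Boston")
)

named <- toy_flights |>
  # Attach the carrier's full name
  left_join(rename(toy_airlines, carrier_name = name), by = "carrier") |>
  # Attach the destination airport's full name
  left_join(rename(toy_airports, dest_name = name), by = c("dest" = "faa")) |>
  # Fall back to the original code when the lookup has no match
  mutate(
    carrier_name = coalesce(carrier_name, carrier),
    dest_name    = coalesce(dest_name, dest)
  )
```

With a join, any code missing from the lookup table keeps its abbreviation, which is also what recode() does for unlisted values.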

Which carrier has the lowest average arrival delay?

# Per-carrier summary: number of flights, mean distance, mean arrival delay
by_carrier <- flights_airport_replaced %>%
  group_by(carrier) %>%
  summarise(count = n(),
            dist = mean(distance),
            delay = mean(arr_delay))

delay_flights_order <- by_carrier %>%
  arrange(delay) |>
  head(20)
delay_flights_order
# A tibble: 14 × 4
   carrier   count  dist   delay
   <chr>     <int> <dbl>   <dbl>
 1 Allegiant   667  723. -5.88  
 2 Republic  85431  485. -4.64  
 3 Endeavor  52204  487. -2.23  
 4 Alaska     7734 2481.  0.0844
 5 Envoy       354  725.  0.119 
 6 Delta     60364 1278.  1.64  
 7 American  39750 1156.  5.27  
 8 Southwest 12048 1024.  5.76  
 9 United    77438 1246.  9.04  
10 Spirit    14769 1085.  9.89  
11 Skywest    6199  628. 13.7   
12 JetBlue   64280 1140. 15.6   
13 Hawaii      362 4983  21.4   
14 Frontier   1218  969. 26.2   
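The ranking above is based on the mean arrival delay, which a few very late flights can pull upward, especially for carriers with small counts. As a hedged sketch of one way to cross-check it, the summary can also report the median delay and the share of flights that arrived late. The data below are invented; the column names median_delay and share_late are hypothetical additions, not part of the original analysis.

```r
library(dplyr)

# Invented delays: carrier A is usually early but has one very late flight
toy <- tibble(
  carrier   = c("A", "A", "A", "A", "B", "B", "B", "B"),
  arr_delay = c(-10, -5, -8, 120, 4, 6, 5, 5)
)

delay_summary <- toy |>
  group_by(carrier) |>
  summarise(
    count        = n(),
    mean_delay   = mean(arr_delay),      # sensitive to outliers
    median_delay = median(arr_delay),    # typical flight
    share_late   = mean(arr_delay > 0)   # fraction arriving after schedule
  ) |>
  arrange(mean_delay)
```

In this toy case carrier A has the higher mean delay even though most of its flights arrive early, so the mean and the median can rank carriers differently.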

Visual

library(treemap)
treemap(delay_flights_order, index = "carrier",
        vSize = "dist", vColor = "delay", type = "manual", palette = "Reds",
        border.col = "white", border.lwds = 7,
        title = "Distance Travelled in Relation to Delay Time",
        title.legend = "Delay")

## Caption: nycflights23 is a pre-built data package that contains information about all flights that departed from the three main New York City airports in 2023, along with metadata on airlines, airports, weather, and planes.

Write a brief paragraph that describes the visualization you have created and at least one aspect of the plot that you would like to highlight.

The visualization I created is a treemap showing the relationship between average distance traveled and average arrival delay for the 14 carriers in the data. The two variables are encoded by size and color. The size of each box represents average distance: the bigger the box, the greater the average distance flown by that carrier. Color represents average arrival delay: darker shades indicate later average arrivals, while lighter shades indicate early or near on-time arrivals. What I found most interesting is that carriers with shorter average distances, such as Allegiant and Republic, tend to arrive early on average. In contrast, the box for Hawaii is the largest and is dark red, indicating that Hawaii had the longest average distance and also one of the latest average arrivals. However, the hypothesis that “the smaller the rectangle, the earlier the carrier arrives” does not hold in general: Frontier's box is relatively small but is the darkest red, which contradicts it and calls for further investigation.
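One rough way to follow up on the size-versus-color hypothesis is a rank correlation between average distance and average delay across carriers. The sketch below retypes the values from the rounded by_carrier printout, so it only approximates the real table. With just 14 carriers, any result should be read as suggestive at most.

```r
library(dplyr)

# Values copied by hand from the rounded by_carrier printout
by_carrier_printed <- tibble(
  carrier = c("Allegiant", "Republic", "Endeavor", "Alaska", "Envoy",
              "Delta", "American", "Southwest", "United", "Spirit",
              "Skywest", "JetBlue", "Hawaii", "Frontier"),
  dist    = c(723, 485, 487, 2481, 725, 1278, 1156,
              1024, 1246, 1085, 628, 1140, 4983, 969),
  delay   = c(-5.88, -4.64, -2.23, 0.0844, 0.119, 1.64, 5.27,
              5.76, 9.04, 9.89, 13.7, 15.6, 21.4, 26.2)
)

# Spearman correlation compares ranks, so Hawaii's very long
# average distance does not dominate the result
rho <- cor(by_carrier_printed$dist, by_carrier_printed$delay,
           method = "spearman")
rho
```

A value near 0 would mean distance says little about delay across carriers, while a value near 1 would support the "bigger box, later arrival" reading.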