NYC Flights HW

Author

Erin Morrison

Load Libraries & Data

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)
Warning: package 'nycflights23' was built under R version 4.4.2
data(flights)

Prep Data

# show summary to determine how to divide distance variable
summary(flights$distance)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   80.0   479.0   762.0   977.5  1182.0  4983.0 
# add a categorical distance variable 
flights2 <- flights |>
  mutate(dist_cat = case_when(
    (distance <= 479.0) ~ "short", 
    (distance <= 762.0) ~ "medium short",
    (distance <= 1182.0) ~ "medium long",
    (distance > 1182.0) ~ "long"))

# orders dist_cat
flights2$dist_cat <- factor(flights2$dist_cat,
    levels = c("short", "medium short", "medium long", "long"), ordered = TRUE)

# add average speed: distance / air_time (miles per minute)
flights2 <- flights2 |>
  mutate(ratio = distance / air_time)
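
The quartile cutoffs in the `case_when()` call above were typed in by hand from the `summary()` output. As a minimal sketch, the same four ordered categories could instead be derived from the data with base R's `quantile()` and `cut()`. The `distance` vector below is toy data invented for illustration, not values taken from `flights`.

```r
# Toy distances in miles (illustrative values only, not from the flights data)
distance <- c(80, 200, 479, 500, 762, 900, 1182, 2500, 4983)

# Quartile break points, computed instead of copied from summary()
breaks <- quantile(distance, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)

# Right-closed intervals, so a value equal to a cutoff falls in the lower
# category, like the `<=` comparisons in case_when()
dist_cat <- cut(distance,
  breaks = breaks,
  labels = c("short", "medium short", "medium long", "long"),
  include.lowest = TRUE,
  ordered_result = TRUE
)

print(data.frame(distance, dist_cat))
```

In the report this would presumably sit inside `mutate()`, with `distance` referring to the column. One caveat: `cut()` needs unique break points, so this only works when the quartiles of the data are distinct.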

Find the best and worst carriers in each category

# used to compare carriers within categories NOT FINAL VISUALIZATION
flights2 |> 
  ggplot(aes(carrier, ratio)) +
  geom_boxplot() +
  facet_grid(~dist_cat) + 
  theme(axis.text.x = element_text(angle = 90))
Warning: Removed 12534 rows containing non-finite outside the scale range
(`stat_boxplot()`).

# used to confirm that each carrier really was the best or worst in its category
# (rather than guessing from the box plot alone, I reused this template to check the close calls in every category)
flights2 |> 
  filter(carrier == "DL" & dist_cat == "short") |>
  summary()
      year          month             day           dep_time    sched_dep_time
 Min.   :2023   Min.   : 1.000   Min.   : 1.00   Min.   :   3   Min.   : 600  
 1st Qu.:2023   1st Qu.: 3.000   1st Qu.: 8.00   1st Qu.:1347   1st Qu.:1512  
 Median :2023   Median : 5.000   Median :16.00   Median :2024   Median :2000  
 Mean   :2023   Mean   : 5.612   Mean   :15.77   Mean   :1752   Mean   :1774  
 3rd Qu.:2023   3rd Qu.: 8.000   3rd Qu.:23.00   3rd Qu.:2155   3rd Qu.:2159  
 Max.   :2023   Max.   :12.000   Max.   :31.00   Max.   :2358   Max.   :2256  
                                                 NA's   :27                   
   dep_delay         arr_time    sched_arr_time   arr_delay     
 Min.   :-17.00   Min.   :   1   Min.   :  27   Min.   :-51.00  
 1st Qu.: -4.00   1st Qu.:1337   1st Qu.:1616   1st Qu.:-15.00  
 Median : -1.00   Median :2012   Median :2119   Median : -4.00  
 Mean   : 25.57   Mean   :1701   Mean   :1914   Mean   : 18.42  
 3rd Qu.: 24.00   3rd Qu.:2254   3rd Qu.:2308   3rd Qu.: 23.00  
 Max.   :975.00   Max.   :2400   Max.   :2359   Max.   :984.00  
 NA's   :27       NA's   :29                    NA's   :31      
   carrier              flight         tailnum             origin         
 Length:1693        Min.   :  96.0   Length:1693        Length:1693       
 Class :character   1st Qu.: 409.0   Class :character   Class :character  
 Mode  :character   Median : 610.0   Mode  :character   Mode  :character  
                    Mean   : 610.4                                        
                    3rd Qu.: 807.0                                        
                    Max.   :1087.0                                        
                                                                          
     dest              air_time         distance          hour      
 Length:1693        Min.   : 30.00   Min.   :184.0   Min.   : 6.00  
 Class :character   1st Qu.: 37.00   1st Qu.:184.0   1st Qu.:15.00  
 Mode  :character   Median : 41.00   Median :187.0   Median :20.00  
                    Mean   : 43.06   Mean   :212.9   Mean   :17.47  
                    3rd Qu.: 44.75   3rd Qu.:187.0   3rd Qu.:21.00  
                    Max.   :164.00   Max.   :431.0   Max.   :22.00  
                    NA's   :31                                      
     minute        time_hour                              dist_cat   
 Min.   : 0.00   Min.   :2023-01-01 14:00:00.00   short       :1693  
 1st Qu.: 0.00   1st Qu.:2023-03-03 13:00:00.00   medium short:   0  
 Median :29.00   Median :2023-05-19 21:00:00.00   medium long :   0  
 Mean   :27.66   Mean   :2023-06-05 00:08:40.97   long        :   0  
 3rd Qu.:55.00   3rd Qu.:2023-08-27 21:00:00.00                      
 Max.   :59.00   Max.   :2023-12-31 20:00:00.00                      
                                                                     
     ratio      
 Min.   :1.207  
 1st Qu.:4.452  
 Median :4.842  
 Mean   :4.941  
 3rd Qu.:5.412  
 Max.   :7.167  
 NA's   :31     
# makes a data set with only the best and worst carriers in each category
bw <- flights2 |>
  filter(((carrier == "HA" | carrier == "YX") & dist_cat == "long") | 
           ((carrier == "OO" | carrier == "G4") & dist_cat == "medium long") |
           ((carrier == "OO" | carrier == "F9") & dist_cat == "medium short") |
           ((carrier == "NK" | carrier == "DL") & dist_cat == "short"))
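
The carrier-by-carrier `summary()` check above could also be done for every category in one pass. The sketch below assumes that median speed is the ranking criterion, which the report does not state explicitly. The `toy` tibble and its carrier codes are made up for illustration and stand in for `flights2`.

```r
library(dplyr)

# Toy stand-in for flights2 (made-up values, for illustration only)
toy <- tibble(
  carrier  = c("AA", "AA", "BB", "BB", "CC", "CC", "AA", "BB"),
  dist_cat = c("short", "short", "short", "short", "short", "short", "long", "long"),
  ratio    = c(4.0, 5.0, 6.0, 7.0, 2.0, NA, 8.0, 9.0)
)

speed_rank <- toy |>
  # median speed per carrier within each distance category, ignoring NA ratios
  group_by(dist_cat, carrier) |>
  summarise(median_ratio = median(ratio, na.rm = TRUE), n = n(), .groups = "drop") |>
  # keep only the fastest and slowest carrier in each category
  group_by(dist_cat) |>
  filter(median_ratio == max(median_ratio) | median_ratio == min(median_ratio)) |>
  ungroup() |>
  arrange(dist_cat, median_ratio)

print(speed_rank)
```

Keeping the flight count `n` in the result makes it easier to spot carriers whose ranking rests on only a handful of flights.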

Final Visualization

flights2 |>
  ggplot(aes(dist_cat, ratio)) +
  geom_jitter(data = bw, aes(dist_cat, ratio, color = carrier), alpha = 0.2) +
  geom_boxplot(alpha = 0.5) + 
  labs(y = "Average Speed (miles/min)",
       x = "Distance (miles)",
    title = "NY Flight Speed by Distance 2023",
    caption = "The fastest and slowest airlines are plotted for each distance category\nSource: FAA Aircraft registry",
    color = "Airline Carrier") + 
  scale_colour_discrete(labels = c("HA" = "Hawaiian Airlines", "YX" = "Republic Airlines",
                                   "OO" = "SkyWest Airlines", "G4" = "Allegiant Air LLC",
                                   "F9" = "Frontier Airlines", "NK" = "Spirit Airlines",
                                   "DL" = "Delta Air Lines")) +
  theme_light() +
  scale_x_discrete(labels = c("short" = "80.0 - 479.0", "medium short" = "479.1 - 762.0", 
                              "medium long" = "762.1 - 1182.0", "long" = "1182.1 - 4983.0"))
Warning: Removed 12534 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 330 rows containing missing values or values outside the scale range
(`geom_point()`).

Reflection

I created side-by-side boxplots separated by flight distance in miles. Because the categories are split at the quartiles of distance, each box covers roughly 25% of all flights. The y-axis shows each flight's average speed in miles per minute, which I calculated as distance divided by air time. Each box also has the "best" and "worst" airline carrier for that category plotted as points with geom_jitter. The "best" airlines have the fastest calculated speed, while the "worst" have the slowest. If I were to improve this plot, I would make the legend keys fully opaque so that the colors are easier for viewers to see.

I want to highlight that SkyWest Airlines came out slowest in two categories and did not perform well in the other two, even though that is not shown on the graph. I would also like to point out the general upward trend in speed as flights get longer. That is why I split the boxes into categories: I don't think it would be fair to compare airlines that fly mostly short routes with those that fly mostly long ones. For example, Hawaiian Airlines flies only long routes.
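
The legend improvement mentioned in the reflection can be sketched with ggplot2's `guide_legend(override.aes = ...)`, which changes how the legend keys are drawn without touching the plotted points. The `toy` data frame and its carrier codes are invented for illustration. In the report the jittered points come from `bw`.

```r
library(ggplot2)

# Toy data invented for illustration only
toy <- data.frame(
  dist_cat = rep(c("short", "long"), each = 4),
  ratio    = c(4.2, 4.8, 5.1, 5.5, 7.9, 8.3, 8.6, 9.0),
  carrier  = rep(c("AA", "BB"), times = 4)
)

p <- ggplot(toy, aes(dist_cat, ratio, color = carrier)) +
  # points stay semi-transparent, as in the final visualization
  geom_jitter(alpha = 0.2, width = 0.2) +
  # legend keys are drawn fully opaque so the colors are easy to read
  guides(color = guide_legend(override.aes = list(alpha = 1)))

print(p)
```

Applied to the final visualization, only the `guides()` line would need to be added to the existing chain of layers.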