NYCFlights13 Data Visualization

Author

Michael Desir

import libraries and data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
library(dplyr)
library(RColorBrewer)

data organization and filtering

by_tailnum <- flights |>
  filter(!is.na(distance) & !is.na(arr_delay)) |>  # keep flights with a recorded distance and arrival delay
  group_by(tailnum) |>  # one group per tail number (aircraft)
  summarise(total_flights = n(),   # number of flights per aircraft
            mean_air_time = mean(air_time), # mean air time (minutes)
            mean_distance = mean(distance), # mean distance (miles)
            mean_delay = mean(dep_delay) # mean departure delay (minutes)
            )
head(by_tailnum)
# A tibble: 6 × 5
  tailnum total_flights mean_air_time mean_distance mean_delay
  <chr>           <int>         <dbl>         <dbl>      <dbl>
1 D942DN              4         135.           854.      31.5 
2 N0EGMQ            352         104.           679.       8.51
3 N10156            145         115.           756.      18.0 
4 N102UW             48          82.8          536.       8   
5 N103US             46          83.3          535.      -3.20
6 N104UW             46          81.3          535.      10.1 
delay <- filter(by_tailnum, mean_delay > 0, total_flights > 20, mean_distance > 1600)

create plot

ggplot(delay, aes(mean_air_time, mean_delay)) +
  geom_point(aes(size = total_flights, colour = mean_distance), alpha = .3) +
  geom_smooth() +
  scale_size_area() +
  theme_bw() +
  labs(x = "Average Flight Time (minutes)",
       y = "Average Departure Delay (minutes)",
       caption = "Source: RITA, Bureau of Transportation Statistics (nycflights13)",
       title = "Average Flight Time and Average Departure Delays | Flights from NY")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
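
The message above is ggplot2 reporting the defaults it chose for the smoother. A minimal sketch of the same plot with the method and formula written out, which should silence the message and make the choice visible to readers; the object name `p` is only for illustration, and the `delay` table is rebuilt here so the block runs on its own.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Rebuild the filtered per-aircraft summary used for the plot
delay <- flights |>
  filter(!is.na(distance), !is.na(arr_delay)) |>
  group_by(tailnum) |>
  summarise(total_flights = n(),
            mean_air_time = mean(air_time),
            mean_distance = mean(distance),
            mean_delay = mean(dep_delay)) |>
  filter(mean_delay > 0, total_flights > 20, mean_distance > 1600)

# Same layers as before, with the smoother's defaults stated explicitly
p <- ggplot(delay, aes(mean_air_time, mean_delay)) +
  geom_point(aes(size = total_flights, colour = mean_distance), alpha = .3) +
  geom_smooth(method = "loess", formula = y ~ x) +
  scale_size_area() +
  theme_bw()

p
```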

This dataset is big enough to be useful, but I wish it included more. For example, grouping by tail number is OK, but grouping by anonymized pilot IDs would’ve been much more useful. In the future, it might be interesting to compare arrival delays and departure delays for the same tail number, and see whether pilots try to speed up their flights when they’re running late.

When looking at my data visualization, I noticed that average flight times tend to cluster around 230 minutes and 320 minutes. Outliers aside, average departure delays tended to be under 15 minutes.

What I found odd was how few long-distance flights there were. 320 minutes is only five and a third hours, which wouldn’t even get you to London. Because the plot only includes aircraft with a positive average departure delay, this means either that New York had very few international flights or, more interestingly, that long-distance flights tended to be on time more often. That would be an interesting analysis to make.
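
The two follow-up questions raised above could be explored with something like the sketch below. It is only a starting point under stated assumptions: the column `made_up` (departure delay minus arrival delay) and the table names `catch_up`, `by_lateness`, and `long_flights` are introduced here for illustration, and the 320-minute cutoff is simply the value mentioned in the discussion.

```r
library(dplyr)
library(nycflights13)

# Flights with both delays recorded and a known tail number
complete <- flights |>
  filter(!is.na(dep_delay), !is.na(arr_delay), !is.na(tailnum)) |>
  mutate(made_up = dep_delay - arr_delay)  # > 0: arrived less late than it left

# Per-aircraft comparison of departure and arrival delays
catch_up <- complete |>
  group_by(tailnum) |>
  summarise(total_flights  = n(),
            mean_dep_delay = mean(dep_delay),
            mean_arr_delay = mean(arr_delay),
            mean_made_up   = mean(made_up)) |>
  filter(total_flights > 20)

# Do flights that leave late make up more time than the rest?
by_lateness <- complete |>
  mutate(left_late = dep_delay > 0) |>
  group_by(left_late) |>
  summarise(flights = n(), mean_made_up = mean(made_up))

# How many flights exceed 320 minutes of air time, and where do they go?
long_flights <- complete |>
  filter(air_time > 320) |>
  count(dest, sort = TRUE)

catch_up
by_lateness
long_flights
```

A difference in `mean_made_up` between the two `left_late` groups would not by itself show that pilots speed up, since scheduled block times include padding, but it would indicate whether the question is worth a closer look.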