NYFlights

Author

Viktoriia Lyon

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)
help(package="nycflights23")
names(flights)
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"     
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2023     1     1        1           2038       203      328              3
2  2023     1     1       18           2300        78      228            135
3  2023     1     1       31           2344        47      500            426
4  2023     1     1       33           2140       173      238           2352
5  2023     1     1       36           2048       228      223           2252
6  2023     1     1      503            500         3      808            815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# Filter the dataset to focus on reasonable delays (between -20 and 200 minutes)
flights_clean <- flights %>%
    filter(dep_delay > -20, dep_delay < 200)
# Histogram of departure delays, filled by whether the flight left early (dep_delay < 0)
ggplot(flights_clean, aes(x = dep_delay, fill = dep_delay < 0)) +
    geom_histogram(binwidth = 5, color = "black") +
    scale_fill_manual(values = c("#AEDFF7", "#90EE90"), name = "Early Departure") +  # FALSE = light blue, TRUE = light green
    labs(title = "Distribution of Departure Delays (Under 200 Minutes, 5-Minute Bins)",
         x = "Departure Delay (minutes)",
         y = "Number of Flights",
         caption = "Source: nycflights23 dataset") +
    theme_minimal()

Explanation:

The histogram shows the distribution of departure delays for flights from NYC airports in 2023, restricted to delays between -20 and 200 minutes, with each bar covering a 5-minute range. The x-axis shows the departure delay in minutes, and the y-axis shows the number of flights. Early departures (negative delays) are drawn in light green, while flights that left on time or late are drawn in light blue. The dataset was filtered with dplyr's filter() to keep delays strictly between -20 and 200 minutes, which excludes extreme outliers and also drops flights with a missing departure delay.

The bar around 0 is partly green and partly blue because, with 5-minute bins, flights that depart exactly on time (0 minutes delay) fall into the same bin as flights that depart slightly early or slightly late. Most flights cluster around 0 minutes, which indicates that most flights depart very close to their scheduled time. Beyond a delay of about 50 minutes the number of flights drops off sharply, and very few flights are delayed by more than 100 minutes.
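
The two-colored bar at 0 can be checked by reproducing the binning by hand. The sketch below is only an illustration: it uses a small made-up vector of delays (not values from nycflights23), and it assumes ggplot2's default bin alignment, where bins are centered on multiples of binwidth when neither boundary nor center is set. The names toy, toy_binned, bin_center and zero_bin are hypothetical helpers introduced here.

```r
library(dplyr)

# Hypothetical toy delays in minutes; these are NOT rows from nycflights23
toy <- tibble(dep_delay = c(-7, -2, -1, 0, 1, 2, 3, 8, NA))

binwidth <- 5

toy_binned <- toy %>%
  # Same filter as in the report; rows with NA dep_delay are dropped as well
  filter(dep_delay > -20, dep_delay < 200) %>%
  mutate(
    # Assumed default alignment: bins centered on multiples of binwidth,
    # so the bin around 0 holds whole-minute delays from -2 to 2
    bin_center = round(dep_delay / binwidth) * binwidth,
    early = dep_delay < 0
  )

# Count early vs. not-early flights inside the bin centered on 0
zero_bin <- toy_binned %>%
  filter(bin_center == 0) %>%
  count(early)

print(zero_bin)
```

With these toy values the bin centered on 0 holds two early flights (-2 and -1) and three flights that are on time or slightly late (0, 1 and 2), which is why a single bar can carry both fill colors. Running the same mutate() and count() steps on flights_clean would give the actual split for the real data.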