NYC Flights

Author

Paul D-O

Load the libraries and view the “flights” dataset

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)
data("flights") # load the flights dataset into my global environment.

Examine the data

We can view the data at any time by clicking its table icon in the Environment tab (Grid view).

Alternatively, use the code we learned in the last unit, head(data). Notice the variable names and types.

head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2023     1     1        1           2038       203      328              3
2  2023     1     1       18           2300        78      228            135
3  2023     1     1       31           2344        47      500            426
4  2023     1     1       33           2140       173      238           2352
5  2023     1     1       36           2048       228      223           2252
6  2023     1     1      503            500         3      808            815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Run summary() on the data frame to get the minimum, maximum, mean, median and quartile values for the numeric variables, along with a count of missing values (NA's):

summary(flights)
      year          month             day           dep_time     sched_dep_time
 Min.   :2023   Min.   : 1.000   Min.   : 1.00   Min.   :   1    Min.   : 500  
 1st Qu.:2023   1st Qu.: 3.000   1st Qu.: 8.00   1st Qu.: 931    1st Qu.: 930  
 Median :2023   Median : 6.000   Median :16.00   Median :1357    Median :1359  
 Mean   :2023   Mean   : 6.423   Mean   :15.74   Mean   :1366    Mean   :1364  
 3rd Qu.:2023   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:1804    3rd Qu.:1759  
 Max.   :2023   Max.   :12.000   Max.   :31.00   Max.   :2400    Max.   :2359  
                                                 NA's   :10738                 
   dep_delay          arr_time     sched_arr_time   arr_delay       
 Min.   : -50.00   Min.   :   1    Min.   :   1   Min.   : -97.000  
 1st Qu.:  -6.00   1st Qu.:1105    1st Qu.:1135   1st Qu.: -22.000  
 Median :  -2.00   Median :1519    Median :1551   Median : -10.000  
 Mean   :  13.84   Mean   :1497    Mean   :1552   Mean   :   4.345  
 3rd Qu.:  10.00   3rd Qu.:1946    3rd Qu.:2007   3rd Qu.:   9.000  
 Max.   :1813.00   Max.   :2400    Max.   :2359   Max.   :1812.000  
 NA's   :10738     NA's   :11453                  NA's   :12534     
   carrier              flight         tailnum             origin         
 Length:435352      Min.   :   1.0   Length:435352      Length:435352     
 Class :character   1st Qu.: 364.0   Class :character   Class :character  
 Mode  :character   Median : 734.0   Mode  :character   Mode  :character  
                    Mean   : 785.2                                        
                    3rd Qu.:1188.0                                        
                    Max.   :1972.0                                        
                                                                          
     dest              air_time        distance           hour      
 Length:435352      Min.   : 18.0   Min.   :  80.0   Min.   : 5.00  
 Class :character   1st Qu.: 77.0   1st Qu.: 479.0   1st Qu.: 9.00  
 Mode  :character   Median :121.0   Median : 762.0   Median :13.00  
                    Mean   :141.8   Mean   : 977.5   Mean   :13.35  
                    3rd Qu.:177.0   3rd Qu.:1182.0   3rd Qu.:17.00  
                    Max.   :701.0   Max.   :4983.0   Max.   :23.00  
                    NA's   :12534                                   
     minute        time_hour                     
 Min.   : 0.00   Min.   :2023-01-01 05:00:00.00  
 1st Qu.:10.00   1st Qu.:2023-03-30 20:00:00.00  
 Median :29.00   Median :2023-06-27 08:00:00.00  
 Mean   :28.53   Mean   :2023-06-29 10:02:22.39  
 3rd Qu.:45.00   3rd Qu.:2023-09-27 11:00:00.00  
 Max.   :59.00   Max.   :2023-12-31 23:00:00.00  
                                                 

Use the mutate function to create a new variable called “date”. make_date() returns a Date, which prints as YYYY-MM-DD.

# Create a date column using year, month, and day
flights_month <- flights|>
  mutate(date = make_date(year, month, day))
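
To make the behaviour of make_date() concrete, here is a minimal sketch on a few toy values (the values are made up for illustration and are not taken from the flights data). It shows that the result is a Date, how to get a slash-separated string if that form is really wanted, and how the month can be pulled back out of a date with lubridate's month().

```r
library(lubridate)

# Toy year/month/day values, invented for illustration only
d <- make_date(year = 2023, month = c(1, 6, 12), day = c(1, 15, 31))

class(d)                            # the result is a Date, not a character string
iso     <- format(d)                # default ISO 8601 form, e.g. "2023-01-01"
slashes <- format(d, "%Y/%m/%d")    # slash-separated form, returned as character
months  <- month(d)                 # numeric month extracted from each date
```

Keeping the column as a Date (rather than a formatted string) means it still sorts and plots in calendar order.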

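Use the dplyr function mutate to create a new variable holding the month. Note that the code below copies the existing month column rather than extracting it from date, and because it starts again from flights, the date column created above is not carried forward.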

flights_month <- flights |>
  mutate(flights_month = month)

# View the updated dataset with the new month column
head(flights_month)
# A tibble: 6 × 20
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2023     1     1        1           2038       203      328              3
2  2023     1     1       18           2300        78      228            135
3  2023     1     1       31           2344        47      500            426
4  2023     1     1       33           2140       173      238           2352
5  2023     1     1       36           2048       228      223           2252
6  2023     1     1      503            500         3      808            815
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, flights_month <int>
# Create a vector with month names
month_names <- c("January", "February", "March", "April", "May", "June",
                 "July", "August", "September", "October", "November", "December")

# Convert numeric month to character month
flights_months_name <- flights_month |>
  mutate(month_name = month_names[month])

# View the updated dataset with the new month_name column
head(flights_months_name)
# A tibble: 6 × 21
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2023     1     1        1           2038       203      328              3
2  2023     1     1       18           2300        78      228            135
3  2023     1     1       31           2344        47      500            426
4  2023     1     1       33           2140       173      238           2352
5  2023     1     1       36           2048       228      223           2252
6  2023     1     1      503            500         3      808            815
# ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, flights_month <int>,
#   month_name <chr>
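
As an aside, base R already ships a constant called month.name with the twelve English month names, so the hand-written vector is optional. The sketch below, on toy month numbers (an assumption for illustration), also shows why turning the names into a factor with calendar-ordered levels matters: a plain character column sorts alphabetically, while the factor sorts January to December.

```r
# month.name is a built-in base R constant: "January", ..., "December"
month_num <- c(6L, 1L, 12L, 6L)   # toy month numbers, not from the flights data

# Index the built-in names, then fix the level order to the calendar order
month_chr <- month.name[month_num]
month_fct <- factor(month_chr, levels = month.name)

alphabetical <- sort(month_chr)   # character sort: December comes first
calendar     <- sort(month_fct)   # factor sort: follows the level order
```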

Use the dplyr function group_by to group the data frame by month, then summarize to compute the total number of flights and the average departure delay for each month.

# Group by month_name and summarize
flight_monthly_summary <- flights_months_name |>
  group_by(month_name) |>
  summarize(
    total_flights = n(),
    average_delay = mean(dep_delay, na.rm = TRUE)
  ) |>
  arrange(factor(month_name, levels = month_names))

# View the summarized data
print(flight_monthly_summary)
# A tibble: 12 × 3
   month_name total_flights average_delay
   <chr>              <int>         <dbl>
 1 January            36020         14.0 
 2 February           34761         11.0 
 3 March              39514         13.0 
 4 April              37476         17.7 
 5 May                38710          8.39
 6 June               35921         24.4 
 7 July               36211         30.5 
 8 August             36765         13.5 
 9 September          35505         17.3 
10 October            36586          5.28
11 November           34521          4.40
12 December           33362          8.33
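
The na.rm = TRUE argument matters here because dep_delay has 10,738 missing values according to the summary() output above. A minimal sketch on a toy vector (values invented for illustration) shows what happens without it, and why total_flights from n() still counts rows whose delay is missing.

```r
# Toy delays in minutes; NA stands in for a flight with no recorded departure
delays <- c(10, -5, NA, 25)

mean_with_na    <- mean(delays)                # NA: one missing value poisons the mean
mean_without_na <- mean(delays, na.rm = TRUE)  # mean of the three recorded delays
rows_counted    <- length(delays)              # analogous to n(): counts every row
delays_recorded <- sum(!is.na(delays))         # rows that actually contribute to the mean
```

So average_delay is an average over flights with a recorded departure delay, while total_flights includes every row in the month.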

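Create a dual-axis plot of total flights and average departure delay per month. Because the two measures are on very different scales, the average delay is multiplied by 1000 for drawing and the secondary axis divides by 1000 to label it in minutes.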

# Create a dual-axis plot
# Order the months properly
flight_monthly_summary$month_name <- factor(
  flight_monthly_summary$month_name, 
  levels = c('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December')
)
ggplot(flight_monthly_summary, aes(x = month_name)) +
  geom_line(aes(y = total_flights, group = 1, color = 'Total Flights'), linewidth = 1) +
  geom_line(aes(y = average_delay * 1000, group = 1, color = 'Average Delay'), linewidth = 1) + 
  scale_y_continuous(
    name = "Total Flights",
    sec.axis = sec_axis(~ . / 1000, name = "Average Delay (minutes)")
  ) +
  labs(title = "Total Flights and Average Delay per Month", x = "Month",caption = "FAA Aircraft Registry") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_color_manual(values = c('Total Flights' = '#6C15F7', 'Average Delay' = '#F71515'))
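
The factor of 1000 in the plot code is a hand-picked constant. One possible alternative, sketched here as an assumption rather than as the method used above, is to derive the factor from the data so that the two lines share the same maximum. The monthly values are copied from the summary table; scale_factor is a hypothetical name.

```r
# Monthly values copied from the summary table (January to December)
total_flights <- c(36020, 34761, 39514, 37476, 38710, 35921,
                   36211, 36765, 35505, 36586, 34521, 33362)
average_delay <- c(14.0, 11.0, 13.0, 17.7, 8.39, 24.4,
                   30.5, 13.5, 17.3, 5.28, 4.40, 8.33)

# Hypothetical data-driven replacement for the fixed factor of 1000
scale_factor <- max(total_flights) / max(average_delay)

# What geom_line() would draw on the primary (flights) axis
delay_on_primary <- average_delay * scale_factor

# What sec_axis(~ . / scale_factor) would show on the secondary axis
delay_recovered <- delay_on_primary / scale_factor
```

Whatever factor is used, the multiplication in geom_line() and the division in sec_axis() have to be exact inverses, otherwise the secondary axis labels no longer match the plotted delays.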

Summary of Visualization.

The visualization is a dual-axis geom_line plot showing the total number of flights out of NYC airports and the average departure delay for each month of 2023. It captures the seasonal fluctuations in both metrics. Flight volume peaks in spring, with 39,514 flights in March and 38,710 in May, and is lowest in December (33,362); the summer months sit near the middle of the range at roughly 36,000 flights each. Average departure delay, by contrast, is far from stable: it climbs to 24.4 minutes in June and 30.5 minutes in July, the highest values of the year, and falls to its lowest in October (5.28 minutes) and November (4.40 minutes). The winter months are moderate, ranging from 8.33 minutes in December to 14.0 minutes in January. Delays therefore do not simply track flight volume, since March, the busiest month, has an average delay of only 13.0 minutes. The summer peak may reflect factors such as thunderstorms or vacation-season congestion, but this summary alone cannot confirm the cause. The plot highlights the importance of seasonal considerations in airport management and gives a clear picture of how flight volume and delays vary through the year.