NYC Flights Homework

Author

David Burkart

Load the libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)
Warning: package 'nycflights23' was built under R version 4.4.2

Create a dataframe and investigate the data

flights_df <- flights
view(flights_df)
summary(flights_df)
      year          month             day           dep_time     sched_dep_time
 Min.   :2023   Min.   : 1.000   Min.   : 1.00   Min.   :   1    Min.   : 500  
 1st Qu.:2023   1st Qu.: 3.000   1st Qu.: 8.00   1st Qu.: 931    1st Qu.: 930  
 Median :2023   Median : 6.000   Median :16.00   Median :1357    Median :1359  
 Mean   :2023   Mean   : 6.423   Mean   :15.74   Mean   :1366    Mean   :1364  
 3rd Qu.:2023   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:1804    3rd Qu.:1759  
 Max.   :2023   Max.   :12.000   Max.   :31.00   Max.   :2400    Max.   :2359  
                                                 NA's   :10738                 
   dep_delay          arr_time     sched_arr_time   arr_delay       
 Min.   : -50.00   Min.   :   1    Min.   :   1   Min.   : -97.000  
 1st Qu.:  -6.00   1st Qu.:1105    1st Qu.:1135   1st Qu.: -22.000  
 Median :  -2.00   Median :1519    Median :1551   Median : -10.000  
 Mean   :  13.84   Mean   :1497    Mean   :1552   Mean   :   4.345  
 3rd Qu.:  10.00   3rd Qu.:1946    3rd Qu.:2007   3rd Qu.:   9.000  
 Max.   :1813.00   Max.   :2400    Max.   :2359   Max.   :1812.000  
 NA's   :10738     NA's   :11453                  NA's   :12534     
   carrier              flight         tailnum             origin         
 Length:435352      Min.   :   1.0   Length:435352      Length:435352     
 Class :character   1st Qu.: 364.0   Class :character   Class :character  
 Mode  :character   Median : 734.0   Mode  :character   Mode  :character  
                    Mean   : 785.2                                        
                    3rd Qu.:1188.0                                        
                    Max.   :1972.0                                        
                                                                          
     dest              air_time        distance           hour      
 Length:435352      Min.   : 18.0   Min.   :  80.0   Min.   : 5.00  
 Class :character   1st Qu.: 77.0   1st Qu.: 479.0   1st Qu.: 9.00  
 Mode  :character   Median :121.0   Median : 762.0   Median :13.00  
                    Mean   :141.8   Mean   : 977.5   Mean   :13.35  
                    3rd Qu.:177.0   3rd Qu.:1182.0   3rd Qu.:17.00  
                    Max.   :701.0   Max.   :4983.0   Max.   :23.00  
                    NA's   :12534                                   
     minute        time_hour                     
 Min.   : 0.00   Min.   :2023-01-01 05:00:00.00  
 1st Qu.:10.00   1st Qu.:2023-03-30 20:00:00.00  
 Median :29.00   Median :2023-06-27 08:00:00.00  
 Mean   :28.53   Mean   :2023-06-29 10:02:22.39  
 3rd Qu.:45.00   3rd Qu.:2023-09-27 11:00:00.00  
 Max.   :59.00   Max.   :2023-12-31 23:00:00.00  
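
The summary above reports missing values (NA's) in several columns, including dep_delay. A comparison such as dep_delay > 10 inside filter() drops those rows without warning, so it can help to count them first. The following is a minimal sketch on a made-up data frame; `toy_flights` and its values are assumptions for illustration, not part of the real data.

```r
# Toy stand-in for flights_df; the values are made up for illustration.
toy_flights <- data.frame(
  dep_delay = c(-3, 15, NA, 42, NA),
  arr_delay = c(-10, 20, NA, NA, NA)
)

# Count missing values in each column.
na_counts <- colSums(is.na(toy_flights))
print(na_counts)

# Share of rows with no recorded departure delay.
na_share <- mean(is.na(toy_flights$dep_delay))
print(na_share)
```

Running colSums(is.na(flights_df)) on the real data should give the same NA counts that summary() prints.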
                                                 

Remove any unneeded data, then check to make sure it was done correctly

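# Keep only the columns needed for the plot, then narrow to Nov 26 flights
# from the four major carriers that departed more than 10 minutes late.
nov26 <- flights_df |>
  select(year, month, day, sched_dep_time, dep_delay, carrier) |>
  filter(
    month == 11, day == 26,
    dep_delay > 10,
    carrier %in% c("AA", "UA", "WN", "DL")
  ) |>
  group_by(year, month, day)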
head(nov26)
# A tibble: 6 × 6
# Groups:   year, month, day [1]
   year month   day sched_dep_time dep_delay carrier
  <int> <int> <int>          <int>     <dbl> <chr>  
1  2023    11    26           2140       149 UA     
2  2023    11    26           2159       288 DL     
3  2023    11    26           2103       382 AA     
4  2023    11    26            600        15 DL     
5  2023    11    26            530        69 AA     
6  2023    11    26            630        52 UA     
summary(nov26)
      year          month         day     sched_dep_time   dep_delay      
 Min.   :2023   Min.   :11   Min.   :26   Min.   : 530   Min.   :  11.00  
 1st Qu.:2023   1st Qu.:11   1st Qu.:26   1st Qu.:1357   1st Qu.:  22.50  
 Median :2023   Median :11   Median :26   Median :1715   Median :  39.00  
 Mean   :2023   Mean   :11   Mean   :26   Mean   :1608   Mean   :  72.75  
 3rd Qu.:2023   3rd Qu.:11   3rd Qu.:26   3rd Qu.:1948   3rd Qu.:  77.50  
 Max.   :2023   Max.   :11   Max.   :26   Max.   :2215   Max.   :1074.00  
   carrier         
 Length:159        
 Class :character  
 Mode  :character  
                   
                   
                   

Rename carrier codes to full names for clarity

nov26$carrier <- gsub("AA", "American Airlines", nov26$carrier)
nov26$carrier <- gsub("DL", "Delta Air Lines", nov26$carrier)
nov26$carrier <- gsub("UA", "United Airlines", nov26$carrier)
nov26$carrier <- gsub("WN", "Southwest Airlines", nov26$carrier)
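
gsub() replaces a pattern wherever it occurs inside a string, so the chained calls work here only because none of the full names happens to contain another carrier's two-letter code. A named lookup vector is one alternative that maps each whole code exactly once. This is a minimal sketch in base R; `carrier_names` and `codes` are illustrative names, not objects from the analysis above.

```r
# Lookup table from carrier code to display name.
carrier_names <- c(
  AA = "American Airlines",
  DL = "Delta Air Lines",
  UA = "United Airlines",
  WN = "Southwest Airlines"
)

# Hypothetical carrier codes standing in for nov26$carrier.
codes <- c("UA", "DL", "AA", "WN", "DL")

# Index the lookup by code; unname() drops the codes from the result.
full_names <- unname(carrier_names[codes])
print(full_names)
```

Applied to the real data, the same idea would read nov26$carrier <- unname(carrier_names[nov26$carrier]), run once on the original two-letter codes.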

Create scatter plot with delays by time of day for each carrier

# Plot delay against scheduled departure time, with one color, shape,
# and linear trendline per carrier.
p1 <- nov26 |>
  ggplot(aes(sched_dep_time, dep_delay, color = carrier, shape = carrier, fill = carrier)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_shape_manual(values = c(15, 18, 16, 17)) +
  labs(
    x = "Scheduled Departure Time", y = "Delay (minutes)",
    title = "Flight Delays (more than 10 minutes) on November 26, 2023",
    caption = "Source: https://openflights.org/"
  )
p1
`geom_smooth()` using formula = 'y ~ x'
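
One detail worth noting about the x axis: sched_dep_time is stored as an integer in HHMM form (the summary shows values from 500 to 2359), so 10:59 and 11:00 sit 41 units apart on the axis even though they are one minute apart. Converting to decimal hours spaces the points evenly in time. The helper below is a hypothetical sketch, not part of the original analysis.

```r
# Convert an HHMM integer (e.g. 1059 for 10:59) to decimal hours.
hhmm_to_hours <- function(hhmm) {
  hhmm %/% 100 + (hhmm %% 100) / 60
}

# Hypothetical scheduled departure times in the same HHMM format.
times <- c(530, 1059, 1100, 2140)
hours <- hhmm_to_hours(times)
print(hours)
```

In the plot, this would mean mapping x to hhmm_to_hours(sched_dep_time) instead of sched_dep_time. The flights table also has hour and minute columns (visible in the summary output), so hour + minute / 60 is likely an equivalent option if those columns are kept in select().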

Description of visualization

I decided to investigate flight delays on what is often called the busiest air travel day of the year: the Sunday after Thanksgiving, which fell on November 26 in 2023. Specifically, I wanted to know whether delays grew longer as the day went on and whether some carriers experienced more delays than others.

I made a scatterplot with scheduled departure time on the x axis and departure delay in minutes on the y axis. Each carrier's data points have a distinct color and shape*, and each carrier has a linear trendline in the matching color.

When I first made this visualization, I included all carriers and all delayed flights, even those delayed by only 1 minute. The resulting graph was overcrowded and difficult to read. I therefore kept only delays of more than ten minutes, since these are realistically more disruptive to travel, and only the four airlines I considered major: American, Delta, United, and Southwest. Both choices reduce the amount of data displayed and make the visualization more readable.

* Code for changing the shape and color of the data points came from sthda.com.
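
The question of whether some carriers experienced more delays than others can also be checked numerically rather than read off the plot. The following is a minimal sketch with dplyr on a made-up table; `toy_nov26` and its delay values are assumptions for illustration, and `carrier_summary` is a hypothetical name.

```r
library(dplyr)

# Toy stand-in for nov26; the delays below are made up for illustration.
toy_nov26 <- tibble(
  carrier   = c("United Airlines", "United Airlines", "Delta Air Lines",
                "Delta Air Lines", "American Airlines"),
  dep_delay = c(149, 52, 288, 15, 69)
)

# Count delayed flights and take the median delay for each carrier,
# then sort so the carrier with the longest typical delay comes first.
carrier_summary <- toy_nov26 |>
  group_by(carrier) |>
  summarise(
    n_delayed    = n(),
    median_delay = median(dep_delay),
    .groups = "drop"
  ) |>
  arrange(desc(median_delay))

print(carrier_summary)
```

Because nov26 contains only flights delayed by more than ten minutes, n_delayed is a raw count, not a delay rate. Comparing carriers fairly would also need each carrier's total number of flights that day.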