# A tibble: 6 × 6
# Groups: year, month, day [1]
   year month   day sched_dep_time dep_delay carrier
  <int> <int> <int>          <int>     <dbl> <chr>
1  2023    11    26           2140       149 UA
2  2023    11    26           2159       288 DL
3  2023    11    26           2103       382 AA
4  2023    11    26            600        15 DL
5  2023    11    26            530        69 AA
6  2023    11    26            630        52 UA
summary(nov26)
    year         month        day         sched_dep_time     dep_delay
Min.   :2023   Min.   :11   Min.   :26   Min.   : 530   Min.   :  11.00
1st Qu.:2023   1st Qu.:11   1st Qu.:26   1st Qu.:1357   1st Qu.:  22.50
Median :2023   Median :11   Median :26   Median :1715   Median :  39.00
Mean   :2023   Mean   :11   Mean   :26   Mean   :1608   Mean   :  72.75
3rd Qu.:2023   3rd Qu.:11   3rd Qu.:26   3rd Qu.:1948   3rd Qu.:  77.50
Max.   :2023   Max.   :11   Max.   :26   Max.   :2215   Max.   :1074.00
  carrier
Length:159
Class :character
Mode  :character
Create a scatter plot of delays by time of day for each carrier
p1 <- nov26 |>
  ggplot(aes(sched_dep_time, dep_delay, color = carrier, shape = carrier, fill = carrier)) +
  labs(
    x = "Scheduled Departure Time",
    y = "Delay (minutes)",
    title = "Flight Delays (more than 10 minutes) on November 26, 2023",
    caption = "Source: https://openflights.org/"
  ) +
  scale_shape_manual(values = c(15, 18, 16, 17)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
p1
`geom_smooth()` using formula = 'y ~ x'
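A note on reading the x axis: values such as 530 and 2140 in `sched_dep_time` look like clock times stored as HHMM integers. That is an assumption based on the printed output, not something the data source confirms here. If it holds, plotting the raw integers leaves a gap between, say, 1759 and 1800. A minimal sketch of one way to convert them to decimal hours follows; the helper name `hhmm_to_hours` is hypothetical.

```r
# Convert an HHMM integer (e.g. 2140 for 21:40) to decimal hours.
# Assumption: the input values are clock times in HHMM form.
hhmm_to_hours <- function(hhmm) {
  hours <- hhmm %/% 100    # integer division keeps the hour part
  minutes <- hhmm %% 100   # remainder is the minute part
  hours + minutes / 60
}

# Example with three of the scheduled departure times shown above
hhmm_to_hours(c(530, 600, 2140))
```

The converted values could be mapped to the x aesthetic in place of `sched_dep_time`, which would make the spacing between points proportional to elapsed time.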
Description of visualization
I decided to investigate flight delays on the Sunday after Thanksgiving (November 26 in 2023), which is often referred to as the busiest air travel day of the year. Specifically, I wanted to know whether flight delays increased throughout the day and whether any carriers experienced more delays than others. I made a scatterplot with the scheduled departure time on the x axis and the delay in minutes on the y axis. The points are grouped by carrier, with each carrier having a distinct color and shape*. I also included a linear trendline for each carrier that matches its color.

When I first made this visualization, I included all carriers and all delayed flights, even those delayed by only one minute. This resulted in an overcrowded graph that was difficult to read. Therefore, I chose to include only delays of more than ten minutes, since these are realistically more disruptive to travel, and only what I considered to be the major airlines. Both choices reduced the quantity of data displayed, so the resulting visualization is more readable.
*Code for changing the shape and color of the data points came from sthda.com.
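The filtering described above (one day, delays over ten minutes, major carriers only) could be sketched with dplyr as follows. This is not the original code. The toy table `flights_toy`, its rows, and the `major_carriers` vector are hypothetical stand-ins, since the source table and the full carrier list are not shown here.

```r
library(dplyr)

# Hypothetical rows standing in for the full flights table
flights_toy <- tibble(
  year = 2023L,
  month = 11L,
  day = c(26L, 26L, 26L, 25L, 26L),
  sched_dep_time = c(2140L, 600L, 915L, 1200L, 1730L),
  dep_delay = c(149, 15, 4, 60, NA),
  carrier = c("UA", "DL", "AA", "UA", "9E")
)

# Assumption: an illustrative list; the carriers actually kept are not all shown
major_carriers <- c("UA", "DL", "AA")

nov26_toy <- flights_toy |>
  filter(
    month == 11, day == 26,       # the Sunday after Thanksgiving 2023
    dep_delay > 10,               # more than ten minutes late; NA delays drop out
    carrier %in% major_carriers   # keep only the chosen carriers
  ) |>
  select(year, month, day, sched_dep_time, dep_delay, carrier)

nov26_toy
```

Keeping the threshold and the carrier list as named values makes it easy to rerun the plot with a different cutoff or set of airlines.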