NYC Flights HW Assignment

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
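# Average arrival delay per carrier, ignoring flights with no recorded arrival delay
avg_delays <- flights %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))
# Number of flights per carrier (tally() names the count column n)
num_flights <- flights %>%
  group_by(carrier) %>%
  tally()
# Attach the flight counts to the delay summary
avg_delays <- left_join(avg_delays, num_flights, by = "carrier")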

ggplot(avg_delays, aes(x = carrier, y = avg_delay, size = n)) +
  geom_point(aes(color = avg_delay)) +  # The color represents the average delay
  scale_color_gradient(low = "blue", high = "red") +
  labs(
    title = "Average Flight Delays by Airline",
    x = "Airline Carrier",
    y = "Average Delay (in minutes)",
    subtitle = "Bubble size indicates number of flights"
  ) +
  theme_minimal() +
  guides(
    color = guide_legend(title = "Avg. Delay"),
    size = guide_legend(title = "No. of Flights")
  )

For my visualization, I decided to build on what we learned about scatter plots and create a bubble plot so that I could communicate more information at once. A regular scatter plot shows only the x and y variables, while a bubble plot can encode additional variables through point size and color. The first thing I did was group the data by airline carrier using the dplyr function “group_by”. Next, the “summarize(avg_delay = mean(arr_delay, na.rm = TRUE))” line computed the average arrival delay for each carrier, ignoring flights with a missing arrival delay. I then used “tally” to count each carrier's flights and “left_join” to attach those counts to the averages. I used geom_point to create the scatter plot and mapped point size to the number of flights to turn it into a bubble plot. I chose a color gradient from blue (low delay) to red (high delay) because the two colors contrast starkly, so people can see the differences easily.

The resulting graph shows each airline along the x-axis and its average delay on the y-axis. The size of each point corresponds to the number of flights the airline had during that period. I think a major highlight of this visualization is that carriers with a greater volume of flights do not always have the longest delays, as I had originally thought. The fact that some smaller airlines had longer average delays could offer insight into operational efficiency. If flight volume is not a consistent factor in delayed flights, something else must be responsible, and further data analysis could be used to identify it.
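
The observation about flight volume and delay could also be checked numerically instead of only by eye. The sketch below is one possible way to do that and is not part of the original analysis. It computes the per-carrier average arrival delay and flight count in a single summarize() call, then takes a Spearman rank correlation between the two. The names carrier_summary and volume_delay_cor are made up for illustration, and choosing a rank correlation is an assumption made here because a few very large carriers could dominate a linear one.

```r
library(dplyr)
library(nycflights13)

# One row per carrier: mean arrival delay and number of flights.
# n() counts every flight, including those with a missing arr_delay,
# which matches what tally() returns.
carrier_summary <- flights %>%
  group_by(carrier) %>%
  summarize(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  ) %>%
  arrange(desc(avg_delay))

# Rank correlation between flight volume and average delay (hypothetical check).
# A value near 0 would suggest volume alone does not track delay.
volume_delay_cor <- cor(
  carrier_summary$n,
  carrier_summary$avg_delay,
  method = "spearman"
)

print(carrier_summary, n = Inf)
print(volume_delay_cor)
```

With one row per carrier, this correlation rests on only a handful of points, so it should be read as a rough summary of the plot, not as evidence about what causes delays.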