DATA 110 HW 5

Author

EHiggs

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)

library(stringr)

data(flights)
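# Keep only flights with a known origin, arrival delay, and carrier
flights_filter <- flights |>
  filter(!is.na(origin), !is.na(arr_delay), !is.na(carrier))
# Replace the carrier codes with full airline names
noAcronym <- left_join(flights_filter, airlines, by = "carrier")
# Drop the "Inc." and "Co." suffixes, then trim the leftover trailing space
noAcronym$name <- trimws(gsub("Inc\\.|Co\\.", "", noAcronym$name))
# American Airlines is filtered out: its late value was so high that it made the early section look too small
noAirline <- noAcronym |>
  filter(!str_detect(name, "American Airlines"))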
library(ggplot2)

noAirline$deviation <- noAirline$arr_delay # Copy of arr_delay under a name without underscores, which was easier to retype while debugging

ggplot(noAirline, aes(x = name, y = deviation)) +
  geom_col(aes(fill = deviation > 0), position = position_dodge()) +
  scale_fill_manual(values = c("TRUE" = "lightblue", "FALSE" = "lightpink"),
                    labels = c("FALSE" = "Early", "TRUE" = "Late")) + # Not sure if this command is used in class, but I saw it in "The Book of R" textbook.
  theme_minimal() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "darkgrey") +
  scale_y_continuous(limits = c(-300, 1450)) +  
  coord_flip() +
  labs(fill = "Tardiness", x = "Airline Carrier", y="Typical Arrival Time \n Early                                                                                              Late", caption = "Data from NYC Flights") # I couldn't find a way to line the labels up with both sides of the chart, so the spacing is hard-coded
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_col()`).

I made a bar graph that shows the spread of arrival delays for each airline, that is, how early or late its flights tend to be. I would like to highlight that negative values get their own color: an early arrival is drawn in a separate bar from a late one. It is interesting to see just how late some flights are; for example, SkyWest Airlines runs much later than Envoy Air. Creating two different bars for the positive and negative values was quite a challenge, and I ran into a lot of errors before using dplyr commands to create an entirely new dataset derived from the original.
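
One way the early/late split could be made explicit with dplyr is sketched below. This is not the code that produced the chart above. It assumes a small made-up data frame (`toy`, with hypothetical airline names and delays) in place of the joined flights data, and it summarises each airline's mean early and mean late arrival before plotting, so every airline gets exactly one bar on each side of zero.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the joined flights data (values are invented)
toy <- data.frame(
  name      = c("Airline A", "Airline A", "Airline A",
                "Airline B", "Airline B", "Airline B"),
  arr_delay = c(-10, -20, 30, -5, 15, 45)
)

# One row per airline and direction: the mean delay of its late flights
# and the mean (negative) delay of its early flights
delay_summary <- toy |>
  filter(arr_delay != 0) |>
  mutate(direction = if_else(arr_delay > 0, "Late", "Early")) |>
  group_by(name, direction) |>
  summarise(mean_delay = mean(arr_delay), .groups = "drop")

# Negative means extend left of zero and positive means extend right
p <- ggplot(delay_summary, aes(x = name, y = mean_delay, fill = direction)) +
  geom_col() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "darkgrey") +
  scale_fill_manual(values = c(Early = "lightpink", Late = "lightblue")) +
  coord_flip() +
  theme_minimal() +
  labs(fill = "Tardiness", x = "Airline Carrier", y = "Mean arrival delay (minutes)")

print(p)
```

Summarising first means no single extreme flight sets the length of a bar, which may reduce the need to drop an airline or clip the axis. Whether the mean is the right summary here is a judgment call; the median would be a reasonable alternative.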