Assignment5A

Approach

For this assignment i am going to attempt to leverage mySQL and create a database with the data given. I think it will be good practice and good experience. Alot of this assignment will be learning the functions from tidyverse on how to clean up data. I will leverage LLM where needed to possibly explore options from tidyverse and see if there is a list of functions. I find using LLM to describe best practices to problems is also a good way to use the LLM.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Creating the CSV data here by defining each colum first and then each row as presented in the table on the pdf

airlines_data <- "airline,status,Los_Angeles,Phoenix,San_Diego,San_Francisco,Seattle
ALASKA,on time,497,221,212,503,1841
ALASKA,delayed,62,12,20,102,305
AM WEST,on time,694,4840,383,320,201
AM WEST,delayed,117,415,65,129,61"

write.csv(read.csv(text = airlines_data), "airlines.csv", row.names = FALSE)

airlines <- read.csv("airlines.csv")
airlines
  airline  status Los_Angeles Phoenix San_Diego San_Francisco Seattle
1  ALASKA on time         497     221       212           503    1841
2  ALASKA delayed          62      12        20           102     305
3 AM WEST on time         694    4840       383           320     201
4 AM WEST delayed         117     415        65           129      61

The first way we cleaned this data is instead of there being 5 different columns for each city they all fall under 1 category which is destinations and for each location we have a number of flights which is created in a new column. Already the data is far more readable. pivots_longer takes the columns and creates rows

airlines_tidy <- airlines %>%
  pivot_longer(
    cols = c(Los_Angeles, Phoenix, San_Diego, San_Francisco, Seattle),
    names_to = "destination",
    values_to = "flights"
  )
airlines_tidy
# A tibble: 20 × 4
   airline status  destination   flights
   <chr>   <chr>   <chr>           <int>
 1 ALASKA  on time Los_Angeles       497
 2 ALASKA  on time Phoenix           221
 3 ALASKA  on time San_Diego         212
 4 ALASKA  on time San_Francisco     503
 5 ALASKA  on time Seattle          1841
 6 ALASKA  delayed Los_Angeles        62
 7 ALASKA  delayed Phoenix            12
 8 ALASKA  delayed San_Diego          20
 9 ALASKA  delayed San_Francisco     102
10 ALASKA  delayed Seattle           305
11 AM WEST on time Los_Angeles       694
12 AM WEST on time Phoenix          4840
13 AM WEST on time San_Diego         383
14 AM WEST on time San_Francisco     320
15 AM WEST on time Seattle           201
16 AM WEST delayed Los_Angeles       117
17 AM WEST delayed Phoenix           415
18 AM WEST delayed San_Diego          65
19 AM WEST delayed San_Francisco     129
20 AM WEST delayed Seattle            61
airlines_wide <- airlines_tidy %>%
  pivot_wider(
    names_from = status,
    values_from = flights
  ) %>%
  rename(on_time = `on time`) %>%
  mutate(
    total = on_time + delayed,
    delay_rate = round(delayed / total * 100, 1)
  )
airlines_wide
# A tibble: 10 × 6
   airline destination   on_time delayed total delay_rate
   <chr>   <chr>           <int>   <int> <int>      <dbl>
 1 ALASKA  Los_Angeles       497      62   559       11.1
 2 ALASKA  Phoenix           221      12   233        5.2
 3 ALASKA  San_Diego         212      20   232        8.6
 4 ALASKA  San_Francisco     503     102   605       16.9
 5 ALASKA  Seattle          1841     305  2146       14.2
 6 AM WEST Los_Angeles       694     117   811       14.4
 7 AM WEST Phoenix          4840     415  5255        7.9
 8 AM WEST San_Diego         383      65   448       14.5
 9 AM WEST San_Francisco     320     129   449       28.7
10 AM WEST Seattle           201      61   262       23.3

calculating overall delay rate. What important to note here is that Alaska has a higher overall delay rate than AM west with far more flights as well.

overall <- airlines_wide %>%
  group_by(airline) %>%
  summarise(
    total_flights = sum(total),
    total_delayed = sum(delayed),
    overall_delay_rate = round(total_delayed / total_flights * 100, 1)
  )
overall
# A tibble: 2 × 4
  airline total_flights total_delayed overall_delay_rate
  <chr>           <int>         <int>              <dbl>
1 ALASKA           3775           501               13.3
2 AM WEST          7225           787               10.9
ggplot(airlines_wide, aes(x = destination, y = delay_rate, fill = airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Delay Rate by Destination and Airline",
    x = "Destination",
    y = "Delay Rate (%)",
    fill = "Airline"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("steelblue", "tomato"))

Conclusions

After tidying and transforming the data, the analysis reveals an interesting paradox. On the surface, AM WEST appears to have a lower overall delay rate than ALASKA. However, when broken down by destination, ALASKA actually outperforms AM WEST at every single city.