For this assignment i am going to attempt to leverage mySQL and create a database with the data given. I think it will be good practice and good experience. Alot of this assignment will be learning the functions from tidyverse on how to clean up data. I will leverage LLM where needed to possibly explore options from tidyverse and see if there is a list of functions. I find using LLM to describe best practices to problems is also a good way to use the LLM.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Creating the CSV data here by defining each colum first and then each row as presented in the table on the pdf
airline status Los_Angeles Phoenix San_Diego San_Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
3 AM WEST on time 694 4840 383 320 201
4 AM WEST delayed 117 415 65 129 61
The first way we cleaned this data is instead of there being 5 different columns for each city they all fall under 1 category which is destinations and for each location we have a number of flights which is created in a new column. Already the data is far more readable. pivots_longer takes the columns and creates rows
# A tibble: 20 × 4
airline status destination flights
<chr> <chr> <chr> <int>
1 ALASKA on time Los_Angeles 497
2 ALASKA on time Phoenix 221
3 ALASKA on time San_Diego 212
4 ALASKA on time San_Francisco 503
5 ALASKA on time Seattle 1841
6 ALASKA delayed Los_Angeles 62
7 ALASKA delayed Phoenix 12
8 ALASKA delayed San_Diego 20
9 ALASKA delayed San_Francisco 102
10 ALASKA delayed Seattle 305
11 AM WEST on time Los_Angeles 694
12 AM WEST on time Phoenix 4840
13 AM WEST on time San_Diego 383
14 AM WEST on time San_Francisco 320
15 AM WEST on time Seattle 201
16 AM WEST delayed Los_Angeles 117
17 AM WEST delayed Phoenix 415
18 AM WEST delayed San_Diego 65
19 AM WEST delayed San_Francisco 129
20 AM WEST delayed Seattle 61
# A tibble: 2 × 4
airline total_flights total_delayed overall_delay_rate
<chr> <int> <int> <dbl>
1 ALASKA 3775 501 13.3
2 AM WEST 7225 787 10.9
ggplot(airlines_wide, aes(x = destination, y = delay_rate, fill = airline)) +geom_bar(stat ="identity", position ="dodge") +labs(title ="Delay Rate by Destination and Airline",x ="Destination",y ="Delay Rate (%)",fill ="Airline" ) +theme_minimal() +scale_fill_manual(values =c("steelblue", "tomato"))
Conclusions
After tidying and transforming the data, the analysis reveals an interesting paradox. On the surface, AM WEST appears to have a lower overall delay rate than ALASKA. However, when broken down by destination, ALASKA actually outperforms AM WEST at every single city.