The approach for this assignment is pretty self explanatory but I first plan on copying the core data onto an Excel sheet and save as a CSV. Which i will load to R, I plan to collect the data. Then use the tidyverse function pivot_longer to set the table as a long format. I doubt tit will be clean at that stage so I will use filter + mutate to make additional columns that separate the value by on time and delayed as seen by the wide table given. Then finally create charts to express the % on time arrival for both airlines.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Coding Base
Recreating the Dataset
Our data was created in excel and by making the value on the sheet with the gap and missing values as shown. The data was saved a CSV and loaded to the github reposititory. Then the data set was loaded below for access for my self and others.
New names:
Rows: 5 Columns: 7
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(2): ...1, ...2 dbl (5): Los Angeles, Pheonix, San Diego, San Franciso, Seattle
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
• `` -> `...2`
print(calAirline)
# A tibble: 5 × 7
...1 ...2 `Los Angeles` Pheonix `San Diego` `San Franciso` Seattle
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alaska on time 497 221 212 503 1841
2 <NA> delayed 62 12 20 102 305
3 <NA> <NA> NA NA NA NA NA
4 AM WEST on time 694 4840 383 320 201
5 <NA> delayed 117 415 65 129 61
Get Rid of Unnecessary Rows
Using a simple cut function to remove the 3 row helps us in the long run as there is no vaible data in teh 3 row to lose. So it will honestly get in the way of our cleaning.
calAirline <- calAirline [-3,]
print(calAirline)
# A tibble: 4 × 7
...1 ...2 `Los Angeles` Pheonix `San Diego` `San Franciso` Seattle
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alaska on time 497 221 212 503 1841
2 <NA> delayed 62 12 20 102 305
3 AM WEST on time 694 4840 383 320 201
4 <NA> delayed 117 415 65 129 61
#Now we longer have a row of missing values
Extending the Dataset: The data set tends to repeat and have multiple column represent a value so, I intend to pivot the data set so that it does not come of as that long.
# A tibble: 20 × 4
Airline Status City Count
<chr> <chr> <chr> <dbl>
1 Alaska on time Los Angeles 497
2 Alaska on time Pheonix 221
3 Alaska on time San Diego 212
4 Alaska on time San Franciso 503
5 Alaska on time Seattle 1841
6 Alaska delayed Los Angeles 62
7 Alaska delayed Pheonix 12
8 Alaska delayed San Diego 20
9 Alaska delayed San Franciso 102
10 Alaska delayed Seattle 305
11 AM WEST on time Los Angeles 694
12 AM WEST on time Pheonix 4840
13 AM WEST on time San Diego 383
14 AM WEST on time San Franciso 320
15 AM WEST on time Seattle 201
16 AM WEST delayed Los Angeles 117
17 AM WEST delayed Pheonix 415
18 AM WEST delayed San Diego 65
19 AM WEST delayed San Franciso 129
20 AM WEST delayed Seattle 61
Break to Load our CSV
The CSV will be publish on Github; but using this code anyone should be able to get access to their own copy of this cleaned data base by running the code here.
Now that our data is in tidy format, we can set up an analysis in order to understand the performance of our two airlines Alaska and AMWEST. Staring with their initial overall performance:
Time Alaska AMWEST
1 On Time 86.72848 89.10727
2 Delayed 13.27152 10.89273
We can conclude that overall AMWEST has better net track record than Alaska Airline when it comes to flight times overall with Alaska boasts about a 87% on-time arrival rate but fall slightly behind AMWEST with 89%. Although there is more than can be done like comparing the flight times for different cities.
`summarise()` has grouped output by 'City', 'Airline'. You can override using
the `.groups` argument.
# A tibble: 10 × 4
# Groups: City, Airline [5]
City Airline Status flights
<chr> <chr> <chr> <dbl>
1 Los Angeles Alaska delayed 62
2 Los Angeles Alaska on time 497
3 Pheonix Alaska delayed 12
4 Pheonix Alaska on time 221
5 San Diego Alaska delayed 20
6 San Diego Alaska on time 212
7 San Franciso Alaska delayed 102
8 San Franciso Alaska on time 503
9 Seattle Alaska delayed 305
10 Seattle Alaska on time 1841
# A tibble: 20 × 5
# Groups: City, Airline [10]
Airline Status City Count Rate
<chr> <chr> <chr> <dbl> <dbl>
1 Alaska on time Los Angeles 497 88.9
2 Alaska on time Pheonix 221 94.8
3 Alaska on time San Diego 212 91.4
4 Alaska on time San Franciso 503 83.1
5 Alaska on time Seattle 1841 85.8
6 Alaska delayed Los Angeles 62 11.1
7 Alaska delayed Pheonix 12 5.15
8 Alaska delayed San Diego 20 8.62
9 Alaska delayed San Franciso 102 16.9
10 Alaska delayed Seattle 305 14.2
11 AM WEST on time Los Angeles 694 85.6
12 AM WEST on time Pheonix 4840 92.1
13 AM WEST on time San Diego 383 85.5
14 AM WEST on time San Franciso 320 71.3
15 AM WEST on time Seattle 201 76.7
16 AM WEST delayed Los Angeles 117 14.4
17 AM WEST delayed Pheonix 415 7.90
18 AM WEST delayed San Diego 65 14.5
19 AM WEST delayed San Franciso 129 28.7
20 AM WEST delayed Seattle 61 23.3
# A tibble: 20 × 5
# Groups: Airline, Status, City [20]
Airline Status City Count Rate
<chr> <chr> <chr> <dbl> <dbl>
1 Alaska on time Los Angeles 497 88.9
2 Alaska on time Pheonix 221 94.8
3 Alaska on time San Diego 212 91.4
4 Alaska on time San Franciso 503 83.1
5 Alaska on time Seattle 1841 85.8
6 Alaska delayed Los Angeles 62 11.1
7 Alaska delayed Pheonix 12 5.15
8 Alaska delayed San Diego 20 8.62
9 Alaska delayed San Franciso 102 16.9
10 Alaska delayed Seattle 305 14.2
11 AM WEST on time Los Angeles 694 85.6
12 AM WEST on time Pheonix 4840 92.1
13 AM WEST on time San Diego 383 85.5
14 AM WEST on time San Franciso 320 71.3
15 AM WEST on time Seattle 201 76.7
16 AM WEST delayed Los Angeles 117 14.4
17 AM WEST delayed Pheonix 415 7.90
18 AM WEST delayed San Diego 65 14.5
19 AM WEST delayed San Franciso 129 28.7
20 AM WEST delayed Seattle 61 23.3
Airline_Delays <- Airline_Performance_City %>%filter(Status =="delayed")ggplot(data = Airline_Delays, aes(x= City, y= Rate, fill= Airline))+geom_bar(stat="identity", position ="dodge") +scale_fill_manual(values =c("Alaska"="blue", "AM WEST"="darkgreen"))+labs(y="% of Flights", x="Arrival City", title="Percentage of Delayed flights in by AMWEST + ALASKA")
Conclusions
As we can see in the chart above, the success rate of the airline is actually a bit more complex that we assume at first. AM WEST is shown to initially have a higher arrival rate and overall that is true. Although by looking at the information deeper we can tell that the the arrival time for cities tells a different story. AM West shows a higher percentage of delays when showing individuals cities. We can attribute AM WEST’s higher success rate because its overall high number of flights in and out.