Assignment #5A

Author

Michael Mayne

Assignment #5A

Pre Coding- Approach

The approach for this assignment is pretty self explanatory but I first plan on copying the core data onto an Excel sheet and save as a CSV. Which i will load to R, I plan to collect the data. Then use the tidyverse function pivot_longer to set the table as a long format. I doubt tit will be clean at that stage so I will use filter + mutate to make additional columns that separate the value by on time and delayed as seen by the wide table given. Then finally create charts to express the % on time arrival for both airlines.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Coding Base

Recreating the Dataset

Our data was created in excel and by making the value on the sheet with the gap and missing values as shown. The data was saved a CSV and loaded to the github reposititory. Then the data set was loaded below for access for my self and others.

calAirline <- read_csv("https://raw.githubusercontent.com/Mayneman000/DATA607Assignment/refs/heads/main/CalAirline.csv")

New names:
Rows: 5 Columns: 7
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(2): ...1, ...2 dbl (5): Los Angeles, Pheonix, San Diego, San Franciso, Seattle
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
• `` -> `...2`

print(calAirline)

# A tibble: 5 × 7
  ...1    ...2    `Los Angeles` Pheonix `San Diego` `San Franciso` Seattle
  <chr>   <chr>           <dbl>   <dbl>       <dbl>          <dbl>   <dbl>
1 Alaska  on time           497     221         212            503    1841
2 <NA>    delayed            62      12          20            102     305
3 <NA>    <NA>               NA      NA          NA             NA      NA
4 AM WEST on time           694    4840         383            320     201
5 <NA>    delayed           117     415          65            129      61

Get Rid of Unnecessary Rows

Using a simple cut function to remove the 3 row helps us in the long run as there is no vaible data in teh 3 row to lose. So it will honestly get in the way of our cleaning.

calAirline <- calAirline [-3,]

print(calAirline)

# A tibble: 4 × 7
  ...1    ...2    `Los Angeles` Pheonix `San Diego` `San Franciso` Seattle
  <chr>   <chr>           <dbl>   <dbl>       <dbl>          <dbl>   <dbl>
1 Alaska  on time           497     221         212            503    1841
2 <NA>    delayed            62      12          20            102     305
3 AM WEST on time           694    4840         383            320     201
4 <NA>    delayed           117     415          65            129      61

#Now we longer have a row of missing values

Extending the Dataset: The data set tends to repeat and have multiple column represent a value so, I intend to pivot the data set so that it does not come of as that long.

calAirlineLong <- calAirline %>%
  pivot_longer(
    cols= !("...1": "...2"),
    names_to = c("City"),
    values_to = ("Count"),
    values_drop_na = TRUE
  )

Changing the name of the Air line columns, this changes the names of the columns to the airline and status instead of 1 & 2.

calAirlineLong <- calAirlineLong %>%
  rename("Airline"= "...1", "Status"= "...2") 

calAirlineLong <- calAirlineLong %>%
  fill("Airline", .direction = "down")

calAirlineClean<- calAirlineLong

print(calAirlineClean)

# A tibble: 20 × 4
   Airline Status  City         Count
   <chr>   <chr>   <chr>        <dbl>
 1 Alaska  on time Los Angeles    497
 2 Alaska  on time Pheonix        221
 3 Alaska  on time San Diego      212
 4 Alaska  on time San Franciso   503
 5 Alaska  on time Seattle       1841
 6 Alaska  delayed Los Angeles     62
 7 Alaska  delayed Pheonix         12
 8 Alaska  delayed San Diego       20
 9 Alaska  delayed San Franciso   102
10 Alaska  delayed Seattle        305
11 AM WEST on time Los Angeles    694
12 AM WEST on time Pheonix       4840
13 AM WEST on time San Diego      383
14 AM WEST on time San Franciso   320
15 AM WEST on time Seattle        201
16 AM WEST delayed Los Angeles    117
17 AM WEST delayed Pheonix        415
18 AM WEST delayed San Diego       65
19 AM WEST delayed San Franciso   129
20 AM WEST delayed Seattle         61

Break to Load our CSV

The CSV will be publish on Github; but using this code anyone should be able to get access to their own copy of this cleaned data base by running the code here.

write.csv(calAirlineClean, file = "Cal_Airline_Cleaned.csv")

Descriptive Analysis

Now that our data is in tidy format, we can set up an analysis in order to understand the performance of our two airlines Alaska and AMWEST. Staring with their initial overall performance:

Overall Performance of both Airlines

calAirlineClean %>%
  filter(Airline == "Alaska") %>%
  group_by(Status) %>%
  summarize(flights = sum(Count)) %>%
  mutate(totalflights = sum(flights))

# A tibble: 2 × 3
  Status  flights totalflights
  <chr>     <dbl>        <dbl>
1 delayed     501         3775
2 on time    3274         3775

Result of Alaska

TimelyAlaska <- (3274/3775)*100
print(TimelyAlaska)

[1] 86.72848

DelayedAlaska <- (100 - TimelyAlaska)
#Timely Alaska is the amount in which Alaska is on time 

print(DelayedAlaska)

[1] 13.27152

calAirlineClean %>%
  filter(Airline == "AM WEST") %>%
  group_by(Status) %>%
  summarize(flights = sum(Count)) %>%
  mutate(totalflights = sum(flights))

# A tibble: 2 × 3
  Status  flights totalflights
  <chr>     <dbl>        <dbl>
1 delayed     787         7225
2 on time    6438         7225

Result of AMWEST

TimelyAMWEST <- (6438/7225)*100
print(TimelyAMWEST)

[1] 89.10727

#Timely AMWEST show the percentage in which AM WEST is on time
DelayedAMWEST <- (100 - TimelyAMWEST)

print(DelayedAMWEST)

[1] 10.89273

Chart of the Airline (Overall)

Arrival_All <- data.frame(
  Time = c("On Time", "Delayed"),
  Alaska = c (TimelyAlaska, DelayedAlaska),
  AMWEST = c (TimelyAMWEST, DelayedAMWEST)
)

print(Arrival_All)

     Time   Alaska   AMWEST
1 On Time 86.72848 89.10727
2 Delayed 13.27152 10.89273

We can conclude that overall AMWEST has better net track record than Alaska Airline when it comes to flight times overall with Alaska boasts about a 87% on-time arrival rate but fall slightly behind AMWEST with 89%. Although there is more than can be done like comparing the flight times for different cities.

Grouping Alaska & AMWEST by City

calAirlineClean %>%
  group_by(City,Airline, Status) %>%
  filter(Airline == "Alaska") %>%
  summarize(flights = sum(Count))

`summarise()` has grouped output by 'City', 'Airline'. You can override using
the `.groups` argument.

# A tibble: 10 × 4
# Groups:   City, Airline [5]
   City         Airline Status  flights
   <chr>        <chr>   <chr>     <dbl>
 1 Los Angeles  Alaska  delayed      62
 2 Los Angeles  Alaska  on time     497
 3 Pheonix      Alaska  delayed      12
 4 Pheonix      Alaska  on time     221
 5 San Diego    Alaska  delayed      20
 6 San Diego    Alaska  on time     212
 7 San Franciso Alaska  delayed     102
 8 San Franciso Alaska  on time     503
 9 Seattle      Alaska  delayed     305
10 Seattle      Alaska  on time    1841

Airline_Performance_City <- calAirlineClean %>%
  group_by(City, Airline) %>%
  mutate (Rate = Count/sum(Count) * 100) 

print(Airline_Performance_City)

# A tibble: 20 × 5
# Groups:   City, Airline [10]
   Airline Status  City         Count  Rate
   <chr>   <chr>   <chr>        <dbl> <dbl>
 1 Alaska  on time Los Angeles    497 88.9 
 2 Alaska  on time Pheonix        221 94.8 
 3 Alaska  on time San Diego      212 91.4 
 4 Alaska  on time San Franciso   503 83.1 
 5 Alaska  on time Seattle       1841 85.8 
 6 Alaska  delayed Los Angeles     62 11.1 
 7 Alaska  delayed Pheonix         12  5.15
 8 Alaska  delayed San Diego       20  8.62
 9 Alaska  delayed San Franciso   102 16.9 
10 Alaska  delayed Seattle        305 14.2 
11 AM WEST on time Los Angeles    694 85.6 
12 AM WEST on time Pheonix       4840 92.1 
13 AM WEST on time San Diego      383 85.5 
14 AM WEST on time San Franciso   320 71.3 
15 AM WEST on time Seattle        201 76.7 
16 AM WEST delayed Los Angeles    117 14.4 
17 AM WEST delayed Pheonix        415  7.90
18 AM WEST delayed San Diego       65 14.5 
19 AM WEST delayed San Franciso   129 28.7 
20 AM WEST delayed Seattle         61 23.3

Airline_Performance_City %>%
  group_by(Airline, Status, City)

# A tibble: 20 × 5
# Groups:   Airline, Status, City [20]
   Airline Status  City         Count  Rate
   <chr>   <chr>   <chr>        <dbl> <dbl>
 1 Alaska  on time Los Angeles    497 88.9 
 2 Alaska  on time Pheonix        221 94.8 
 3 Alaska  on time San Diego      212 91.4 
 4 Alaska  on time San Franciso   503 83.1 
 5 Alaska  on time Seattle       1841 85.8 
 6 Alaska  delayed Los Angeles     62 11.1 
 7 Alaska  delayed Pheonix         12  5.15
 8 Alaska  delayed San Diego       20  8.62
 9 Alaska  delayed San Franciso   102 16.9 
10 Alaska  delayed Seattle        305 14.2 
11 AM WEST on time Los Angeles    694 85.6 
12 AM WEST on time Pheonix       4840 92.1 
13 AM WEST on time San Diego      383 85.5 
14 AM WEST on time San Franciso   320 71.3 
15 AM WEST on time Seattle        201 76.7 
16 AM WEST delayed Los Angeles    117 14.4 
17 AM WEST delayed Pheonix        415  7.90
18 AM WEST delayed San Diego       65 14.5 
19 AM WEST delayed San Franciso   129 28.7 
20 AM WEST delayed Seattle         61 23.3

Airline_Delays <- Airline_Performance_City %>%
  filter(Status == "delayed")


ggplot(data = Airline_Delays, aes(x= City, y= Rate, fill= Airline))+
  geom_bar(stat= "identity", position = "dodge") +
  scale_fill_manual(values = c("Alaska" = "blue", "AM WEST" = "darkgreen"))+
  labs(y= "% of Flights", x= "Arrival City", title= "Percentage of Delayed flights in by AMWEST + ALASKA")

Conclusions

As we can see in the chart above, the success rate of the airline is actually a bit more complex that we assume at first. AM WEST is shown to initially have a higher arrival rate and overall that is true. Although by looking at the information deeper we can tell that the the arrival time for cities tells a different story. AM West shows a higher percentage of delays when showing individuals cities. We can attribute AM WEST’s higher success rate because its overall high number of flights in and out.