Introduction

This assignment is to observe and compare the arrival delays of flights for two airlines at five destinations. Also, I need to organize the provided chart and reshape the data into a structured data frame for use.

Data Acquisition

Create CSV file by mirroring all of the information from image.

Input file:"Image_dataframe.csv” from https://github.com/ZIXIANNOW/DATA607week5/blob/main/Image_dataframe.csv

url <- 'https://raw.githubusercontent.com/ZIXIANNOW/DATA607week5/main/Image_dataframe.csv'
database <- read_csv(url)
## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (5): Los Angeles, Phoenix, San Diego, San Francisco,
## Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
database
## # A tibble: 5 × 7
##   ...1    ...2    `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 <NA>    <NA>               NA      NA          NA              NA      NA
## 4 AM WEST on time           694    4840         383             320     201
## 5 <NA>    delayed           117     415          65             129      61

Data Correction

There is one separator line should be removed and two airline names are missing that should be added back.

data_clean <- database[!apply(is.na(database) | database == "NA", 1, all), ]
data_clean
## # A tibble: 4 × 7
##   ...1    ...2    `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 AM WEST on time           694    4840         383             320     201
## 4 <NA>    delayed           117     415          65             129      61
names(data_clean)[1] <- "Airline"
names(data_clean)[2] <- "Flight_Status"
first_na_index <- which(is.na(data_clean$Airline))[1]
second_na_index <- which(is.na(data_clean$Airline))[2]
data_clean$Airline[first_na_index] <- "ALASKA"
data_clean$Airline[second_na_index] <- "AM WEST"
data_clean
## # A tibble: 4 × 7
##   Airline Flight_Status `Los Angeles` Phoenix `San Diego` `San Francisco`
##   <chr>   <chr>                 <dbl>   <dbl>       <dbl>           <dbl>
## 1 ALASKA  on time                 497     221         212             503
## 2 ALASKA  delayed                  62      12          20             102
## 3 AM WEST on time                 694    4840         383             320
## 4 AM WEST delayed                 117     415          65             129
## # ℹ 1 more variable: Seattle <dbl>

From here, will mutate variables for better counting.

flights <- pivot_longer(data_clean, cols = 3:7, names_to = "Destination", values_to = "Count")
flights
## # A tibble: 20 × 4
##    Airline Flight_Status Destination   Count
##    <chr>   <chr>         <chr>         <dbl>
##  1 ALASKA  on time       Los Angeles     497
##  2 ALASKA  on time       Phoenix         221
##  3 ALASKA  on time       San Diego       212
##  4 ALASKA  on time       San Francisco   503
##  5 ALASKA  on time       Seattle        1841
##  6 ALASKA  delayed       Los Angeles      62
##  7 ALASKA  delayed       Phoenix          12
##  8 ALASKA  delayed       San Diego        20
##  9 ALASKA  delayed       San Francisco   102
## 10 ALASKA  delayed       Seattle         305
## 11 AM WEST on time       Los Angeles     694
## 12 AM WEST on time       Phoenix        4840
## 13 AM WEST on time       San Diego       383
## 14 AM WEST on time       San Francisco   320
## 15 AM WEST on time       Seattle         201
## 16 AM WEST delayed       Los Angeles     117
## 17 AM WEST delayed       Phoenix         415
## 18 AM WEST delayed       San Diego        65
## 19 AM WEST delayed       San Francisco   129
## 20 AM WEST delayed       Seattle          61
flight_update <-  flights %>% 
             pivot_wider(names_from = Flight_Status, values_from = Count)
flight_update
## # A tibble: 10 × 4
##    Airline Destination   `on time` delayed
##    <chr>   <chr>             <dbl>   <dbl>
##  1 ALASKA  Los Angeles         497      62
##  2 ALASKA  Phoenix             221      12
##  3 ALASKA  San Diego           212      20
##  4 ALASKA  San Francisco       503     102
##  5 ALASKA  Seattle            1841     305
##  6 AM WEST Los Angeles         694     117
##  7 AM WEST Phoenix            4840     415
##  8 AM WEST San Diego           383      65
##  9 AM WEST San Francisco       320     129
## 10 AM WEST Seattle             201      61

Comparison

Now adding additional columns for more intuitive comparison.

flight_sum <- flight_update %>%
  mutate(total_flight = rowSums(select(.,"on time", "delayed")))
flight_sum
## # A tibble: 10 × 5
##    Airline Destination   `on time` delayed total_flight
##    <chr>   <chr>             <dbl>   <dbl>        <dbl>
##  1 ALASKA  Los Angeles         497      62          559
##  2 ALASKA  Phoenix             221      12          233
##  3 ALASKA  San Diego           212      20          232
##  4 ALASKA  San Francisco       503     102          605
##  5 ALASKA  Seattle            1841     305         2146
##  6 AM WEST Los Angeles         694     117          811
##  7 AM WEST Phoenix            4840     415         5255
##  8 AM WEST San Diego           383      65          448
##  9 AM WEST San Francisco       320     129          449
## 10 AM WEST Seattle             201      61          262
flight_subsum <- flight_sum %>%
group_by(Airline,Destination) %>%
summarise(delayed_total = sum(delayed) , total = sum(total_flight), delayed_rate = sum(delayed) / sum(total_flight)) %>%
arrange(Airline,delayed_rate)
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
flight_subsum
## # A tibble: 10 × 5
## # Groups:   Airline [2]
##    Airline Destination   delayed_total total delayed_rate
##    <chr>   <chr>                 <dbl> <dbl>        <dbl>
##  1 ALASKA  Phoenix                  12   233       0.0515
##  2 ALASKA  San Diego                20   232       0.0862
##  3 ALASKA  Los Angeles              62   559       0.111 
##  4 ALASKA  Seattle                 305  2146       0.142 
##  5 ALASKA  San Francisco           102   605       0.169 
##  6 AM WEST Phoenix                 415  5255       0.0790
##  7 AM WEST Los Angeles             117   811       0.144 
##  8 AM WEST San Diego                65   448       0.145 
##  9 AM WEST Seattle                  61   262       0.233 
## 10 AM WEST San Francisco           129   449       0.287
ggplot(flight_subsum, aes(x = Airline, y = delayed_rate, fill = Destination)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Arrival Delay Percentage by Airline", x = "Airline", y = "Delayed_Rate") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

flight_sum %>%
group_by(Airline) %>%
summarise(delayed_total = sum(delayed) , total = sum(total_flight), delayed_rate = sum(delayed) / sum(total_flight)) %>%
arrange(Airline)
## # A tibble: 2 × 4
##   Airline delayed_total total delayed_rate
##   <chr>           <dbl> <dbl>        <dbl>
## 1 ALASKA            501  3775        0.133
## 2 AM WEST           787  7225        0.109

Conclusion

Overall, although AM WEST has a higher total number of delayed flights compared to ALASKA, it doesn’t mean it performs worse than ALASKA. This difference is primarily due to its higher total number of flights. From the delayed rate, we can observe that AM WEST has a rate of only 0.109, which is lower than ALASKA’s rate of 0.133. In other words, AM WEST handles timely flights better than ALASKA. On the other hand, looking at the destinations, both airlines coincidentally have the highest number of delayed flights in San Francisco and the lowest in Phoenix. This indirectly reflects the busyness and air traffic situation at these destinations.