This assignment is to observe and compare the arrival delays of flights for two airlines at five destinations. Also, I need to organize the provided chart and reshape the data into a structured data frame for use.
Create CSV file by mirroring all of the information from image.
Input file:"Image_dataframe.csv” from https://github.com/ZIXIANNOW/DATA607week5/blob/main/Image_dataframe.csv
url <- 'https://raw.githubusercontent.com/ZIXIANNOW/DATA607week5/main/Image_dataframe.csv'
database <- read_csv(url)
## New names:
## Rows: 5 Columns: 7
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (2): ...1, ...2 dbl (5): Los Angeles, Phoenix, San Diego, San Francisco,
## Seattle
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
## • `` -> `...2`
database
## # A tibble: 5 × 7
## ...1 ...2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
There is one separator line should be removed and two airline names are missing that should be added back.
data_clean <- database[!apply(is.na(database) | database == "NA", 1, all), ]
data_clean
## # A tibble: 4 × 7
## ...1 ...2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 <NA> delayed 117 415 65 129 61
names(data_clean)[1] <- "Airline"
names(data_clean)[2] <- "Flight_Status"
first_na_index <- which(is.na(data_clean$Airline))[1]
second_na_index <- which(is.na(data_clean$Airline))[2]
data_clean$Airline[first_na_index] <- "ALASKA"
data_clean$Airline[second_na_index] <- "AM WEST"
data_clean
## # A tibble: 4 × 7
## Airline Flight_Status `Los Angeles` Phoenix `San Diego` `San Francisco`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503
## 2 ALASKA delayed 62 12 20 102
## 3 AM WEST on time 694 4840 383 320
## 4 AM WEST delayed 117 415 65 129
## # ℹ 1 more variable: Seattle <dbl>
From here, will mutate variables for better counting.
flights <- pivot_longer(data_clean, cols = 3:7, names_to = "Destination", values_to = "Count")
flights
## # A tibble: 20 × 4
## Airline Flight_Status Destination Count
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San Diego 212
## 4 ALASKA on time San Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San Diego 20
## 9 ALASKA delayed San Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San Diego 383
## 14 AM WEST on time San Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San Diego 65
## 19 AM WEST delayed San Francisco 129
## 20 AM WEST delayed Seattle 61
flight_update <- flights %>%
pivot_wider(names_from = Flight_Status, values_from = Count)
flight_update
## # A tibble: 10 × 4
## Airline Destination `on time` delayed
## <chr> <chr> <dbl> <dbl>
## 1 ALASKA Los Angeles 497 62
## 2 ALASKA Phoenix 221 12
## 3 ALASKA San Diego 212 20
## 4 ALASKA San Francisco 503 102
## 5 ALASKA Seattle 1841 305
## 6 AM WEST Los Angeles 694 117
## 7 AM WEST Phoenix 4840 415
## 8 AM WEST San Diego 383 65
## 9 AM WEST San Francisco 320 129
## 10 AM WEST Seattle 201 61
Now adding additional columns for more intuitive comparison.
flight_sum <- flight_update %>%
mutate(total_flight = rowSums(select(.,"on time", "delayed")))
flight_sum
## # A tibble: 10 × 5
## Airline Destination `on time` delayed total_flight
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 ALASKA Los Angeles 497 62 559
## 2 ALASKA Phoenix 221 12 233
## 3 ALASKA San Diego 212 20 232
## 4 ALASKA San Francisco 503 102 605
## 5 ALASKA Seattle 1841 305 2146
## 6 AM WEST Los Angeles 694 117 811
## 7 AM WEST Phoenix 4840 415 5255
## 8 AM WEST San Diego 383 65 448
## 9 AM WEST San Francisco 320 129 449
## 10 AM WEST Seattle 201 61 262
flight_subsum <- flight_sum %>%
group_by(Airline,Destination) %>%
summarise(delayed_total = sum(delayed) , total = sum(total_flight), delayed_rate = sum(delayed) / sum(total_flight)) %>%
arrange(Airline,delayed_rate)
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
flight_subsum
## # A tibble: 10 × 5
## # Groups: Airline [2]
## Airline Destination delayed_total total delayed_rate
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 ALASKA Phoenix 12 233 0.0515
## 2 ALASKA San Diego 20 232 0.0862
## 3 ALASKA Los Angeles 62 559 0.111
## 4 ALASKA Seattle 305 2146 0.142
## 5 ALASKA San Francisco 102 605 0.169
## 6 AM WEST Phoenix 415 5255 0.0790
## 7 AM WEST Los Angeles 117 811 0.144
## 8 AM WEST San Diego 65 448 0.145
## 9 AM WEST Seattle 61 262 0.233
## 10 AM WEST San Francisco 129 449 0.287
ggplot(flight_subsum, aes(x = Airline, y = delayed_rate, fill = Destination)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Arrival Delay Percentage by Airline", x = "Airline", y = "Delayed_Rate") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
flight_sum %>%
group_by(Airline) %>%
summarise(delayed_total = sum(delayed) , total = sum(total_flight), delayed_rate = sum(delayed) / sum(total_flight)) %>%
arrange(Airline)
## # A tibble: 2 × 4
## Airline delayed_total total delayed_rate
## <chr> <dbl> <dbl> <dbl>
## 1 ALASKA 501 3775 0.133
## 2 AM WEST 787 7225 0.109
Overall, although AM WEST has a higher total number of delayed flights compared to ALASKA, it doesn’t mean it performs worse than ALASKA. This difference is primarily due to its higher total number of flights. From the delayed rate, we can observe that AM WEST has a rate of only 0.109, which is lower than ALASKA’s rate of 0.133. In other words, AM WEST handles timely flights better than ALASKA. On the other hand, looking at the destinations, both airlines coincidentally have the highest number of delayed flights in San Francisco and the lowest in Phoenix. This indirectly reflects the busyness and air traffic situation at these destinations.