Task is to create a csv file that contain flight info by two airline company. we will intentionally create an untidy data set first, and do some data cleaning on data set structure, handle some missing data, etc. Save the tidy_data as a new csv file and perform data analysis.
For analysis, a bar chart will be used to show percentage and flight counts of both delays or arrival rates for two airlines overall. A table will be used to show result of percentage and flight counts of both delays or arrival rates for two airlines across five cities.
Loading Data
A copy of csv was created with the same format from the assignment, uploaded to the github repository. Later a tidy version will also be uploaded as well.
X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 delayed 62 12 20 102 305
3 NA NA NA NA NA
4 AM WEST on time 694 4840 383 320 201
5 delayed 117 415 65 129 61
Tidy Data
Original data used “wide” structure, multiple destinations were spread out in multiple columns, some rows missing values, or some columns missing name. Here are procedures to be applied to tidy the data: add names to the first two columns: airline company and flight status; Remove any rows with missing values, convert all numerical strings to be of numeric type by removing comma in some numbers; Use the destination names to convert the data from wide to long, for consistency fill in the airline names on rows that have missing values and do the same for flight status.
colnames(untidy_data)[1] ="airline"colnames(untidy_data)[2] ="flight Status"# Missing data, remove row of blanksdestination_names <-colnames(untidy_data[3:length(untidy_data)])untidy_data <- untidy_data |>na.omit()# Remove the comma from some large numbers untidy_data <- untidy_data |>mutate(across(all_of(destination_names), ~as.numeric(gsub(",", "", .))))print(untidy_data)
airline flight Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 delayed 62 12 20 102 305
4 AM WEST on time 694 4840 383 320 201
5 delayed 117 415 65 129 61
Fill in missing airline values by fill downwards using airlines from above rows.
# For consistency, combine all destination under new column perspective flight into "Num Flights" columnlong_data <- untidy_data |>pivot_longer(cols =all_of(destination_names), names_to ="destination", values_to ="num flights")# Clean column names using janitor if (require(janitor, quietly =TRUE)) { long_data <- long_data |> janitor::clean_names()} else {colnames(long_data) <-tolower(gsub("[^[:alnum:]]", "_", colnames(long_data)))}# Fill missing airline values, Convert empty airline strings to NA then fill downwards. tidy_data <- long_data |>mutate(airline =ifelse(airline ==""|is.na(airline), NA, airline) ) |>fill(airline) |>mutate(destination =str_replace(destination, "\\.", " ") )print(tidy_data)
# A tibble: 20 × 4
airline flight_status destination num_flights
<chr> <chr> <chr> <dbl>
1 ALASKA on time Los Angeles 497
2 ALASKA on time Phoenix 221
3 ALASKA on time San Diego 212
4 ALASKA on time San Francisco 503
5 ALASKA on time Seattle 1841
6 ALASKA delayed Los Angeles 62
7 ALASKA delayed Phoenix 12
8 ALASKA delayed San Diego 20
9 ALASKA delayed San Francisco 102
10 ALASKA delayed Seattle 305
11 AM WEST on time Los Angeles 694
12 AM WEST on time Phoenix 4840
13 AM WEST on time San Diego 383
14 AM WEST on time San Francisco 320
15 AM WEST on time Seattle 201
16 AM WEST delayed Los Angeles 117
17 AM WEST delayed Phoenix 415
18 AM WEST delayed San Diego 65
19 AM WEST delayed San Francisco 129
20 AM WEST delayed Seattle 61
After transform data from wide to long format, save a copy of tidy_data in the same folder and github repository.
write.csv(tidy_data, "tidy_data.csv", row.names =FALSE)print("File saved as 'tidy_data.csv' in the current folder")
[1] "File saved as 'tidy_data.csv' in the current folder"
Data Analysis
Stacked bar chart to compare percentage of delays and on-time rate for tow airlines overall. As showing in the result below, Alaska airline has less flight counts overall and seems to have higher delay rate, 13%.
List statistics in a table to compare percentage of delays and arrival rates for tow airlines across five cities. Across all destination, Alaska airline actually has lower delay percentages than AM West airline. This seems to be a different conclusion than the bar chart above.
Flight Performance Comparison by Airline and Destination
Flight Counts
Percentages
Airline
Destination
Total Flights
Delayed
On Time
Delay %
On Time %
ALASKA
Los Angeles
559
62
497
11
89
AM WEST
Los Angeles
811
117
694
14
86
ALASKA
Phoenix
233
12
221
5
95
AM WEST
Phoenix
5255
415
4840
8
92
ALASKA
San Diego
232
20
212
9
91
AM WEST
San Diego
448
65
383
15
85
ALASKA
San Francisco
605
102
503
17
83
AM WEST
San Francisco
449
129
320
29
71
ALASKA
Seattle
2146
305
1841
14
86
AM WEST
Seattle
262
61
201
23
77
Summary
Two analysis appear to have two different conclusion. Bar chart shows overall AM West has lower delayed rate (11%) compare to Alaska (13%), however different conclusion can be seen from the table where AM had higher delayed rate across all destinations. This is an example of the topic we discussed during class - Simpson’s paradox.
After closer look, main reason of the discrepancy came caused by the result in destination Phoenix. In phoenix, AM West had vast majority (73%) of its flights operate on this route and only 8% delay, whereas Alaska’s flight volume is more evenly distributed across cities with lower delay rate at every individual destination. The result in Phoenix made AM West appear to have less delay rate overall.