Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. Use a “wide” structure similar to how the information appears below, so that tidying and transformations can be used on the data.
# Libraries to be used
library(tidyverse)Read the arrival data from your .CSV file
src_file_path <- "C:\\tmp\\Airline_Arrival_Delays.csv"
untidy_flights <- read_csv(src_file_path)Take a peak at the data read from the CSV file.
| Airline | Flight Status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| NA | NA | NA | NA | NA | NA | NA |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
Use “tidyr” and “dplyr” as needed to tidy up and transform the data
flights <- untidy_flights %>%
# remove blank lines
drop_na() %>%
# change format from wide to long to tidy up data
pivot_longer(`Los Angeles`:`Seattle`, names_to = "destination", values_to = "count") %>%
# rename column names
rename(airline = Airline, flight_status = `Flight Status`)Show a glimpse of the transformed data.
| airline | flight_status | destination | count |
|---|---|---|---|
| ALASKA | on time | Los Angeles | 497 |
| ALASKA | on time | Phoenix | 221 |
| ALASKA | on time | San Diego | 212 |
| ALASKA | on time | San Francisco | 503 |
| ALASKA | on time | Seattle | 1841 |
| ALASKA | delayed | Los Angeles | 62 |
| ALASKA | delayed | Phoenix | 12 |
| ALASKA | delayed | San Diego | 20 |
| ALASKA | delayed | San Francisco | 102 |
| ALASKA | delayed | Seattle | 305 |
Perform data analysis to compare the arrival delays for the two airlines.
delayed_flights <- flights %>%
filter(flight_status == "delayed")Which airline has the most overall delays and which one the least?
by_airline <- delayed_flights %>%
group_by(airline) %>%
summarise(total_count = sum(count)) %>%
arrange(desc(total_count))Overall, “AM West” airline has the highest number of delays compared to “ALASKA”, which has the lowest.
| airline | total_count |
|---|---|
| AM WEST | 787 |
| ALASKA | 501 |
We can see the discrepancy between each airline’s delays graphically
ggplot(by_airline, aes(x = airline, y = total_count, fill = airline)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
ggtitle("Total Overall Flight Delays Per Airline") +
labs(x = "Airport Destinations", y = "Total Delays") +
theme(legend.position = "none")Which airport destinations had the most and least delays?
by_destination <- delayed_flights %>%
group_by(destination, airline) %>%
summarise(total_count = sum(count)) %>%
arrange(desc(total_count))Overall, the Phoenix destination has highest number of delays (“AM West” airline) as well as the lowest number of delays (“ALASKA” airline).
| destination | airline | total_count |
|---|---|---|
| Phoenix | AM WEST | 415 |
| Seattle | ALASKA | 305 |
| San Francisco | AM WEST | 129 |
| Los Angeles | AM WEST | 117 |
| San Francisco | ALASKA | 102 |
| San Diego | AM WEST | 65 |
| Los Angeles | ALASKA | 62 |
| Seattle | AM WEST | 61 |
| San Diego | ALASKA | 20 |
| Phoenix | ALASKA | 12 |
Below is the graphical representation of delays per destination in descending order.
ggplot(by_destination, aes(x = reorder(destination, total_count), y = total_count, fill = airline)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
ggtitle("Total Flight Delays Per Destination Per Airline") +
labs(x = "Airport Destinations", y = "Total Delays")It was demonstrated that the packages “tidyr” and “dplyr” can be extremely useful to tidy up and transform the data.