This assignment considers a small untidy data set of the flight timeliness of two airlines, American West and Alaska to five West Coast cities in the United States. The assignment requires the data management of a file and the subsequent analysis of the data content.
library(tidyverse)
library(RCurl)
I prepare a raw csv file in Excel based on the Assignment screenshot. The file is called “AirlineDelays.csv”. The command readr::read_csv allows the file to be loaded into a tibble.
rawFile = "AirlineDelays.csv"
z = read_csv(rawFile) # Display the tibble
knitr::kable(z, caption = "Raw Data Initially is untidy")
| X1 | X2 | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| NA | delayed | 62 | 12 | 20 | 102 | 305 |
| NA | NA | NA | NA | NA | NA | NA |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| NA | delayed | 117 | 415 | 65 | 129 | 61 |
To wrangle this data set, we split the work into two phases. * A preliminary clean-up phase: Remove irrelevant data, plug in missing values, provide meaningful column names. * A tidy data organization phase when the scrubbed data can be converted to a tidy structure using tidyr.
The preliminary clean-up can be broken down further. * We observe there is a blank row (4th row of the original file) that needs to be removed. It can be identified by having NA in all fields. We use filter on Phoenix to exclude it.
Next, we observe the column names X1, X2 provided by the read_csv function are uninformative. The result is stored in tibble z1 below.
Lastly, we observe the rows for delayed flights have no identifying airline. Filling in this value using the sequential structure of the file allows us to “gather” the data.
z %>% filter( !is.na(Phoenix) ) %>% rename( airline = X1, status = X2) -> z1
z1[2,"airline"] = z1[1, "airline"]
z1[4,"airline"] = z1[3, "airline"]
knitr::kable(z1, caption="Headers Added, Missing Values Supplied, Blank Rows deleted")
| airline | status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
Now we come to the second phase of reordering the data into a tidy format. The desired format will consist of one table where an observation consists of: a single city, airline and its delayed and ontime flights. The variables consist of: airline, city, status (timeliness), ontime flights and delayed flights.
Later, we will supplement this table with derived statistics by using mutate.
z1 %>% gather( `Los Angeles`:`Seattle`, key = "airport", value = "num_flights" ) -> z2
Now we examine the gathered data set for a small subset of rows associated with Los Angeles below.
knitr::kable( filter( z2, airport=="Los Angeles") ,
caption="Gathered the airport from columns to values")
| airline | status | airport | num_flights |
|---|---|---|---|
| ALASKA | on time | Los Angeles | 497 |
| ALASKA | delayed | Los Angeles | 62 |
| AM WEST | on time | Los Angeles | 694 |
| AM WEST | delayed | Los Angeles | 117 |
We now want to “spread” the table so that the number of delayed flights and the number of on-time flights are variables not observations.
z2 %>% spread( key = status, value = num_flights ) -> z3
knitr::kable(filter ( z3, airport== "Los Angeles"), caption="Spread the on-time and delayed flight to columns")
| airline | airport | delayed | on time |
|---|---|---|---|
| ALASKA | Los Angeles | 62 | 497 |
| AM WEST | Los Angeles | 117 | 694 |
z3 %>% rename( on_time = `on time`) -> z4
knitr::kable(z4, caption = "Tidy Data at last" )
| airline | airport | delayed | on_time |
|---|---|---|---|
| ALASKA | Los Angeles | 62 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | San Diego | 20 | 212 |
| ALASKA | San Francisco | 102 | 503 |
| ALASKA | Seattle | 305 | 1841 |
| AM WEST | Los Angeles | 117 | 694 |
| AM WEST | Phoenix | 415 | 4840 |
| AM WEST | San Diego | 65 | 383 |
| AM WEST | San Francisco | 129 | 320 |
| AM WEST | Seattle | 61 | 201 |
We are now ready to analyze the flights data. Let’s divide the analysis into 4 parts:
z4 %>% mutate( total_flights = delayed + on_time,
delay_rate = delayed / total_flights ) -> z5
knitr::kable( filter(z5, airport == "Los Angeles"), caption="Added 2 Columns")
| airline | airport | delayed | on_time | total_flights | delay_rate |
|---|---|---|---|---|---|
| ALASKA | Los Angeles | 62 | 497 | 559 | 0.1109123 |
| AM WEST | Los Angeles | 117 | 694 | 811 | 0.1442663 |
z5 %>% group_by(airport) %>% summarize( flights = sum(total_flights ),
delayed = sum(delayed) ,
avg_delay = sum(delayed) / sum(total_flights)
) %>% arrange( desc(flights) ) -> s1
knitr::kable( s1, caption="Airports Ranked by Total Flights with Delay Rates")
| airport | flights | delayed | avg_delay |
|---|---|---|---|
| Phoenix | 5488 | 427 | 0.0778061 |
| Seattle | 2408 | 366 | 0.1519934 |
| Los Angeles | 1370 | 179 | 0.1306569 |
| San Francisco | 1054 | 231 | 0.2191651 |
| San Diego | 680 | 85 | 0.1250000 |
In the above table, we see that Phoenix is the busiest airport in the sample. The California cities are the least busy of the sample.
z5 %>% group_by(airline) %>% summarize( flights = sum(total_flights ),
delayed = sum(delayed) ,
avg_delay = sum(delayed) / sum(total_flights)
) %>% arrange( desc(flights) ) -> s2
knitr::kable( s1, caption="Airlines Ranked by Total Flights with Delay Rates")
| airport | flights | delayed | avg_delay |
|---|---|---|---|
| Phoenix | 5488 | 427 | 0.0778061 |
| Seattle | 2408 | 366 | 0.1519934 |
| Los Angeles | 1370 | 179 | 0.1306569 |
| San Francisco | 1054 | 231 | 0.2191651 |
| San Diego | 680 | 85 | 0.1250000 |
The above table shows that America West had more flights than Alaska Airlines in the sample. Moreover, America West’s overall delay rate was lower at 10.89% versus 13.27% for Alaska.
We build the equivalent of an Excel pivot table in order to analyze and customize the analysis for each city.
From our tidy data in tibble z5, we pivot on airport and airline to show the delay_rate as the table value. Next, we add a custom field Winner which identifies for each city, which airline has the lower delay rate.
The surprising part of our analysis is that Alaska Airlines is the winner in every city listed. For example, in Los Angeles, Alaska has an 11.1% delay rate but America West is at 14.4%.
z5 %>%
select(airport, airline, delay_rate) %>%
spread(key = airline, value = delay_rate) %>%
mutate( Winner = if_else(ALASKA > `AM WEST`,
"AM WEST",
"ALASKA")) %>%
select( airport, Winner, everything()) -> s3
knitr::kable( s3, caption="Airport/Airline Individual Comparison", digits = 3)
| airport | Winner | ALASKA | AM WEST |
|---|---|---|---|
| Los Angeles | ALASKA | 0.111 | 0.144 |
| Phoenix | ALASKA | 0.052 | 0.079 |
| San Diego | ALASKA | 0.086 | 0.145 |
| San Francisco | ALASKA | 0.169 | 0.287 |
| Seattle | ALASKA | 0.142 | 0.233 |
We see the per-airport delay rates of Alaska are lower on average than America West in the scatter plot below. Alaska points are in red. The only outlier is America West delay rate in Phoenix where America West has the most flights. That drives the average delay rate down across all of America West’s destinations where the delay rate is uniformly higher.
library(ggrepel)
ggplot(data=z5) +
geom_point(mapping = aes( x= total_flights,
y = delay_rate, color=airline)) +
geom_text_repel( aes(label=airport, x=total_flights, y=delay_rate ) ) +
coord_cartesian(ylim=c(0,0.3))
In the chart below, we can see that all points are above the 45 degree line. This implies that by city, American West has longer delays than Alaska Airlines. We see that the deviation is worse for San Francisco and smallest for Los Angeles which is closest to the X-Y line.
ggplot(data=s3) +
geom_point(mapping = aes( x=ALASKA, y = `AM WEST`) ) +
coord_cartesian(xlim=c(0,0.3), ylim=c(0,0.3)) +
geom_text_repel( aes(label=airport, x=ALASKA, y=`AM WEST` ) ) +
labs( x="Delay Rate (Alaska)", y="Delay Rate (Am West)") +
geom_abline(intercept = 0, slope=1) +
ggtitle("Delay Rates: Alaska vs. Am West with X=Y line")
Let’s connect this data to the outside world. Based on Wikipedia, America West was merged into American Airlines in 2013 and ceased operations as a distinct entity in 2015 when its successor US Airways brand was retired. [https://en.wikipedia.org/wiki/America_West_Airlines] We learn that its headquarters is in Tempe Arizona and its base of operations in Phoenix. This explains why the greatest number of flights was in Phoenix.
However, America West has a history of bankruptcies and low employee wages compared to its competitors. It required them to put 20% of their salary into company stock.
By contrast, Alaska Airlines was voted the number 1 airline for customer satisfaction for eleven years in a row by J. D. Power and Associates. [https://en.wikipedia.org/wiki/Alaska_Airlines]. This is attributed to allowing employees to be empowered to make things right such as giving vouchers, free food and perks to customers and because flights are ontime. This perception is part of the company’s marketing and branding.
Moreover, the airline’s headquarters is in Seattle. This is consistent with the data where Seattle has Alaska’s largest volume of flights. The lower rates of delays for Alaska across all 5 cities is consistent with the J. D. Power survey.
We conclude that the raw flight and delay data can be used to paint divergent marketing stories. In the aggregate, American West could be viewed as having a better ontime record. However, Alaska Airlines beats Am West in every city. We conclude the set of cities were cherry picked to allow the divergent conclusions.