Overview

This assignment considers a small untidy data set of the flight timeliness of two airlines, American West and Alaska to five West Coast cities in the United States. The assignment requires the data management of a file and the subsequent analysis of the data content.

library(tidyverse)
library(RCurl)

Data Management

I prepare a raw csv file in Excel based on the Assignment screenshot. The file is called “AirlineDelays.csv”. The command readr::read_csv allows the file to be loaded into a tibble.

rawFile = "AirlineDelays.csv"

z = read_csv(rawFile)   # Display the tibble
knitr::kable(z, caption = "Raw Data Initially is untidy")
Raw Data Initially is untidy
X1 X2 Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61

Wrangling to Tidy Structure

To wrangle this data set, we split the work into two phases. * A preliminary clean-up phase: Remove irrelevant data, plug in missing values, provide meaningful column names. * A tidy data organization phase when the scrubbed data can be converted to a tidy structure using tidyr.

The preliminary clean-up can be broken down further. * We observe there is a blank row (4th row of the original file) that needs to be removed. It can be identified by having NA in all fields. We use filter on Phoenix to exclude it.

  • Next, we observe the column names X1, X2 provided by the read_csv function are uninformative. The result is stored in tibble z1 below.

  • Lastly, we observe the rows for delayed flights have no identifying airline. Filling in this value using the sequential structure of the file allows us to “gather” the data.

z %>% filter( !is.na(Phoenix) ) %>% rename( airline = X1, status = X2) -> z1

z1[2,"airline"] = z1[1, "airline"]
z1[4,"airline"] = z1[3, "airline"]

knitr::kable(z1, caption="Headers Added, Missing Values Supplied, Blank Rows deleted")
Headers Added, Missing Values Supplied, Blank Rows deleted
airline status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

Building the Tidy Data

Now we come to the second phase of reordering the data into a tidy format. The desired format will consist of one table where an observation consists of: a single city, airline and its delayed and ontime flights. The variables consist of: airline, city, status (timeliness), ontime flights and delayed flights.

Later, we will supplement this table with derived statistics by using mutate.

z1 %>% gather( `Los Angeles`:`Seattle`, key = "airport", value = "num_flights" ) -> z2

Now we examine the gathered data set for a small subset of rows associated with Los Angeles below.

knitr::kable( filter( z2, airport=="Los Angeles") , 
              caption="Gathered the airport from columns to values")
Gathered the airport from columns to values
airline status airport num_flights
ALASKA on time Los Angeles 497
ALASKA delayed Los Angeles 62
AM WEST on time Los Angeles 694
AM WEST delayed Los Angeles 117

We now want to “spread” the table so that the number of delayed flights and the number of on-time flights are variables not observations.

z2 %>% spread( key = status, value = num_flights ) -> z3

knitr::kable(filter ( z3, airport== "Los Angeles"), caption="Spread the on-time and delayed flight to columns")
Spread the on-time and delayed flight to columns
airline airport delayed on time
ALASKA Los Angeles 62 497
AM WEST Los Angeles 117 694
z3 %>% rename( on_time = `on time`) -> z4

knitr::kable(z4, caption = "Tidy Data at last"  )
Tidy Data at last
airline airport delayed on_time
ALASKA Los Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San Diego 20 212
ALASKA San Francisco 102 503
ALASKA Seattle 305 1841
AM WEST Los Angeles 117 694
AM WEST Phoenix 415 4840
AM WEST San Diego 65 383
AM WEST San Francisco 129 320
AM WEST Seattle 61 201

Methodology

We are now ready to analyze the flights data. Let’s divide the analysis into 4 parts:

  1. Supplement existing data with delay ratios and total flights. 2.Descriptive statistics and sortings by relevant categories: city, airline, volumes
  2. Plots and visualizations of delay rates, and patterns or trends
  3. Explore and link with External Information: Economics, Geography
z4 %>% mutate( total_flights = delayed + on_time, 
               delay_rate = delayed / total_flights ) -> z5

knitr::kable( filter(z5, airport == "Los Angeles"), caption="Added 2 Columns")
Added 2 Columns
airline airport delayed on_time total_flights delay_rate
ALASKA Los Angeles 62 497 559 0.1109123
AM WEST Los Angeles 117 694 811 0.1442663

Results

Which airport and airline is busiest?

z5 %>% group_by(airport) %>% summarize( flights = sum(total_flights ), 
                                        delayed = sum(delayed) ,
                                        avg_delay = sum(delayed) / sum(total_flights)
                                        ) %>% arrange( desc(flights) ) -> s1

knitr::kable( s1, caption="Airports Ranked by Total Flights with Delay Rates")
Airports Ranked by Total Flights with Delay Rates
airport flights delayed avg_delay
Phoenix 5488 427 0.0778061
Seattle 2408 366 0.1519934
Los Angeles 1370 179 0.1306569
San Francisco 1054 231 0.2191651
San Diego 680 85 0.1250000

In the above table, we see that Phoenix is the busiest airport in the sample. The California cities are the least busy of the sample.

z5 %>% group_by(airline) %>% summarize( flights = sum(total_flights ), 
                                        delayed = sum(delayed) ,
                                        avg_delay = sum(delayed) / sum(total_flights)
                                        ) %>% arrange( desc(flights) ) -> s2

knitr::kable( s1, caption="Airlines Ranked by Total Flights with Delay Rates")
Airlines Ranked by Total Flights with Delay Rates
airport flights delayed avg_delay
Phoenix 5488 427 0.0778061
Seattle 2408 366 0.1519934
Los Angeles 1370 179 0.1306569
San Francisco 1054 231 0.2191651
San Diego 680 85 0.1250000

The above table shows that America West had more flights than Alaska Airlines in the sample. Moreover, America West’s overall delay rate was lower at 10.89% versus 13.27% for Alaska.

Which airline should I choose for ontime Flight?

We build the equivalent of an Excel pivot table in order to analyze and customize the analysis for each city.

From our tidy data in tibble z5, we pivot on airport and airline to show the delay_rate as the table value. Next, we add a custom field Winner which identifies for each city, which airline has the lower delay rate.

The surprising part of our analysis is that Alaska Airlines is the winner in every city listed. For example, in Los Angeles, Alaska has an 11.1% delay rate but America West is at 14.4%.

z5 %>% 
  select(airport, airline, delay_rate) %>% 
  spread(key = airline, value = delay_rate) %>% 
  mutate( Winner = if_else(ALASKA > `AM WEST`, 
                           "AM WEST", 
                           "ALASKA")) %>% 
  select( airport, Winner, everything()) -> s3

knitr::kable( s3, caption="Airport/Airline Individual Comparison", digits = 3)
Airport/Airline Individual Comparison
airport Winner ALASKA AM WEST
Los Angeles ALASKA 0.111 0.144
Phoenix ALASKA 0.052 0.079
San Diego ALASKA 0.086 0.145
San Francisco ALASKA 0.169 0.287
Seattle ALASKA 0.142 0.233

We see the per-airport delay rates of Alaska are lower on average than America West in the scatter plot below. Alaska points are in red. The only outlier is America West delay rate in Phoenix where America West has the most flights. That drives the average delay rate down across all of America West’s destinations where the delay rate is uniformly higher.

library(ggrepel)

ggplot(data=z5) + 
  geom_point(mapping = aes( x= total_flights, 
                            y = delay_rate, color=airline)) + 
  geom_text_repel( aes(label=airport, x=total_flights, y=delay_rate  ) ) +
  coord_cartesian(ylim=c(0,0.3))

In the chart below, we can see that all points are above the 45 degree line. This implies that by city, American West has longer delays than Alaska Airlines. We see that the deviation is worse for San Francisco and smallest for Los Angeles which is closest to the X-Y line.

ggplot(data=s3) +
  geom_point(mapping = aes( x=ALASKA, y = `AM WEST`) ) +
  coord_cartesian(xlim=c(0,0.3), ylim=c(0,0.3)) +
  geom_text_repel( aes(label=airport, x=ALASKA, y=`AM WEST`  ) ) +
  labs( x="Delay Rate (Alaska)", y="Delay Rate (Am West)") +
  geom_abline(intercept = 0, slope=1) + 
  ggtitle("Delay Rates: Alaska vs. Am West with X=Y line")

Discussion

Let’s connect this data to the outside world. Based on Wikipedia, America West was merged into American Airlines in 2013 and ceased operations as a distinct entity in 2015 when its successor US Airways brand was retired. [https://en.wikipedia.org/wiki/America_West_Airlines] We learn that its headquarters is in Tempe Arizona and its base of operations in Phoenix. This explains why the greatest number of flights was in Phoenix.

However, America West has a history of bankruptcies and low employee wages compared to its competitors. It required them to put 20% of their salary into company stock.

By contrast, Alaska Airlines was voted the number 1 airline for customer satisfaction for eleven years in a row by J. D. Power and Associates. [https://en.wikipedia.org/wiki/Alaska_Airlines]. This is attributed to allowing employees to be empowered to make things right such as giving vouchers, free food and perks to customers and because flights are ontime. This perception is part of the company’s marketing and branding.

Moreover, the airline’s headquarters is in Seattle. This is consistent with the data where Seattle has Alaska’s largest volume of flights. The lower rates of delays for Alaska across all 5 cities is consistent with the J. D. Power survey.

Conclusion

We conclude that the raw flight and delay data can be used to paint divergent marketing stories. In the aggregate, American West could be viewed as having a better ontime record. However, Alaska Airlines beats Am West in every city. We conclude the set of cities were cherry picked to allow the divergent conclusions.