DATA 607 Assignment 5 Tidying

Overview
Data Management
- Wrangling to Tidy Structure
- Building the Tidy Data
Methodology
Results
- Which airport and airline is busiest?
- Which airline should I choose for ontime Flight?
Discussion
Conclusion

Overview

This assignment considers a small untidy data set of the flight timeliness of two airlines, American West and Alaska to five West Coast cities in the United States. The assignment requires the data management of a file and the subsequent analysis of the data content.

library(tidyverse)
library(RCurl)

Data Management

I prepare a raw csv file in Excel based on the Assignment screenshot. The file is called “AirlineDelays.csv”. The command readr::read_csv allows the file to be loaded into a tibble.

rawFile = "AirlineDelays.csv"

z = read_csv(rawFile)   # Display the tibble
knitr::kable(z, caption = "Raw Data Initially is untidy")

Raw Data Initially is untidy
X1	X2	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
NA	delayed	62	12	20	102	305
NA	NA	NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
NA	delayed	117	415	65	129	61

Wrangling to Tidy Structure

To wrangle this data set, we split the work into two phases. * A preliminary clean-up phase: Remove irrelevant data, plug in missing values, provide meaningful column names. * A tidy data organization phase when the scrubbed data can be converted to a tidy structure using tidyr.

The preliminary clean-up can be broken down further. * We observe there is a blank row (4th row of the original file) that needs to be removed. It can be identified by having NA in all fields. We use filter on Phoenix to exclude it.

Next, we observe the column names X1, X2 provided by the read_csv function are uninformative. The result is stored in tibble z1 below.
Lastly, we observe the rows for delayed flights have no identifying airline. Filling in this value using the sequential structure of the file allows us to “gather” the data.

z %>% filter( !is.na(Phoenix) ) %>% rename( airline = X1, status = X2) -> z1

z1[2,"airline"] = z1[1, "airline"]
z1[4,"airline"] = z1[3, "airline"]

knitr::kable(z1, caption="Headers Added, Missing Values Supplied, Blank Rows deleted")

Headers Added, Missing Values Supplied, Blank Rows deleted
airline	status	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
ALASKA	delayed	62	12	20	102	305
AM WEST	on time	694	4840	383	320	201
AM WEST	delayed	117	415	65	129	61

Building the Tidy Data

Now we come to the second phase of reordering the data into a tidy format. The desired format will consist of one table where an observation consists of: a single city, airline and its delayed and ontime flights. The variables consist of: airline, city, status (timeliness), ontime flights and delayed flights.

Later, we will supplement this table with derived statistics by using mutate.

z1 %>% gather( `Los Angeles`:`Seattle`, key = "airport", value = "num_flights" ) -> z2

Now we examine the gathered data set for a small subset of rows associated with Los Angeles below.

knitr::kable( filter( z2, airport=="Los Angeles") , 
              caption="Gathered the airport from columns to values")

Gathered the airport from columns to values
airline	status	airport	num_flights
ALASKA	on time	Los Angeles	497
ALASKA	delayed	Los Angeles	62
AM WEST	on time	Los Angeles	694
AM WEST	delayed	Los Angeles	117

We now want to “spread” the table so that the number of delayed flights and the number of on-time flights are variables not observations.

z2 %>% spread( key = status, value = num_flights ) -> z3

knitr::kable(filter ( z3, airport== "Los Angeles"), caption="Spread the on-time and delayed flight to columns")

Spread the on-time and delayed flight to columns
airline	airport	delayed	on time
ALASKA	Los Angeles	62	497
AM WEST	Los Angeles	117	694

z3 %>% rename( on_time = `on time`) -> z4

knitr::kable(z4, caption = "Tidy Data at last"  )

Tidy Data at last
airline	airport	delayed	on_time
ALASKA	Los Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San Diego	20	212
ALASKA	San Francisco	102	503
ALASKA	Seattle	305	1841
AM WEST	Los Angeles	117	694
AM WEST	Phoenix	415	4840
AM WEST	San Diego	65	383
AM WEST	San Francisco	129	320
AM WEST	Seattle	61	201

Methodology

We are now ready to analyze the flights data. Let’s divide the analysis into 4 parts:

Supplement existing data with delay ratios and total flights. 2.Descriptive statistics and sortings by relevant categories: city, airline, volumes
Plots and visualizations of delay rates, and patterns or trends
Explore and link with External Information: Economics, Geography

z4 %>% mutate( total_flights = delayed + on_time, 
               delay_rate = delayed / total_flights ) -> z5

knitr::kable( filter(z5, airport == "Los Angeles"), caption="Added 2 Columns")

Added 2 Columns
airline	airport	delayed	on_time	total_flights	delay_rate
ALASKA	Los Angeles	62	497	559	0.1109123
AM WEST	Los Angeles	117	694	811	0.1442663

Results

Which airport and airline is busiest?

z5 %>% group_by(airport) %>% summarize( flights = sum(total_flights ), 
                                        delayed = sum(delayed) ,
                                        avg_delay = sum(delayed) / sum(total_flights)
                                        ) %>% arrange( desc(flights) ) -> s1

knitr::kable( s1, caption="Airports Ranked by Total Flights with Delay Rates")

Airports Ranked by Total Flights with Delay Rates
airport	flights	delayed	avg_delay
Phoenix	5488	427	0.0778061
Seattle	2408	366	0.1519934
Los Angeles	1370	179	0.1306569
San Francisco	1054	231	0.2191651
San Diego	680	85	0.1250000

In the above table, we see that Phoenix is the busiest airport in the sample. The California cities are the least busy of the sample.

z5 %>% group_by(airline) %>% summarize( flights = sum(total_flights ), 
                                        delayed = sum(delayed) ,
                                        avg_delay = sum(delayed) / sum(total_flights)
                                        ) %>% arrange( desc(flights) ) -> s2

knitr::kable( s1, caption="Airlines Ranked by Total Flights with Delay Rates")

Airlines Ranked by Total Flights with Delay Rates
airport	flights	delayed	avg_delay
Phoenix	5488	427	0.0778061
Seattle	2408	366	0.1519934
Los Angeles	1370	179	0.1306569
San Francisco	1054	231	0.2191651
San Diego	680	85	0.1250000

The above table shows that America West had more flights than Alaska Airlines in the sample. Moreover, America West’s overall delay rate was lower at 10.89% versus 13.27% for Alaska.

Which airline should I choose for ontime Flight?

We build the equivalent of an Excel pivot table in order to analyze and customize the analysis for each city.

From our tidy data in tibble z5, we pivot on airport and airline to show the delay_rate as the table value. Next, we add a custom field Winner which identifies for each city, which airline has the lower delay rate.

The surprising part of our analysis is that Alaska Airlines is the winner in every city listed. For example, in Los Angeles, Alaska has an 11.1% delay rate but America West is at 14.4%.

z5 %>% 
  select(airport, airline, delay_rate) %>% 
  spread(key = airline, value = delay_rate) %>% 
  mutate( Winner = if_else(ALASKA > `AM WEST`, 
                           "AM WEST", 
                           "ALASKA")) %>% 
  select( airport, Winner, everything()) -> s3

knitr::kable( s3, caption="Airport/Airline Individual Comparison", digits = 3)

Airport/Airline Individual Comparison
airport	Winner	ALASKA	AM WEST
Los Angeles	ALASKA	0.111	0.144
Phoenix	ALASKA	0.052	0.079
San Diego	ALASKA	0.086	0.145
San Francisco	ALASKA	0.169	0.287
Seattle	ALASKA	0.142	0.233

We see the per-airport delay rates of Alaska are lower on average than America West in the scatter plot below. Alaska points are in red. The only outlier is America West delay rate in Phoenix where America West has the most flights. That drives the average delay rate down across all of America West’s destinations where the delay rate is uniformly higher.

library(ggrepel)

ggplot(data=z5) + 
  geom_point(mapping = aes( x= total_flights, 
                            y = delay_rate, color=airline)) + 
  geom_text_repel( aes(label=airport, x=total_flights, y=delay_rate  ) ) +
  coord_cartesian(ylim=c(0,0.3))

In the chart below, we can see that all points are above the 45 degree line. This implies that by city, American West has longer delays than Alaska Airlines. We see that the deviation is worse for San Francisco and smallest for Los Angeles which is closest to the X-Y line.

ggplot(data=s3) +
  geom_point(mapping = aes( x=ALASKA, y = `AM WEST`) ) +
  coord_cartesian(xlim=c(0,0.3), ylim=c(0,0.3)) +
  geom_text_repel( aes(label=airport, x=ALASKA, y=`AM WEST`  ) ) +
  labs( x="Delay Rate (Alaska)", y="Delay Rate (Am West)") +
  geom_abline(intercept = 0, slope=1) + 
  ggtitle("Delay Rates: Alaska vs. Am West with X=Y line")

Discussion

Let’s connect this data to the outside world. Based on Wikipedia, America West was merged into American Airlines in 2013 and ceased operations as a distinct entity in 2015 when its successor US Airways brand was retired. [https://en.wikipedia.org/wiki/America_West_Airlines] We learn that its headquarters is in Tempe Arizona and its base of operations in Phoenix. This explains why the greatest number of flights was in Phoenix.

However, America West has a history of bankruptcies and low employee wages compared to its competitors. It required them to put 20% of their salary into company stock.

By contrast, Alaska Airlines was voted the number 1 airline for customer satisfaction for eleven years in a row by J. D. Power and Associates. [https://en.wikipedia.org/wiki/Alaska_Airlines]. This is attributed to allowing employees to be empowered to make things right such as giving vouchers, free food and perks to customers and because flights are ontime. This perception is part of the company’s marketing and branding.

Moreover, the airline’s headquarters is in Seattle. This is consistent with the data where Seattle has Alaska’s largest volume of flights. The lower rates of delays for Alaska across all 5 cities is consistent with the J. D. Power survey.

Conclusion

We conclude that the raw flight and delay data can be used to paint divergent marketing stories. In the aggregate, American West could be viewed as having a better ontime record. However, Alaska Airlines beats Am West in every city. We conclude the set of cities were cherry picked to allow the divergent conclusions.