Week 5 Assignment

Description
Data Acquisition
Data Cleaning
The Code
Analysis

Description

The assignment is to load in an “untidy” .CSV file and transform it into a tidy dataset. The data contains the arrival data for two airlines. After the data has been wrangled I am to preform an analysis comparing the arrival delays for the two airlines, and publish my code and findings to rpubs and GitHub.

Data Acquisition

I have created a .CSV file that is identical to the assignment and have it hosted in my GitHub repository. I will read in the .CSV file and display it.

untidy_df <- read.csv('https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/flights.csv')

Here’s what the untidy data frame.

X	X.1	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
	delayed	62	12	20	102	305
		NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
	delayed	117	415	65	129	61

Data Cleaning

The Plan

Before starting on the data cleaning I think it is important to identify what needs to change in the data set. The name the first two variables needs to change. There is a blank row between the two airlines that needs to change. The dots for the airport names need to be replaced with spaces. This will evenutally become a categorical variable. The first to variables data types need to be changed from factors to characters. The first variable has some blanks that need to be filled too.

Once these changes are made the data needs to be normalized using the gather() command from tidyr. Knowing that I will want to create some visualizations I will convert the airline, arrival type and airport variables back into factors. It will make things easier to work with.

The Code

library(dplyr)
library(tidyr)

tidy_df <- untidy_df %>%
  rename(airline = X,
         arrival_type = X.1,
         "Los Angeles" = Los.Angeles,
         "San Diego" = San.Diego,
         "San Francisco" = San.Francisco)%>%
  mutate(airline = as.character(airline),
         arrival_type = as.character(arrival_type)) %>%
  na.omit() %>% # Drop the empty row
  mutate(airline = ifelse(nchar(airline)==0, lag(airline), airline)) %>% # Fill in the holes
  gather(., airport, flights, -airline, -arrival_type) %>% # Transform from wide to long
  mutate(airline = as.factor(airline),
         arrival_type = as.factor(arrival_type),
         airport = as.factor(airport))

Here’s what the table looks like.

airline	arrival_type	airport	flights
ALASKA	on time	Los Angeles	497
ALASKA	delayed	Los Angeles	62
AM WEST	on time	Los Angeles	694
AM WEST	delayed	Los Angeles	117
ALASKA	on time	Phoenix	221
ALASKA	delayed	Phoenix	12
AM WEST	on time	Phoenix	4840
AM WEST	delayed	Phoenix	415
ALASKA	on time	San Diego	212
ALASKA	delayed	San Diego	20
AM WEST	on time	San Diego	383
AM WEST	delayed	San Diego	65
ALASKA	on time	San Francisco	503
ALASKA	delayed	San Francisco	102
AM WEST	on time	San Francisco	320
AM WEST	delayed	San Francisco	129
ALASKA	on time	Seattle	1841
ALASKA	delayed	Seattle	305
AM WEST	on time	Seattle	201
AM WEST	delayed	Seattle	61

Analysis

There are a few questions that I want to ask this tidy data set. But before I begin I will compute a few totals, and load ggplot2

library(ggplot2)

total_flights_by_airline <- tidy_df %>%
  group_by(airline) %>%
  summarise(total_flights = sum(flights)) %>%
  ungroup()

total_flights_by_airport <- tidy_df %>%
  group_by(airport) %>%
  summarise(total_flights = sum(flights)) %>%
  ungroup()

total_flights_by_arrival_type <- tidy_df %>%
  group_by(arrival_type) %>%
  summarise(total_flights = sum(flights)) %>%
  ungroup()

Airline Examination

I want to begin by examining the airlines. What share of the flights are on time for each airline?

airline	arrival_type	flights	total_flights	share
ALASKA	delayed	501	3775	0.1327
ALASKA	on time	3274	3775	0.8673
AM WEST	delayed	787	7225	0.1089
AM WEST	on time	6438	7225	0.8911

It looks like AM West is on time 89% of the time on average. Alaska airlines is on time about 87% of the time. One also observes that AM West has more flights than Alaska airlines.

Is this difference in the on time/delayed share due to the airports the airlines fly to? Let’s examine the share of times the airline flies to an airport

tidy_df %>%
  group_by(airline, airport) %>%
  summarise(flights = sum(flights)) %>%
  merge(total_flights_by_airport) %>%
  mutate(share = flights / total_flights) %>%
  ggplot(aes(airport, airline)) +
  geom_tile(aes(fill = share), show.legend = FALSE) +
  geom_text(aes(label = round(share*100,0))) +
  scale_fill_distiller(palette = "Spectral") + 
  coord_fixed() +
  ggtitle("Percent of Flights by Airline and Airport") +
  theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))

So Alaska airlines flies to San Fancisco and Seatle more tha AM West. Let’s look at this picture again by comparing across all flights the airline takes.

tidy_df %>%
  group_by(airline, airport) %>%
  summarise(flights = sum(flights)) %>%
  merge(total_flights_by_airline) %>%
  mutate(share = flights / total_flights) %>%
  ggplot(aes(airport, airline)) +
  geom_tile(aes(fill = share), show.legend = FALSE) +
  geom_text(aes(label = round(share*100,0))) +
  scale_fill_distiller(palette = "Spectral") + 
  coord_fixed() +
  ggtitle("Percent of Flights by Airline and Airports") +
  theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))

This shows a litle different picture. 57% of Alaska airlines flights are to Seatle, which makes sense. AM West flies to Phoenix most often then LA.

Airport Examination

So lets take a closer look at the airports Let’s see if there is a difference between the arrival times and the airport.

tidy_df %>%
  group_by(airport, arrival_type) %>%
  summarise(flights = sum(flights)) %>%
  merge(total_flights_by_airport) %>%
  mutate(share = flights / total_flights) %>%
  ggplot(aes(airport, arrival_type)) +
  geom_tile(aes(fill = share), show.legend = FALSE) +
  geom_text(aes(label = round(share*100,0))) +
  scale_fill_distiller(palette = "Spectral") + 
  coord_fixed() +
  ggtitle("Percent of Flights by Arrival Type and Airport") +
  theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))

The Phoenix airport has the best record for on time arrivals and San Francisco has the worst.

Let’s look at the on time arrivals. How do these two airlines stack up when we look at the airports. If each airlines are the same.

on_time_flights_by_airport <- tidy_df %>%
  filter(arrival_type == "on time") %>%
  group_by(airport) %>%
  summarise(total_flights = sum(flights)) %>%
  ungroup()

tidy_df %>%
  filter(arrival_type == "on time") %>%
  group_by(airline, airport) %>%
  summarise(flights = sum(flights)) %>%
  merge(on_time_flights_by_airport) %>%
  mutate(share = flights / total_flights) %>%
  ggplot(aes(airport, airline)) +
  geom_tile(aes(fill = share), show.legend = FALSE) +
  geom_text(aes(label = round(share*100,0))) +
  scale_fill_distiller(palette = "Spectral") + 
  coord_fixed() +
  ggtitle("Percent of On Time Flights by Airline and Airport") +
  theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))

Most of the on time flights to Phoenix are due to AM West, and most of the on time flights to Seatle are due to Alaska Airlines. Let’s look at the delayed flights.

delayed_flights_by_airport <- tidy_df %>%
  filter(arrival_type == "delayed") %>%
  group_by(airport) %>%
  summarise(total_flights = sum(flights)) %>%
  ungroup()

tidy_df %>%
  filter(arrival_type == "delayed") %>%
  group_by(airline, airport) %>%
  summarise(flights = sum(flights)) %>%
  merge(delayed_flights_by_airport) %>%
  mutate(share = flights / total_flights) %>%
  ggplot(aes(airport, airline)) +
  geom_tile(aes(fill = share), show.legend = FALSE) +
  geom_text(aes(label = round(share*100,0))) +
  scale_fill_distiller(palette = "Spectral") + 
  coord_fixed() +
  ggtitle("Percent of Delayed Flights by Airline and Airport") +
  theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))

Most of the delayed flights to Phoenix are due to AM West, and most of the on time flights to Seatle are due to Alaska Airlines. Let’s look at the delayed flights.

What If Scenario

Now I want to do a what if scenario. What if these two airlines don’t fly to Phoenix or Seattle. How would that change these airline on time arrivals?

what_if <- tidy_df %>%
  filter(!airport %in% c("Phoenix", "Seatle"))

what_if_total_flights_by_airline <- what_if %>%
  group_by(airline) %>%
  summarise(total_flights = sum(flights)) %>%
  ungroup()

temp <- what_if %>%
  group_by(airline, arrival_type) %>%
  summarise(flights = sum(flights)) %>%
  merge(what_if_total_flights_by_airline) %>%
  mutate(what_if_share = flights / total_flights) %>%
  select(airline, arrival_type, what_if_share)

scenario <- tidy_df %>%
  group_by(airline, arrival_type) %>%
  summarise(flights = sum(flights)) %>%
  merge(total_flights_by_airline) %>%
  mutate(share = flights / total_flights) %>%
  select(airline, arrival_type, share) %>%
  merge(temp) %>%
  mutate(delta = what_if_share - share)

airline	arrival_type	share	what_if_share	delta
ALASKA	delayed	0.1327	0.1381	0.005342
ALASKA	on time	0.8673	0.8619	-0.005342
AM WEST	delayed	0.1089	0.1888	0.07991
AM WEST	on time	0.8911	0.8112	-0.07991

We see that this would not change the airline’s ratings much. Alaska Airlines is basically unchange and AM West changed by less than 1 percentage point.

What if they only flew to Phoenix or Seatle? What would that do?

what_if <- tidy_df %>%
  filter(airport %in% c("Phoenix", "Seatle"))

what_if_total_flights_by_airline <- what_if %>%
  group_by(airline) %>%
  summarise(total_flights = sum(flights)) %>%
  ungroup()

temp <- what_if %>%
  group_by(airline, arrival_type) %>%
  summarise(flights = sum(flights)) %>%
  merge(what_if_total_flights_by_airline) %>%
  mutate(what_if_share = flights / total_flights) %>%
  select(airline, arrival_type, what_if_share)

scenario <- tidy_df %>%
  group_by(airline, arrival_type) %>%
  summarise(flights = sum(flights)) %>%
  merge(total_flights_by_airline) %>%
  mutate(share = flights / total_flights) %>%
  select(airline, arrival_type, share) %>%
  merge(temp) %>%
  mutate(delta = what_if_share - share)

airline	arrival_type	share	what_if_share	delta
ALASKA	delayed	0.1327	0.0515	-0.08121
ALASKA	on time	0.8673	0.9485	0.08121
AM WEST	delayed	0.1089	0.07897	-0.02995
AM WEST	on time	0.8911	0.921	0.02995

That would give Alaska Airlines on time arrivals a boost of 8 percentage points and AM West a 3 percentage point boost.

Optimization

Let’s suppose that these two airlines live and die by their on time arrival rating. What set of airports would then need to fly to to get their highest rating?

I will use the combn function to build out the list of all possible combinations. I will loop through these combinations and run what if scenarios. If the on time arrival share is bigger than the

# Initial settings for on time arrival share
alaska_on_time <- 0
am_west_on_time <- 0

# Get a vector of airports
airports <- tidy_df$airport %>%
  unique() %>%
  as.vector()

# Loop through all combinations
for(m in length(airports):1){
  airport_combinations <- combn(airports, m)
  # Run a what if scenario for the combination
  for(col in 1:dim(airport_combinations)[2]){
    what_if <- tidy_df %>%
      filter(airport %in% airport_combinations[, col])
    
    what_if_total_flights_by_airline <- what_if %>%
      group_by(airline) %>%
      summarise(total_flights = sum(flights)) %>%
      ungroup()
    
    temp <- what_if %>%
      filter(arrival_type == "on time") %>%
      group_by(airline) %>%
      summarise(flights = sum(flights)) %>%
      merge(what_if_total_flights_by_airline) %>%
      mutate(what_if_share = flights / total_flights) %>%
      select(airline, what_if_share)
    
    # See if the on time share is larger and if so save the airport set 
    if(temp[1,2] > alaska_on_time){
      alaska_on_time <- temp[1, 2]
      alaska_optimal <- airport_combinations[, col]
    }
    
    if(temp[2,2] > am_west_on_time){
      am_west_on_time <- temp[2, 2]
      am_west_optimal <- airport_combinations[, col]
    }
  }
}

So currently Alaska airlines and AM West has an on time arrival 86.73% and 89.11% respectively. If they selected the optimal set of airports to fly into the on time arrival would increase to 94.85% and 92.1%.

Alaska Airlines would only fly to Phoenix and AM West would only fly to Phoenix.