Week 5 Assignment

Description

The assignment is to load in an “untidy” .CSV file and transform it into a tidy dataset. The data contains the arrival data for two airlines. After the data has been wrangled I am to preform an analysis comparing the arrival delays for the two airlines, and publish my code and findings to rpubs and GitHub.

Data Acquisition

I have created a .CSV file that is identical to the assignment and have it hosted in my GitHub repository. I will read in the .CSV file and display it.

untidy_df <- read.csv('https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/flights.csv')

Here’s what the untidy data frame.

X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
delayed 62 12 20 102 305
NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
delayed 117 415 65 129 61

Data Cleaning

The Plan

Before starting on the data cleaning I think it is important to identify what needs to change in the data set. The name the first two variables needs to change. There is a blank row between the two airlines that needs to change. The dots for the airport names need to be replaced with spaces. This will evenutally become a categorical variable. The first to variables data types need to be changed from factors to characters. The first variable has some blanks that need to be filled too.

Once these changes are made the data needs to be normalized using the gather() command from tidyr. Knowing that I will want to create some visualizations I will convert the airline, arrival type and airport variables back into factors. It will make things easier to work with.

The Code

library(dplyr)
library(tidyr)

tidy_df <- untidy_df %>%
  rename(airline = X,
         arrival_type = X.1,
         "Los Angeles" = Los.Angeles,
         "San Diego" = San.Diego,
         "San Francisco" = San.Francisco)%>%
  mutate(airline = as.character(airline),
         arrival_type = as.character(arrival_type)) %>%
  na.omit() %>% # Drop the empty row
  mutate(airline = ifelse(nchar(airline)==0, lag(airline), airline)) %>% # Fill in the holes
  gather(., airport, flights, -airline, -arrival_type) %>% # Transform from wide to long
  mutate(airline = as.factor(airline),
         arrival_type = as.factor(arrival_type),
         airport = as.factor(airport))

Here’s what the table looks like.

airline arrival_type airport flights
ALASKA on time Los Angeles 497
ALASKA delayed Los Angeles 62
AM WEST on time Los Angeles 694
AM WEST delayed Los Angeles 117
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12
AM WEST on time Phoenix 4840
AM WEST delayed Phoenix 415
ALASKA on time San Diego 212
ALASKA delayed San Diego 20
AM WEST on time San Diego 383
AM WEST delayed San Diego 65
ALASKA on time San Francisco 503
ALASKA delayed San Francisco 102
AM WEST on time San Francisco 320
AM WEST delayed San Francisco 129
ALASKA on time Seattle 1841
ALASKA delayed Seattle 305
AM WEST on time Seattle 201
AM WEST delayed Seattle 61

Analysis

There are a few questions that I want to ask this tidy data set. But before I begin I will compute a few totals, and load ggplot2

library(ggplot2)

total_flights_by_airline <- tidy_df %>%
  group_by(airline) %>%
  summarise(total_flights = sum(flights)) %>%
  ungroup()

total_flights_by_airport <- tidy_df %>%
  group_by(airport) %>%
  summarise(total_flights = sum(flights)) %>%
  ungroup()

total_flights_by_arrival_type <- tidy_df %>%
  group_by(arrival_type) %>%
  summarise(total_flights = sum(flights)) %>%
  ungroup()

Airline Examination

I want to begin by examining the airlines. What share of the flights are on time for each airline?

airline arrival_type flights total_flights share
ALASKA delayed 501 3775 0.1327
ALASKA on time 3274 3775 0.8673
AM WEST delayed 787 7225 0.1089
AM WEST on time 6438 7225 0.8911

It looks like AM West is on time 89% of the time on average. Alaska airlines is on time about 87% of the time. One also observes that AM West has more flights than Alaska airlines.

Is this difference in the on time/delayed share due to the airports the airlines fly to? Let’s examine the share of times the airline flies to an airport

tidy_df %>%
  group_by(airline, airport) %>%
  summarise(flights = sum(flights)) %>%
  merge(total_flights_by_airport) %>%
  mutate(share = flights / total_flights) %>%
  ggplot(aes(airport, airline)) +
  geom_tile(aes(fill = share), show.legend = FALSE) +
  geom_text(aes(label = round(share*100,0))) +
  scale_fill_distiller(palette = "Spectral") + 
  coord_fixed() +
  ggtitle("Percent of Flights by Airline and Airport") +
  theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))

So Alaska airlines flies to San Fancisco and Seatle more tha AM West. Let’s look at this picture again by comparing across all flights the airline takes.

tidy_df %>%
  group_by(airline, airport) %>%
  summarise(flights = sum(flights)) %>%
  merge(total_flights_by_airline) %>%
  mutate(share = flights / total_flights) %>%
  ggplot(aes(airport, airline)) +
  geom_tile(aes(fill = share), show.legend = FALSE) +
  geom_text(aes(label = round(share*100,0))) +
  scale_fill_distiller(palette = "Spectral") + 
  coord_fixed() +
  ggtitle("Percent of Flights by Airline and Airports") +
  theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))

This shows a litle different picture. 57% of Alaska airlines flights are to Seatle, which makes sense. AM West flies to Phoenix most often then LA.

Airport Examination

So lets take a closer look at the airports Let’s see if there is a difference between the arrival times and the airport.

tidy_df %>%
  group_by(airport, arrival_type) %>%
  summarise(flights = sum(flights)) %>%
  merge(total_flights_by_airport) %>%
  mutate(share = flights / total_flights) %>%
  ggplot(aes(airport, arrival_type)) +
  geom_tile(aes(fill = share), show.legend = FALSE) +
  geom_text(aes(label = round(share*100,0))) +
  scale_fill_distiller(palette = "Spectral") + 
  coord_fixed() +
  ggtitle("Percent of Flights by Arrival Type and Airport") +
  theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))

The Phoenix airport has the best record for on time arrivals and San Francisco has the worst.

Let’s look at the on time arrivals. How do these two airlines stack up when we look at the airports. If each airlines are the same.

on_time_flights_by_airport <- tidy_df %>%
  filter(arrival_type == "on time") %>%
  group_by(airport) %>%
  summarise(total_flights = sum(flights)) %>%
  ungroup()

tidy_df %>%
  filter(arrival_type == "on time") %>%
  group_by(airline, airport) %>%
  summarise(flights = sum(flights)) %>%
  merge(on_time_flights_by_airport) %>%
  mutate(share = flights / total_flights) %>%
  ggplot(aes(airport, airline)) +
  geom_tile(aes(fill = share), show.legend = FALSE) +
  geom_text(aes(label = round(share*100,0))) +
  scale_fill_distiller(palette = "Spectral") + 
  coord_fixed() +
  ggtitle("Percent of On Time Flights by Airline and Airport") +
  theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))

Most of the on time flights to Phoenix are due to AM West, and most of the on time flights to Seatle are due to Alaska Airlines. Let’s look at the delayed flights.

delayed_flights_by_airport <- tidy_df %>%
  filter(arrival_type == "delayed") %>%
  group_by(airport) %>%
  summarise(total_flights = sum(flights)) %>%
  ungroup()

tidy_df %>%
  filter(arrival_type == "delayed") %>%
  group_by(airline, airport) %>%
  summarise(flights = sum(flights)) %>%
  merge(delayed_flights_by_airport) %>%
  mutate(share = flights / total_flights) %>%
  ggplot(aes(airport, airline)) +
  geom_tile(aes(fill = share), show.legend = FALSE) +
  geom_text(aes(label = round(share*100,0))) +
  scale_fill_distiller(palette = "Spectral") + 
  coord_fixed() +
  ggtitle("Percent of Delayed Flights by Airline and Airport") +
  theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))

Most of the delayed flights to Phoenix are due to AM West, and most of the on time flights to Seatle are due to Alaska Airlines. Let’s look at the delayed flights.

What If Scenario

Now I want to do a what if scenario. What if these two airlines don’t fly to Phoenix or Seattle. How would that change these airline on time arrivals?

what_if <- tidy_df %>%
  filter(!airport %in% c("Phoenix", "Seatle"))

what_if_total_flights_by_airline <- what_if %>%
  group_by(airline) %>%
  summarise(total_flights = sum(flights)) %>%
  ungroup()

temp <- what_if %>%
  group_by(airline, arrival_type) %>%
  summarise(flights = sum(flights)) %>%
  merge(what_if_total_flights_by_airline) %>%
  mutate(what_if_share = flights / total_flights) %>%
  select(airline, arrival_type, what_if_share)

scenario <- tidy_df %>%
  group_by(airline, arrival_type) %>%
  summarise(flights = sum(flights)) %>%
  merge(total_flights_by_airline) %>%
  mutate(share = flights / total_flights) %>%
  select(airline, arrival_type, share) %>%
  merge(temp) %>%
  mutate(delta = what_if_share - share)
airline arrival_type share what_if_share delta
ALASKA delayed 0.1327 0.1381 0.005342
ALASKA on time 0.8673 0.8619 -0.005342
AM WEST delayed 0.1089 0.1888 0.07991
AM WEST on time 0.8911 0.8112 -0.07991

We see that this would not change the airline’s ratings much. Alaska Airlines is basically unchange and AM West changed by less than 1 percentage point.

What if they only flew to Phoenix or Seatle? What would that do?

what_if <- tidy_df %>%
  filter(airport %in% c("Phoenix", "Seatle"))

what_if_total_flights_by_airline <- what_if %>%
  group_by(airline) %>%
  summarise(total_flights = sum(flights)) %>%
  ungroup()

temp <- what_if %>%
  group_by(airline, arrival_type) %>%
  summarise(flights = sum(flights)) %>%
  merge(what_if_total_flights_by_airline) %>%
  mutate(what_if_share = flights / total_flights) %>%
  select(airline, arrival_type, what_if_share)

scenario <- tidy_df %>%
  group_by(airline, arrival_type) %>%
  summarise(flights = sum(flights)) %>%
  merge(total_flights_by_airline) %>%
  mutate(share = flights / total_flights) %>%
  select(airline, arrival_type, share) %>%
  merge(temp) %>%
  mutate(delta = what_if_share - share)
airline arrival_type share what_if_share delta
ALASKA delayed 0.1327 0.0515 -0.08121
ALASKA on time 0.8673 0.9485 0.08121
AM WEST delayed 0.1089 0.07897 -0.02995
AM WEST on time 0.8911 0.921 0.02995

That would give Alaska Airlines on time arrivals a boost of 8 percentage points and AM West a 3 percentage point boost.

Optimization

Let’s suppose that these two airlines live and die by their on time arrival rating. What set of airports would then need to fly to to get their highest rating?

I will use the combn function to build out the list of all possible combinations. I will loop through these combinations and run what if scenarios. If the on time arrival share is bigger than the

# Initial settings for on time arrival share
alaska_on_time <- 0
am_west_on_time <- 0

# Get a vector of airports
airports <- tidy_df$airport %>%
  unique() %>%
  as.vector()

# Loop through all combinations
for(m in length(airports):1){
  airport_combinations <- combn(airports, m)
  # Run a what if scenario for the combination
  for(col in 1:dim(airport_combinations)[2]){
    what_if <- tidy_df %>%
      filter(airport %in% airport_combinations[, col])
    
    what_if_total_flights_by_airline <- what_if %>%
      group_by(airline) %>%
      summarise(total_flights = sum(flights)) %>%
      ungroup()
    
    temp <- what_if %>%
      filter(arrival_type == "on time") %>%
      group_by(airline) %>%
      summarise(flights = sum(flights)) %>%
      merge(what_if_total_flights_by_airline) %>%
      mutate(what_if_share = flights / total_flights) %>%
      select(airline, what_if_share)
    
    # See if the on time share is larger and if so save the airport set 
    if(temp[1,2] > alaska_on_time){
      alaska_on_time <- temp[1, 2]
      alaska_optimal <- airport_combinations[, col]
    }
    
    if(temp[2,2] > am_west_on_time){
      am_west_on_time <- temp[2, 2]
      am_west_optimal <- airport_combinations[, col]
    }
  }
}

So currently Alaska airlines and AM West has an on time arrival 86.73% and 89.11% respectively. If they selected the optimal set of airports to fly into the on time arrival would increase to 94.85% and 92.1%.

Alaska Airlines would only fly to Phoenix and AM West would only fly to Phoenix.

Mike Silva

September 28, 2018