Week 5 Assignment
Description
The assignment is to load in an “untidy” .CSV file and transform it into a tidy dataset. The data contains the arrival data for two airlines. After the data has been wrangled I am to preform an analysis comparing the arrival delays for the two airlines, and publish my code and findings to rpubs and GitHub.
Data Acquisition
I have created a .CSV file that is identical to the assignment and have it hosted in my GitHub repository. I will read in the .CSV file and display it.
untidy_df <- read.csv('https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/flights.csv')
Here’s what the untidy data frame.
X | X.1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
delayed | 62 | 12 | 20 | 102 | 305 | |
NA | NA | NA | NA | NA | ||
AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
delayed | 117 | 415 | 65 | 129 | 61 |
Data Cleaning
The Plan
Before starting on the data cleaning I think it is important to identify what needs to change in the data set. The name the first two variables needs to change. There is a blank row between the two airlines that needs to change. The dots for the airport names need to be replaced with spaces. This will evenutally become a categorical variable. The first to variables data types need to be changed from factors to characters. The first variable has some blanks that need to be filled too.
Once these changes are made the data needs to be normalized using the gather() command from tidyr. Knowing that I will want to create some visualizations I will convert the airline, arrival type and airport variables back into factors. It will make things easier to work with.
The Code
library(dplyr)
library(tidyr)
tidy_df <- untidy_df %>%
rename(airline = X,
arrival_type = X.1,
"Los Angeles" = Los.Angeles,
"San Diego" = San.Diego,
"San Francisco" = San.Francisco)%>%
mutate(airline = as.character(airline),
arrival_type = as.character(arrival_type)) %>%
na.omit() %>% # Drop the empty row
mutate(airline = ifelse(nchar(airline)==0, lag(airline), airline)) %>% # Fill in the holes
gather(., airport, flights, -airline, -arrival_type) %>% # Transform from wide to long
mutate(airline = as.factor(airline),
arrival_type = as.factor(arrival_type),
airport = as.factor(airport))
Here’s what the table looks like.
airline | arrival_type | airport | flights |
---|---|---|---|
ALASKA | on time | Los Angeles | 497 |
ALASKA | delayed | Los Angeles | 62 |
AM WEST | on time | Los Angeles | 694 |
AM WEST | delayed | Los Angeles | 117 |
ALASKA | on time | Phoenix | 221 |
ALASKA | delayed | Phoenix | 12 |
AM WEST | on time | Phoenix | 4840 |
AM WEST | delayed | Phoenix | 415 |
ALASKA | on time | San Diego | 212 |
ALASKA | delayed | San Diego | 20 |
AM WEST | on time | San Diego | 383 |
AM WEST | delayed | San Diego | 65 |
ALASKA | on time | San Francisco | 503 |
ALASKA | delayed | San Francisco | 102 |
AM WEST | on time | San Francisco | 320 |
AM WEST | delayed | San Francisco | 129 |
ALASKA | on time | Seattle | 1841 |
ALASKA | delayed | Seattle | 305 |
AM WEST | on time | Seattle | 201 |
AM WEST | delayed | Seattle | 61 |
Analysis
There are a few questions that I want to ask this tidy data set. But before I begin I will compute a few totals, and load ggplot2
library(ggplot2)
total_flights_by_airline <- tidy_df %>%
group_by(airline) %>%
summarise(total_flights = sum(flights)) %>%
ungroup()
total_flights_by_airport <- tidy_df %>%
group_by(airport) %>%
summarise(total_flights = sum(flights)) %>%
ungroup()
total_flights_by_arrival_type <- tidy_df %>%
group_by(arrival_type) %>%
summarise(total_flights = sum(flights)) %>%
ungroup()
Airline Examination
I want to begin by examining the airlines. What share of the flights are on time for each airline?
airline | arrival_type | flights | total_flights | share |
---|---|---|---|---|
ALASKA | delayed | 501 | 3775 | 0.1327 |
ALASKA | on time | 3274 | 3775 | 0.8673 |
AM WEST | delayed | 787 | 7225 | 0.1089 |
AM WEST | on time | 6438 | 7225 | 0.8911 |
It looks like AM West is on time 89% of the time on average. Alaska airlines is on time about 87% of the time. One also observes that AM West has more flights than Alaska airlines.
Is this difference in the on time/delayed share due to the airports the airlines fly to? Let’s examine the share of times the airline flies to an airport
tidy_df %>%
group_by(airline, airport) %>%
summarise(flights = sum(flights)) %>%
merge(total_flights_by_airport) %>%
mutate(share = flights / total_flights) %>%
ggplot(aes(airport, airline)) +
geom_tile(aes(fill = share), show.legend = FALSE) +
geom_text(aes(label = round(share*100,0))) +
scale_fill_distiller(palette = "Spectral") +
coord_fixed() +
ggtitle("Percent of Flights by Airline and Airport") +
theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))
So Alaska airlines flies to San Fancisco and Seatle more tha AM West. Let’s look at this picture again by comparing across all flights the airline takes.
tidy_df %>%
group_by(airline, airport) %>%
summarise(flights = sum(flights)) %>%
merge(total_flights_by_airline) %>%
mutate(share = flights / total_flights) %>%
ggplot(aes(airport, airline)) +
geom_tile(aes(fill = share), show.legend = FALSE) +
geom_text(aes(label = round(share*100,0))) +
scale_fill_distiller(palette = "Spectral") +
coord_fixed() +
ggtitle("Percent of Flights by Airline and Airports") +
theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))
This shows a litle different picture. 57% of Alaska airlines flights are to Seatle, which makes sense. AM West flies to Phoenix most often then LA.
Airport Examination
So lets take a closer look at the airports Let’s see if there is a difference between the arrival times and the airport.
tidy_df %>%
group_by(airport, arrival_type) %>%
summarise(flights = sum(flights)) %>%
merge(total_flights_by_airport) %>%
mutate(share = flights / total_flights) %>%
ggplot(aes(airport, arrival_type)) +
geom_tile(aes(fill = share), show.legend = FALSE) +
geom_text(aes(label = round(share*100,0))) +
scale_fill_distiller(palette = "Spectral") +
coord_fixed() +
ggtitle("Percent of Flights by Arrival Type and Airport") +
theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))
The Phoenix airport has the best record for on time arrivals and San Francisco has the worst.
Let’s look at the on time arrivals. How do these two airlines stack up when we look at the airports. If each airlines are the same.
on_time_flights_by_airport <- tidy_df %>%
filter(arrival_type == "on time") %>%
group_by(airport) %>%
summarise(total_flights = sum(flights)) %>%
ungroup()
tidy_df %>%
filter(arrival_type == "on time") %>%
group_by(airline, airport) %>%
summarise(flights = sum(flights)) %>%
merge(on_time_flights_by_airport) %>%
mutate(share = flights / total_flights) %>%
ggplot(aes(airport, airline)) +
geom_tile(aes(fill = share), show.legend = FALSE) +
geom_text(aes(label = round(share*100,0))) +
scale_fill_distiller(palette = "Spectral") +
coord_fixed() +
ggtitle("Percent of On Time Flights by Airline and Airport") +
theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))
Most of the on time flights to Phoenix are due to AM West, and most of the on time flights to Seatle are due to Alaska Airlines. Let’s look at the delayed flights.
delayed_flights_by_airport <- tidy_df %>%
filter(arrival_type == "delayed") %>%
group_by(airport) %>%
summarise(total_flights = sum(flights)) %>%
ungroup()
tidy_df %>%
filter(arrival_type == "delayed") %>%
group_by(airline, airport) %>%
summarise(flights = sum(flights)) %>%
merge(delayed_flights_by_airport) %>%
mutate(share = flights / total_flights) %>%
ggplot(aes(airport, airline)) +
geom_tile(aes(fill = share), show.legend = FALSE) +
geom_text(aes(label = round(share*100,0))) +
scale_fill_distiller(palette = "Spectral") +
coord_fixed() +
ggtitle("Percent of Delayed Flights by Airline and Airport") +
theme(panel.grid.major=element_blank(), axis.title.x=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank(), plot.title = element_text(hjust = 0.5))
Most of the delayed flights to Phoenix are due to AM West, and most of the on time flights to Seatle are due to Alaska Airlines. Let’s look at the delayed flights.
What If Scenario
Now I want to do a what if scenario. What if these two airlines don’t fly to Phoenix or Seattle. How would that change these airline on time arrivals?
what_if <- tidy_df %>%
filter(!airport %in% c("Phoenix", "Seatle"))
what_if_total_flights_by_airline <- what_if %>%
group_by(airline) %>%
summarise(total_flights = sum(flights)) %>%
ungroup()
temp <- what_if %>%
group_by(airline, arrival_type) %>%
summarise(flights = sum(flights)) %>%
merge(what_if_total_flights_by_airline) %>%
mutate(what_if_share = flights / total_flights) %>%
select(airline, arrival_type, what_if_share)
scenario <- tidy_df %>%
group_by(airline, arrival_type) %>%
summarise(flights = sum(flights)) %>%
merge(total_flights_by_airline) %>%
mutate(share = flights / total_flights) %>%
select(airline, arrival_type, share) %>%
merge(temp) %>%
mutate(delta = what_if_share - share)
airline | arrival_type | share | what_if_share | delta |
---|---|---|---|---|
ALASKA | delayed | 0.1327 | 0.1381 | 0.005342 |
ALASKA | on time | 0.8673 | 0.8619 | -0.005342 |
AM WEST | delayed | 0.1089 | 0.1888 | 0.07991 |
AM WEST | on time | 0.8911 | 0.8112 | -0.07991 |
We see that this would not change the airline’s ratings much. Alaska Airlines is basically unchange and AM West changed by less than 1 percentage point.
What if they only flew to Phoenix or Seatle? What would that do?
what_if <- tidy_df %>%
filter(airport %in% c("Phoenix", "Seatle"))
what_if_total_flights_by_airline <- what_if %>%
group_by(airline) %>%
summarise(total_flights = sum(flights)) %>%
ungroup()
temp <- what_if %>%
group_by(airline, arrival_type) %>%
summarise(flights = sum(flights)) %>%
merge(what_if_total_flights_by_airline) %>%
mutate(what_if_share = flights / total_flights) %>%
select(airline, arrival_type, what_if_share)
scenario <- tidy_df %>%
group_by(airline, arrival_type) %>%
summarise(flights = sum(flights)) %>%
merge(total_flights_by_airline) %>%
mutate(share = flights / total_flights) %>%
select(airline, arrival_type, share) %>%
merge(temp) %>%
mutate(delta = what_if_share - share)
airline | arrival_type | share | what_if_share | delta |
---|---|---|---|---|
ALASKA | delayed | 0.1327 | 0.0515 | -0.08121 |
ALASKA | on time | 0.8673 | 0.9485 | 0.08121 |
AM WEST | delayed | 0.1089 | 0.07897 | -0.02995 |
AM WEST | on time | 0.8911 | 0.921 | 0.02995 |
That would give Alaska Airlines on time arrivals a boost of 8 percentage points and AM West a 3 percentage point boost.
Optimization
Let’s suppose that these two airlines live and die by their on time arrival rating. What set of airports would then need to fly to to get their highest rating?
I will use the combn function to build out the list of all possible combinations. I will loop through these combinations and run what if scenarios. If the on time arrival share is bigger than the
# Initial settings for on time arrival share
alaska_on_time <- 0
am_west_on_time <- 0
# Get a vector of airports
airports <- tidy_df$airport %>%
unique() %>%
as.vector()
# Loop through all combinations
for(m in length(airports):1){
airport_combinations <- combn(airports, m)
# Run a what if scenario for the combination
for(col in 1:dim(airport_combinations)[2]){
what_if <- tidy_df %>%
filter(airport %in% airport_combinations[, col])
what_if_total_flights_by_airline <- what_if %>%
group_by(airline) %>%
summarise(total_flights = sum(flights)) %>%
ungroup()
temp <- what_if %>%
filter(arrival_type == "on time") %>%
group_by(airline) %>%
summarise(flights = sum(flights)) %>%
merge(what_if_total_flights_by_airline) %>%
mutate(what_if_share = flights / total_flights) %>%
select(airline, what_if_share)
# See if the on time share is larger and if so save the airport set
if(temp[1,2] > alaska_on_time){
alaska_on_time <- temp[1, 2]
alaska_optimal <- airport_combinations[, col]
}
if(temp[2,2] > am_west_on_time){
am_west_on_time <- temp[2, 2]
am_west_optimal <- airport_combinations[, col]
}
}
}
So currently Alaska airlines and AM West has an on time arrival 86.73% and 89.11% respectively. If they selected the optimal set of airports to fly into the on time arrival would increase to 94.85% and 92.1%.
Alaska Airlines would only fly to Phoenix and AM West would only fly to Phoenix.