suppressMessages(library(tidyverse))
As a first step before conducting our analysis, we should check whether the data is tidy and make the any necessary transformations.
In this particular case there are two issues:
We have a column for each city, which spreads variables over multiple columns. In order for the data to be tidy, each variable should be listed on a single column.
The “delayed” and “on time” variables are embedded in rows. We need to move these variables to their own columns.
The code for these two transformations is found below.
df1 <-
read.table("flights.csv", sep = ",", header = TRUE) %>% # Import the CSV data
as_tibble() %>% # Convert to a tibble for easier on-screen reading
gather("city", "count", 3:7) %>% # 1. Move cities to rows, leaving flight counts on single column
spread("Status", "count") %>% # 2. Moves the status variables to their own columns
rename(airline = Airline, # make all variables lower-case
ontime = `on time`) %>% # remove the space for easier manipulation
mutate(totflights = ontime + delayed, # We create totflights by adding ontime and delayed
delayedprct = delayed / totflights) %>% #Delayedprct is calculated from delayed and totflights
arrange(desc(delayedprct)) # We sort by delayedprct
(df1)
## # A tibble: 10 x 6
## airline city delayed ontime totflights delayedprct
## <fct> <chr> <int> <int> <int> <dbl>
## 1 Amwest San.Francisco 129 320 449 0.287
## 2 Amwest Seattle 61 201 262 0.233
## 3 Alaska San.Francisco 102 503 605 0.169
## 4 Amwest San.Diego 65 383 448 0.145
## 5 Amwest Los.Angeles 117 694 811 0.144
## 6 Alaska Seattle 305 1841 2146 0.142
## 7 Alaska Los.Angeles 62 497 559 0.111
## 8 Alaska San.Diego 20 212 232 0.0862
## 9 Amwest Phoenix 415 4840 5255 0.0790
## 10 Alaska Phoenix 12 221 233 0.0515
To conduct our analysis, we’ll first create a graph that shows all flights for each destinations, separated by airline and status. For this we’ll need to create an auxiliary data frame (graphdf) which has the status variables as rows.
graphdf <-
read.table("flights.csv", sep = ",", header = TRUE) %>%
as_tibble() %>%
gather("city", "count", 3:7) %>%
rename(airline = Airline)
head(graphdf)
## # A tibble: 6 x 4
## airline Status city count
## <fct> <fct> <chr> <int>
## 1 Alaska on time Los.Angeles 497
## 2 Alaska delayed Los.Angeles 62
## 3 Amwest on time Los.Angeles 694
## 4 Amwest delayed Los.Angeles 117
## 5 Alaska on time Phoenix 221
## 6 Alaska delayed Phoenix 12
Now that we have graphdf we can use GGPlot2 to create the stacked bar plot.
p <- ggplot(graphdf, aes(x = airline, y = count,fill = Status))
p <- p + scale_fill_manual(values=c('#8D0013','#01426A'))
p <- p + geom_bar(stat = "identity", width = 0.95, position = "stack")
p <- p + facet_grid(. ~ city)
p
From the graph we can compare the absolute number of flights for each airline and destination but it’s difficult to objectively compare the arrival delays of each company.
Using the original data frame (df1) we can generate a graph which focuses on the delay rate.
p <- ggplot(data = df1, aes(x = airline, y = delayedprct, fill = airline))
p <- p + scale_y_continuous(labels = scales::percent)
p <- p + theme(legend.position = "none")
#p <- p + scale_fill_manual(values=c('#01426A','#8D0013'))
p <- p + geom_bar(stat = "identity", width = 0.95, position = "stack")
p <- p + facet_grid(. ~ city)
p
It’s now clear that Amwest has a higher percentage of delayed flights, compared to Alaska. At one extreme, the Amwest flight to San Francisco is delayed almost 30% of the time while the most punctual flight is to Phoenix, via Alaska. As this point, it would seem like Alaska is clearly the superior airline since it has a lower proportion of delayed flights for all destinations.
df3 <- df1 %>%
group_by(airline) %>%
summarise(delayed = sum(delayed),
ontime = sum(ontime)) %>%
mutate(totflights = ontime + delayed,
delayedprct = delayed / totflights)
(df3)
## # A tibble: 2 x 5
## airline delayed ontime totflights delayedprct
## <fct> <int> <int> <int> <dbl>
## 1 Alaska 501 3274 3775 0.133
## 2 Amwest 787 6438 7225 0.109
p <- ggplot(data = df3, aes(x = airline, y = delayedprct, fill = airline))
p <- p + scale_y_continuous(labels = scales::percent)
p <- p + theme(legend.position = "none")
p <- p + geom_bar(stat = "identity", width = 0.95, position = "stack")
p
If we aggregate all the flights by airline, we find an interesting reversal; Alaska actually has a higher proportion of delayed flights. After studying the data more carefully, we find that Amwest has an unusually high number of flights to Phoenix (5255) with a relatively low delayed flight rate of just 7.9%.
df4 <-
df1 %>% group_by(city) %>%
summarise(delayed = sum(delayed),
ontime = sum(ontime)) %>%
mutate(
totflights = ontime + delayed,
delayedprct = delayed / totflights) %>%
arrange(desc(delayedprct))
(df4)
## # A tibble: 5 x 5
## city delayed ontime totflights delayedprct
## <chr> <int> <int> <int> <dbl>
## 1 San.Francisco 231 823 1054 0.219
## 2 Seattle 366 2042 2408 0.152
## 3 Los.Angeles 179 1191 1370 0.131
## 4 San.Diego 85 595 680 0.125
## 5 Phoenix 427 5061 5488 0.0778
p <- ggplot(data = df4, aes(x = city, y = delayedprct, fill = city))
p <- p + scale_y_continuous(labels = scales::percent)
p <- p + theme(legend.position = "none")
p <- p + geom_bar(stat = "identity", width = 0.95, position = "stack")
p
Finally, the graph above shows us the delay rates for each destination. As we discussed above, Amwest the majority of fights to Phoenix with a low delay rate. San Francisco has the highest delayed flight rate and both airlines perform comparably.
From a passenger’s perspective, it seems that Alaska is the better choice since it has a lower delayed flight percentage for all destinations. As a company, Amwest seems to be more successful since it runs almost twice as many flights while maintaining a lower delayed flight rate overall.