Assignment Summary

The data for this assignment is loaded in a wide format, making it difficult to immediately perform exploratory work on the dataset. To address this the data will be reformatted from a wide to long format.

Exploratory work to address the following questions will be completed with the reformatted data.
Information about each airline:
1. What is the number of ontime and delayed flights for each airline?
2. What is the percentage of flight delays for each airline?

Information about each destination (regardless of airline)
3. What is the distribution of delays?
4. Which destinations have the most and least number of flights (highest volume)?

A conclusion section will summarize the analysis and make recommendations for further study.

Setup and load data

library(dplyr)
library(tidyr)
library(data.table)
library(ggplot2)
data <-  read.csv('flights.csv')
head(data,4)
##   ï..Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1     Alaska on time         497     221       212           503    1841
## 2     Alaska delayed          62      12        20           102     305
## 3     AmWest on time         694    4840       383           320     201
## 4     AmWest delayed         117     415        65           129      61

Tidy up the data

#lower case the columns and replace "." with "_"
names(data) <- tolower(names(data))
names(data) <- gsub(x = names(data), pattern = "\\.", replacement = "_")  
names(data) <- gsub(x = names(data), pattern = "\\__", replacement = "_")  
names(data) <- gsub(x = names(data), pattern = "\\ï_", replacement = "")  
data
##   airline  status los_angeles phoenix san_diego san_francisco seattle
## 1  Alaska on time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 3  AmWest on time         694    4840       383           320     201
## 4  AmWest delayed         117     415        65           129      61
#Stack the destination data
data2 <- pivot_longer(data,c(los_angeles, phoenix, san_diego, san_francisco, seattle), 'destination')
data2
## # A tibble: 20 x 4
##    airline status  destination   value
##    <fct>   <fct>   <chr>         <int>
##  1 Alaska  on time los_angeles     497
##  2 Alaska  on time phoenix         221
##  3 Alaska  on time san_diego       212
##  4 Alaska  on time san_francisco   503
##  5 Alaska  on time seattle        1841
##  6 Alaska  delayed los_angeles      62
##  7 Alaska  delayed phoenix          12
##  8 Alaska  delayed san_diego        20
##  9 Alaska  delayed san_francisco   102
## 10 Alaska  delayed seattle         305
## 11 AmWest  on time los_angeles     694
## 12 AmWest  on time phoenix        4840
## 13 AmWest  on time san_diego       383
## 14 AmWest  on time san_francisco   320
## 15 AmWest  on time seattle         201
## 16 AmWest  delayed los_angeles     117
## 17 AmWest  delayed phoenix         415
## 18 AmWest  delayed san_diego        65
## 19 AmWest  delayed san_francisco   129
## 20 AmWest  delayed seattle          61

Explore the Questions

1. What is the number of ontime and delayed flights for each airline?

total_ontime_delays <- select(data2, airline, status, value)
total_ontime_delays <- group_by(total_ontime_delays,airline, status) %>% summarise(flights = sum(value))
#ontime versus delayed in terms of number of flights per airline
ggplot(total_ontime_delays, aes(fill=status, y=flights, x=airline)) + 
  geom_bar(position="dodge", stat="identity") +
  scale_fill_manual(values=c('darkblue','darkgreen')) +
  geom_text(aes(label=flights), position=position_dodge(width=0.2), vjust=-0.25)

2. What is the percentage of flight delays for each airline?

delays_percent <- pivot_wider(total_ontime_delays, names_from = status, values_from = flights)
names(delays_percent) <- gsub(x = names(delays_percent), pattern = "\\ ", replacement = "_")  

delays_percent$percent_flights_delayed <- round(delays_percent$delayed/delays_percent$on_time,2)

delays_percent
## # A tibble: 2 x 4
## # Groups:   airline [2]
##   airline delayed on_time percent_flights_delayed
##   <fct>     <int>   <int>                   <dbl>
## 1 Alaska      501    3274                    0.15
## 2 AmWest      787    6438                    0.12

3. What is the distribution of delays?

data3 <- select(data2, destination, status, value)
#Ontime versus delayed as a percentage per destination
ggplot(data3, aes(fill=status, y=value, x=destination)) + 
    geom_bar(position="fill", stat="identity") + 
    scale_fill_manual(values=c('darkblue','darkgreen'))

4. Which destination have the most and least number of flights (highest volume)?

#ontime versus delayed in terms of number of flights per destination
ggplot(data3, aes(fill=status, y=value, x=destination), colors('blue','green')) + 
  geom_bar(position="stack", stat="identity") + 
  scale_fill_manual(values=c('darkblue','darkgreen'))

Conclusion

AmWest has almost twice as many flights (6,438 vs 3,274) as Alaska airlines, but both companies have a similar percentage of delays (12% Amwest vs 15% Alaska). While at first glance this percentage might not seem bad, it does lend itself to needing improvement if you consider total number of travelers delayed.

According to the site Travel Weekly, in 2017 the average number of passengers was 91 and Statista placed average capacity filled at around 82%. Multiplying (91 * (501 Alaska + 787 AmWest delays) * 0.82) is approximately 96,000 passengers. And this is just these two airlines!

Since the airlines have a similar delay percentage, lets take a look to see if some destinations have significantly better or worse rates of delay. Interesting to note is that Phoenix has the lowest percentage of delays and the highest number of flights at (8% and 5,488). San Francisco has the highest percentage of delays and the 2nd lowest number of flights (22% and 1,054).

One hypothesis from this data is that flight delays could go down if conditions at airports with higher percentages of delay would improve. Additional study is required to validate this hypothesis as this a very small cross-section of data from the full airline industry.

Recommendations for further study…

  1. Broaden this analysis to include a larger number of airlines and destinations to see if the hypothesis is valid
  2. Study conditions at the Phoenix and San Francisco airports to identify key differentiators between the two and the impact this could have on reducing delays.