The data for this assignment is loaded in a wide format, making it difficult to immediately perform exploratory work on the dataset. To address this the data will be reformatted from a wide to long format.
Exploratory work to address the following questions will be completed with the reformatted data. Information about each airline: 1. What is the number of ontime and delayed flights for each airline? 2. What is the percentage of flight delays for each airline?
Information about each destination (regardless of airline) 3. What is the distribution of delays? 4. Which destinations have the most and least number of flights (highest volume)?
A conclusion section will summarize the analysis and make recommendations for further study.
library(dplyr)
library(tidyr)
library(data.table)
library(ggplot2)
data <- read.csv('flights.csv')
head(data,4)
## ï..Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 3 AmWest on time 694 4840 383 320 201
## 4 AmWest delayed 117 415 65 129 61
#lower case the columns and replace "." with "_"
names(data) <- tolower(names(data))
names(data) <- gsub(x = names(data), pattern = "\\.", replacement = "_")
names(data) <- gsub(x = names(data), pattern = "\\__", replacement = "_")
names(data) <- gsub(x = names(data), pattern = "\\ï_", replacement = "")
data
## airline status los_angeles phoenix san_diego san_francisco seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 3 AmWest on time 694 4840 383 320 201
## 4 AmWest delayed 117 415 65 129 61
#Stack the destination data
data2 <- pivot_longer(data,c(los_angeles, phoenix, san_diego, san_francisco, seattle), 'destination')
data2
## # A tibble: 20 x 4
## airline status destination value
## <fct> <fct> <chr> <int>
## 1 Alaska on time los_angeles 497
## 2 Alaska on time phoenix 221
## 3 Alaska on time san_diego 212
## 4 Alaska on time san_francisco 503
## 5 Alaska on time seattle 1841
## 6 Alaska delayed los_angeles 62
## 7 Alaska delayed phoenix 12
## 8 Alaska delayed san_diego 20
## 9 Alaska delayed san_francisco 102
## 10 Alaska delayed seattle 305
## 11 AmWest on time los_angeles 694
## 12 AmWest on time phoenix 4840
## 13 AmWest on time san_diego 383
## 14 AmWest on time san_francisco 320
## 15 AmWest on time seattle 201
## 16 AmWest delayed los_angeles 117
## 17 AmWest delayed phoenix 415
## 18 AmWest delayed san_diego 65
## 19 AmWest delayed san_francisco 129
## 20 AmWest delayed seattle 61
1. What is the number of ontime and delayed flights for each airline?
total_ontime_delays <- select(data2, airline, status, value)
total_ontime_delays <- group_by(total_ontime_delays,airline, status) %>% summarise(flights = sum(value))
#ontime versus delayed in terms of number of flights per airline
ggplot(total_ontime_delays, aes(fill=status, y=flights, x=airline)) +
geom_bar(position="dodge", stat="identity") +
scale_fill_manual(values=c('darkblue','darkgreen')) +
geom_text(aes(label=flights), position=position_dodge(width=0.2), vjust=-0.25)
2. What is the percentage of flight delays for each airline?
delays_percent <- pivot_wider(total_ontime_delays, names_from = status, values_from = flights)
names(delays_percent) <- gsub(x = names(delays_percent), pattern = "\\ ", replacement = "_")
delays_percent$percent_flights_delayed <- round(delays_percent$delayed/delays_percent$on_time,2)
delays_percent
## # A tibble: 2 x 4
## # Groups: airline [2]
## airline delayed on_time percent_flights_delayed
## <fct> <int> <int> <dbl>
## 1 Alaska 501 3274 0.15
## 2 AmWest 787 6438 0.12
3. What is the distribution of delays?
data3 <- select(data2, destination, status, value)
#Ontime versus delayed as a percentage per destination
ggplot(data3, aes(fill=status, y=value, x=destination)) +
geom_bar(position="fill", stat="identity") +
scale_fill_manual(values=c('darkblue','darkgreen'))
4. Which destination have the most and least number of flights (highest volume)?
#ontime versus delayed in terms of number of flights per destination
ggplot(data3, aes(fill=status, y=value, x=destination), colors('blue','green')) +
geom_bar(position="stack", stat="identity") +
scale_fill_manual(values=c('darkblue','darkgreen'))
AmWest has almost twice as many flights (6,438 vs 3,274) as Alaska airlines, but both companies have a similar percentage of delays (12% Amwest vs 15% Alaska). While at first glance this percentage might not seem bad, it does lend itself to needing improvement if you consider total number of travelers delayed.
According to the site Travel Weekly, in 2017 the average number of passengers was 91 and Statista placed average capacity filled at around 82%. Multiplying (91 * (501 Alaska + 787 AmWest delays) * 0.82) is approximately 96,000 passengers. And this is just these two airlines!
Since the airlines have a similar delay percentage, lets take a look to see if some destinations have significantly better or worse rates of delay. Interesting to note is that Phoenix has the lowest percentage of delays and the highest number of flights at (8% and 5,488). San Francisco has the highest percentage of delays and the 2nd lowest number of flights (22% and 1,054).
One hypothesis from this data is that flight delays could go down if conditions at airports with higher percentages of delay would improve. Additional study is required to validate this hypothesis as this a very small cross-section of data from the full airline industry.