Which airline should you fly on to avoid significant delays?

Synopsis

Objective

The project report is intended to perform basic data exploratory analysis on the Flights Cancellation dataset published by US Department of Transportation. The analysis aims at revealing hidden data trends and also facts like most visited Destination, common Cancellation reasons, dependencies among various variables if any, most reliable carrier etc.

Data Source

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers.The Flights delay and cancellation data was collected and published by the DOT’s Bureau of Transportation Statistics which included summary information on the number of on-time, delayed, canceled, and diverted flights etc.

Data Format

The following times are in the xx:yy - hour:minute format (e.g. 1536 means 3:36pm, 345 means 3:45am, 16 means 00:16am)

  • scheduled_departure
  • departure_time
  • scheduled_arrival
  • arrival_time
  • wheels_off
  • wheels_on

The following times are in minutes format (negatives mean actual_time is ahead of scheduled_time for the absolute value of that negative number)

  • arrival_delay
  • departure_delay
  • taxi_in
  • taxi_out
  • scheduled_time
  • elapsed_time
  • air_time

Distance is in miles

Packages Required

For this project, the majority of packages used are the standard ones for collecting, tidying, and analyzing data.

## Load Required Packages ##
library(tidyverse)
library(stringr) ## Numerous functions for text manipulation
library(data.table) # useful for functions like fread
library(tidyverse) ## Tidying data
library(DT) ## Output data in nice format
library(knitr) ## Format tables
library(kableExtra) ## Add extra formatting options to tables 
library("wordcloud") ## to create a word cloud
library("RColorBrewer") ## color palletes
library(plotly) ## Used to plot interactive publication quality charts
library(treemap) ## to create a tree map
library(maps) ## to create maps
library(geosphere) 
library(highcharter) ## to create interactive treemaps

Data Cleaning

The process of data preparation involved following steps:

  1. Data was imported from three files named ‘Airports’, ‘Airlines’ and ‘Flights’ using fread.
  2. Structure (str), Glimpse and Summary were examined of the raw data to understand it.
  3. The data was then examined for missing values.
  4. In Flights, there were 12 columns for which the values were NA.
  5. A refined dataset was created by omitting NA values from 8 out of 12 columns. The reasons for exclusion of the 4 columns were first that, the values were missing in these for more than 80% of the rows and secondly, the columns were like AIR_SYSTEM_DELAY, SECURITY_DELAY which can hold NA values in case there is no delay from the original schedule.
  6. It was further observed that majority of values for Cancellation reasons were empty.
  7. Data for Airlines and Airports were also examined for NA values and it was observed that Airport had 3 NA values for latitude and longitude. These values were not removed as latitude and longitude were not required and also can be restored by searching for airports.
#Importing files into R
Airlines <- fread("airlines.csv")
Airport <- fread("airports.csv")
Flights <- fread("flights.csv")

# removing missing values
Flights_refined <- Flights %>%
  filter(!is.na(DEPARTURE_TIME) & !is.na(WHEELS_OFF) & !is.na(AIR_TIME) & !is.na(TAXI_IN) | DIVERTED == 1) %>%
  as.data.table()

The structure and summary of the refined set can be found as below:

Summary of Attributes
Attributes Class Missing_Values Min_Value Max_Value
YEAR integer 0 2015 2015
MONTH integer 0 1 12
DAY integer 0 1 31
DAY_OF_WEEK integer 0 1 7
AIRLINE character 0 AA WN
FLIGHT_NUMBER integer 0 1 9320
TAIL_NUMBER character 0 7819A N9EAMQ
ORIGIN_AIRPORT character 0 10135 YUM
DESTINATION_AIRPORT character 0 10135 YUM
SCHEDULED_DEPARTURE integer 0 1 2359
DEPARTURE_TIME integer 0 1 2400
DEPARTURE_DELAY integer 0 -82 1988
TAXI_OUT integer 0 1 225
WHEELS_OFF integer 0 1 2400
SCHEDULED_TIME integer 1 NA NA
ELAPSED_TIME integer 15187 NA NA
AIR_TIME integer 15187 NA NA
DISTANCE integer 0 31 4983
WHEELS_ON integer 2629 NA NA
TAXI_IN integer 2629 NA NA
SCHEDULED_ARRIVAL integer 0 1 2400
ARRIVAL_TIME integer 2629 NA NA
ARRIVAL_DELAY integer 15187 NA NA
DIVERTED integer 0 0 1
CANCELLED integer 0 0 0
CANCELLATION_REASON character 0
AIR_SYSTEM_DELAY integer 4665756 NA NA
SECURITY_DELAY integer 4665756 NA NA
AIRLINE_DELAY integer 4665756 NA NA
LATE_AIRCRAFT_DELAY integer 4665756 NA NA
WEATHER_DELAY integer 4665756 NA NA

Minimum and maximum Values for Character Variables will reflect less meaning

Exploratory Data Analysis

The data for flights was explored for hidden trends and useful insights on the following aspects:

  1. Type of Airlines
  2. Air Routes for flights
  3. Time of Year

Analysis by Airlines

There were a total of 14 unique carriers which had flights operating in US in 2015:

  1. United Air Lines Inc.
  2. American Airlines Inc.
  3. US Airways Inc.
  4. Frontier Airlines Inc.
  5. JetBlue Airways
  6. Skywest Airlines Inc.
  7. Alaska Airlines Inc.
  8. Spirit Air Lines
  9. Southwest Airlines Co.
  10. Delta Air Lines Inc.
  11. Atlantic Southeast Airlines
  12. Hawaiian Airlines Inc.
  13. American Eagle Airlines Inc.
  14. Virgin America
  • Airline Carrier by number of flights : The data was anlayzed for the carriers wiith respect to number of flights operated by them. It was observed that the Southwest Airlines Co.had maximum flights operating accross US.
  Airline_max_flights <- Flights_refined %>%
       group_by(AIRLINE) %>%
       summarize(no_flights = n()) %>% 
       select(AIRLINE,no_flights) %>%
       left_join(Airlines,by = c("AIRLINE" = "IATA_CODE")) %>%
       select(AIRLINE.y, no_flights)
     
     hchart(Airline_max_flights, "treemap", hcaes(x = AIRLINE.y, value = no_flights, color =no_flights)) %>% 
       hc_colorAxis(stops = color_stops(n = 10, colors = c("#2471A3", "#D5D8DC", "#FDE725"))) %>% 
       hc_title(text = "Operating Airlines by no of Flights") %>% 
       hc_add_theme(hc_theme_google()) %>% 
       hc_credits(enabled = TRUE, text = "Source : FAO", style = list(fontSize = "10px"))
  • Network of Flights : The data was then visualized by each carrier on the network of their flights over all the cities in US. Among all the below four were found to be region specific:

sample code for each Airline

Flights_network <- Flights_refined %>%
  select(AIRLINE,ORIGIN_AIRPORT,DESTINATION_AIRPORT) %>%
  left_join(Airlines,by=c("AIRLINE" = "IATA_CODE")) %>%
  select(AIRLINE.y,ORIGIN_AIRPORT,DESTINATION_AIRPORT)
names(Flights_network)[1] <- "AIRLINE"

Flights_network<- Flights_network%>%
  left_join(Airport, by=c("ORIGIN_AIRPORT"="IATA_CODE")) %>%
  filter(!is.na(AIRPORT)) 
names(Flights_network)[names(Flights_network) == 'LATITUDE'] <- 'ORIGIN_LATITUDE'
names(Flights_network)[names(Flights_network) == 'LONGITUDE'] <- 'ORIGIN_LONGITUDE'

Flights_network <- Flights_network%>%
    left_join(Airport, by=c("DESTINATION_AIRPORT"="IATA_CODE")) 
  names(Flights_network)[names(Flights_network) == 'LATITUDE'] <- 'DESTINATION_LATITUDE'
  names(Flights_network)[names(Flights_network) == 'LONGITUDE'] <- 'DESTINATION_LONGITUDE'

Flights_network<-  Flights_network%>%
  group_by(AIRLINE,CITY.y,CITY.x,ORIGIN_LONGITUDE,ORIGIN_LATITUDE,DESTINATION_LATITUDE,DESTINATION_LONGITUDE) %>%
  summarize(no_flights = n()) %>%
  filter(!is.na(ORIGIN_LONGITUDE) & !is.na(DESTINATION_LONGITUDE))


xlim <- c(-176.64603, -56.601563)
ylim <- c(12.039321, 71.856229)

map("world", col="#191919", fill=TRUE, bg="#000000", lwd=0.05, xlim=xlim, ylim=ylim)

pal <- colorRampPalette(c("#333333", "white", "#1292db"))
colors <- pal(100)

a<- Flights_network %>%
  filter(AIRLINE == "Virgin America")

for (j in 1:nrow(a)) {
  
  inter <- gcIntermediate( c(a[j,]$ORIGIN_LONGITUDE,a[j,]$ORIGIN_LATITUDE), c(a[j,]$DESTINATION_LONGITUDE,a[j,]$DESTINATION_LATITUDE), n=100, addStartEnd=TRUE)
  colindex <- round( (a[j,]$no_flights / max(a$no_flights)) * length(colors) )
  lines(inter, col=colors[colindex], lwd=0.8)
}
Southwest Airlines

Frontier Airlines

Alaska Airlines

Virgin America

American Airlines

American Eagle Airlines

Delta Air Lines

JetBlue Airways

Skywest Airlines

United Air Lines

Hawaiian Airlines

US Airways

  • Diverted Flights by Carrier :All the flights operated by the above Airlines were observed on the aspect of whether they were diverted from their destination or not.
   diverted_flights <- Flights_refined %>%
                  filter(DIVERTED > 0 ) %>%
    left_join(Airlines,by=c("AIRLINE" = "IATA_CODE")) %>%
    group_by(AIRLINE.y) %>%
    summarise(no_flights = n())

  
  top_10_diverted_Flights <- diverted_flights %>%
    top_n(n = 10, wt = no_flights) %>%
    ggplot() +
    geom_bar(mapping = aes(x=reorder(AIRLINE.y, no_flights),
                           y=no_flights),
             stat="identity", fill = "light green") +
    coord_flip() +
    labs(title="Top  10 Airlines experiencing maximum diversions in flights",
         x="Carriers", y="no of Flights diverted") +
    theme_minimal()
  
  top_10_diverted_Flights

  • Reliability of Carriers : The data was further explored for most reliable Airline Carrier on the basis of Delayed flights as compared to the total flights. A flight was assumed delayed if the delay in arrival was greater than 10 minutes.
Flights_Delayed <- Flights_refined %>%
select(AIRLINE,ARRIVAL_DELAY) %>%
  mutate(DELAYED =  ARRIVAL_DELAY > 10 & !(is.na(ARRIVAL_DELAY))) %>%
  group_by(AIRLINE) %>%
    summarize(PERCENTAGE_DELAYED = sum(DELAYED)/n()) %>%
  left_join(Airlines,by = c("AIRLINE" = "IATA_CODE")) %>%
  select(AIRLINE.y, PERCENTAGE_DELAYED)

Flights_Delayed_df<- rbind(Flights_Delayed%>% mutate(group = "Delayed"),Flights_Delayed%>%
            mutate(PERCENTAGE_DELAYED = 1 - PERCENTAGE_DELAYED, group = "OnTime") %>%
            select( PERCENTAGE_DELAYED, group, AIRLINE.y))


ggplot(data = Flights_Delayed_df, aes(x = Flights_Delayed_df$AIRLINE.y, y = Flights_Delayed_df$PERCENTAGE_DELAYED, fill = Flights_Delayed_df$group)) + 
  geom_bar(stat = "identity") +
  coord_flip() +
  xlab("Airlines") +
  ylab("Percentage of Delayed Flights") +
  labs(fill = '') +
  theme_bw()

Analysis by Routes

The Flights data was analyzed with respect to the air-routes on which they flew in 2015 accross US, i.e. Origin and Destination airports. The purpose of this analysis was to uncover trends like most prefered destinations, various kind of delays on origin and destination airports, routes with maximum air traffic etc.

Preferred Destinations

The data was further explored to find the destination airports with maximum number of flights. These were the points with maximum air traffic density.

Preferred_destination<- Flights_refined %>%
    left_join(Airport, by=c("DESTINATION_AIRPORT"="IATA_CODE")) %>%
    group_by(CITY) %>%
    summarize(no_flights =n()) 


  #building the word cloud
  wordcloud(words = Preferred_destination$CITY, freq = Preferred_destination$no_flights, min.freq = 1,
            max.words=200, random.order=FALSE, rot.per=0.35, 
            colors=brewer.pal(8, "Dark2"),main="Most Preferred Destinations")

Weather Delay

The data was then explored to uncover routes with maximum weather delay. This insight could be useful to keep a time buffer while travelling on these routes as delay in arrival is expected on them.

  Flights_routes <- select(Flights_refined, ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_TIME,DEPARTURE_DELAY,ARRIVAL_DELAY
                           , DISTANCE,CANCELLED,DIVERTED,WEATHER_DELAY) %>%
    filter(!is.na(WEATHER_DELAY))
  
      Weather_delay<- Flights_routes %>%
      group_by(ORIGIN_AIRPORT,DESTINATION_AIRPORT) %>%
      summarize(WEATHER_DELAY =sum(WEATHER_DELAY>0)) %>%
    left_join(Airport, by=c("ORIGIN_AIRPORT"="IATA_CODE")) %>%
    filter(!is.na(AIRPORT)) %>%
    left_join(Airport, by=c("DESTINATION_AIRPORT"="IATA_CODE")) %>%
    filter(!is.na(AIRPORT.x) & !is.na(AIRPORT.y)) %>%
      arrange(desc(WEATHER_DELAY)) %>%
                head(n = 25) %>%
                mutate(Route = paste(CITY.x,CITY.y, sep = " - "))
    
 
     hchart(Weather_delay, "treemap", hcaes(x = Route, value = WEATHER_DELAY, color =WEATHER_DELAY)) %>% 
       hc_colorAxis(stops = color_stops(n = 10, colors = c("#F4D03F", "#229954", "#D35400"))) %>% 
       hc_title(text = "Routes with maximum weather delay") %>% 
       hc_add_theme(hc_theme_google()) %>% 
       hc_credits(enabled = TRUE, text = "Source : FAO", style = list(fontSize = "10px"))

Analysis by Time

The data was also analyzed in the aspect of variation in flights attributes with respect to time in the year. There were insights such as the distribution of number of flights over all the months in 2015 grouped by carrier.

Monthly flights’ variation

The data was then grouped by Months to visualize the trend of variation in number of flights in 2015. The data was further grouped by carrier within the month so as to observe the trend by Airline carriers.

Weather Delays

The data was further analyzed on variation in number of weather delays accross the months in 2015. It was found that Februray observed maximum number of Weather Delays.

Summary

After the above data analysis, the following insights were obtained from the Flight cancellation and delays data for the year 2015:

  • Chicago was one of the busiest of airports catering to maximum number of flights.
  • Southwest Airlines Co. was the Airline with maximum number of flights.
  • Chicago and Seattle were areas with high flight density.
  • Almost 20 % of the flights were delayed from their scheduled arrival.
  • February was the month with least air traffic.
  • February was the month with maximum weather delays.
  • ‘San Francisco to Los Angeles’ and ‘New York to Chicago’ were among the busiest routes.
  • Phoenix and Charlotte were the airports with maximum security delay at airport.
  • ‘Chicago to Los Angeles’ was the route with maximum weather delay.

Also, in addition to the above analysis, Regression techniques like linear regression can be used to identify dependencies between canceled flights, delays and other variables so as to avoid further inconvenience to passangers in future.