Project Proposal

Analysis on Flight Cancellation and Delays

Objective

The project report is intended to perform basic data exploratory analysis on the Flights Cancellation dataset published by US Department of Transportation. The analysis aims at revealing hidden data trends and also facts like most visited Destination, common Cancellation reasons, dependencies among various variables if any, most reliable carrier etc.

Packages Required

For this project, the majority of packages used are the standard ones for collecting, tidying, and analyzing data.

## Load Required Packages ##
library(stringr) ## Numerous functions for text manipulation
library(data.table) # useful for functions like fread
library(dplyr) ## Manipulating data
library(tidyverse) ## Tidying data
library(ggplot2) ## Visualizing data
library(DT) ## Output data in nice format

Data Source

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers.The Flights delay and cancellation data was collected and published by the DOT’s Bureau of Transportation Statistics which included summary information on the number of on-time, delayed, canceled, and diverted flights etc.

Data Format

The following times are in the xx:yy - hour:minute format (e.g. 1536 means 3:36pm, 345 means 3:45am, 16 means 00:16am)

scheduled_departure
departure_time
scheduled_arrival
arrival_time
wheels_off
wheels_on

The following times are in minutes format (negatives mean actual_time is ahead of scheduled_time for the absolute value of that negative number)

arrival_delay
departure_delay
taxi_in
taxi_out
scheduled_time
elapsed_time
air_time

Distance is in miles

Data Cleaning

The process of data preparation involved following steps:

Data was imported from three files named ‘Airports’, ‘Airlines’ and ‘Flights’ using fread.
Structure (str), Glimpse and Summary were examined of the raw data to understand it.
The data was then examined for missing values.
In Flights, there were 12 columns for which the values were NA.
A refined dataset was created by omitting NA values from 8 out of 12 columns. The reasons for exclusion of the 4 columns were first that, the values were missing in these for more than 80% of the rows and secondly, the columns were like AIR_SYSTEM_DELAY, SECURITY_DELAY which can hold NA values in case there is no delay from the original schedule.
It was further observed that majority of values for Cancellation reasons were empty.
Data for Airlines and Airports were also examined for NA values and it was observed that Airport had 3 NA values for latitude and longitude. These values were not removed as latitude and longitude were not required and also can be restored by searching for airports.

#Importing files into R
Airlines <- fread("airlines.csv")
Airport <- fread("airports.csv")
Flights <- fread("flights.csv")

# removing missing values
Flights_refined <- Flights %>%
                  filter(!is.na(DEPARTURE_TIME) & !is.na(WHEELS_OFF) & !is.na(AIR_TIME) & !is.na(TAXI_IN))

The structure and summary of the refined set can be found as below:

##                         class missing.count   min    max
## YEAR                  integer             0  2015   2015
## MONTH                 integer             0     1     12
## DAY                   integer             0     1     31
## DAY_OF_WEEK           integer             0     1      7
## AIRLINE             character             0    AA     WN
## FLIGHT_NUMBER         integer             0     1   9320
## TAIL_NUMBER         character             0 7819A N9EAMQ
## ORIGIN_AIRPORT      character             0 10135    YUM
## DESTINATION_AIRPORT character             0 10135    YUM
## SCHEDULED_DEPARTURE   integer             0     1   2359
## DEPARTURE_TIME        integer             0     1   2400
## DEPARTURE_DELAY       integer             0   -82   1988
## TAXI_OUT              integer             0     1    225
## WHEELS_OFF            integer             0     1   2400
## SCHEDULED_TIME        integer             0    18    718
## ELAPSED_TIME          integer             0    14    766
## AIR_TIME              integer             0     7    690
## DISTANCE              integer             0    31   4983
## WHEELS_ON             integer             0     1   2400
## TAXI_IN               integer             0     1    248
## SCHEDULED_ARRIVAL     integer             0     1   2400
## ARRIVAL_TIME          integer             0     1   2400
## ARRIVAL_DELAY         integer             0   -87   1971
## DIVERTED              integer             0     0      0
## CANCELLED             integer             0     0      0
## CANCELLATION_REASON character             0             
## AIR_SYSTEM_DELAY      integer       4650569  <NA>   <NA>
## SECURITY_DELAY        integer       4650569  <NA>   <NA>
## AIRLINE_DELAY         integer       4650569  <NA>   <NA>
## LATE_AIRCRAFT_DELAY   integer       4650569  <NA>   <NA>
## WEATHER_DELAY         integer       4650569  <NA>   <NA>

Proposed Exploratory Data Analysis

The data for flights will be explored for following aspects along with other hidden data trends:

All the airline codes in airlines table will be joined on the airline’s code with Flights to obtain pointers such as carriers observing maximum delay, carriers having maximum flights and other insights on carrier performance.
All the airport codes from ‘Airports’ will be joined with the origin and destination codes from ‘Flights’ to obtain missing airport codes if any, and also to uncover data trends like most visited destination, routes experiencing most delay etc.
Various kinds of delays can be examined for insights such about most prominent delay reasons, mean arrival delay for each carrier etc.
Various cancellation reasons can be examined to find the most prominent reasons behind flight cancellation.
Various plots like Boxplots (which can be used to study outliers as well as maximum variation among variables), histogram (to study the distribution of various variables), line charts and Scatter plots(to study the dependencies and relationship among various variables) will be utilized to study and uncover the data trends in the existing data set.
Regression techniques like linear regression can be used to identify dependencies between canceled flights, delays and other variables so as to avoid further inconvenience to passangers in future.