This project report presents a basic exploratory data analysis of the Flights Cancellation dataset published by the U.S. Department of Transportation. The analysis aims to reveal trends in the data, along with facts such as the most visited destination, common cancellation reasons, dependencies among variables (if any), and the most reliable carrier.
For this project, the majority of packages used are the standard ones for collecting, tidying, and analyzing data.
## Load Required Packages ##
library(stringr)    # Functions for text manipulation
library(data.table) # Fast file reading with fread()
library(dplyr)      # Manipulating data
library(tidyverse)  # Tidying data (also attaches dplyr, ggplot2, and stringr)
library(ggplot2)    # Visualizing data
library(DT)         # Displaying data in interactive tables
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. The flight delay and cancellation data were collected and published by the Bureau and include summary information on the number of on-time, delayed, canceled, and diverted flights.
The following times are in xxyy (hour, minute) format, e.g. 1536 means 3:36 pm, 345 means 3:45 am, and 16 means 12:16 am (00:16)
The following times are in minutes (a negative value means the actual time is ahead of the scheduled time by that many minutes)
Distance is in miles
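As a small illustration of the hour-minute encoding described above, the sketch below converts such integers into "HH:MM" strings. The helper name `hhmm_to_clock` is hypothetical and not part of the original analysis, and treating 2400 as midnight (00:00) is an assumption.

```r
# Hypothetical helper: convert an integer in hour-minute encoding
# (e.g. 1536) into an "HH:MM" string on a 24-hour clock.
hhmm_to_clock <- function(x) {
  hours   <- (x %/% 100) %% 24  # assumption: 2400 wraps around to hour 0
  minutes <- x %% 100
  sprintf("%02d:%02d", as.integer(hours), as.integer(minutes))
}

clock_times <- hhmm_to_clock(c(1536, 345, 16, 2400))
clock_times
# "15:36" "03:45" "00:16" "00:00"
```

Only base R is used, so the helper could be applied to any of the time columns without extra packages.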
The process of data preparation involved the following steps:
#Importing files into R
Airlines <- fread("airlines.csv")
Airport <- fread("airports.csv")
Flights <- fread("flights.csv")
# Removing rows with missing values in key timing columns
Flights_refined <- Flights %>%
  filter(
    !is.na(DEPARTURE_TIME) &
      !is.na(WHEELS_OFF) &
      !is.na(AIR_TIME) &
      !is.na(TAXI_IN)
  )
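To get a feel for how many rows a filter like the one above removes, it can help to count missing values per column first. The sketch below does this on a toy data frame with made-up values. It uses base R's `complete.cases()` as a stand-in for the chained `!is.na()` checks rather than the dplyr pipeline itself, and the names `toy_flights`, `key_cols`, and `toy_refined` are illustrative only.

```r
# Toy stand-in for the Flights table; the values are made up for illustration.
toy_flights <- data.frame(
  DEPARTURE_TIME = c(1536L, NA, 345L, 16L),
  WHEELS_OFF     = c(1550L, NA, 400L, 30L),
  AIR_TIME       = c(120L, NA, NA, 45L),
  TAXI_IN        = c(5L, NA, NA, 7L)
)

key_cols <- c("DEPARTURE_TIME", "WHEELS_OFF", "AIR_TIME", "TAXI_IN")

# Missing values per column before filtering.
na_per_column <- colSums(is.na(toy_flights[, key_cols]))

# complete.cases() is TRUE only for rows with no NA in the selected columns,
# which selects the same rows as combining the !is.na() checks with &.
keep <- complete.cases(toy_flights[, key_cols])
toy_refined <- toy_flights[keep, ]

# Number of rows removed by the filter.
rows_dropped <- nrow(toy_flights) - nrow(toy_refined)
```

Comparing `na_per_column` with `rows_dropped` shows whether the missing values are concentrated in the same rows or spread across different ones.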
The structure and summary of the refined dataset are shown below:
## class missing.count min max
## YEAR integer 0 2015 2015
## MONTH integer 0 1 12
## DAY integer 0 1 31
## DAY_OF_WEEK integer 0 1 7
## AIRLINE character 0 AA WN
## FLIGHT_NUMBER integer 0 1 9320
## TAIL_NUMBER character 0 7819A N9EAMQ
## ORIGIN_AIRPORT character 0 10135 YUM
## DESTINATION_AIRPORT character 0 10135 YUM
## SCHEDULED_DEPARTURE integer 0 1 2359
## DEPARTURE_TIME integer 0 1 2400
## DEPARTURE_DELAY integer 0 -82 1988
## TAXI_OUT integer 0 1 225
## WHEELS_OFF integer 0 1 2400
## SCHEDULED_TIME integer 0 18 718
## ELAPSED_TIME integer 0 14 766
## AIR_TIME integer 0 7 690
## DISTANCE integer 0 31 4983
## WHEELS_ON integer 0 1 2400
## TAXI_IN integer 0 1 248
## SCHEDULED_ARRIVAL integer 0 1 2400
## ARRIVAL_TIME integer 0 1 2400
## ARRIVAL_DELAY integer 0 -87 1971
## DIVERTED integer 0 0 0
## CANCELLED integer 0 0 0
## CANCELLATION_REASON character 0
## AIR_SYSTEM_DELAY integer 4650569 <NA> <NA>
## SECURITY_DELAY integer 4650569 <NA> <NA>
## AIRLINE_DELAY integer 4650569 <NA> <NA>
## LATE_AIRCRAFT_DELAY integer 4650569 <NA> <NA>
## WEATHER_DELAY integer 4650569 <NA> <NA>
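A summary of this shape (class, missing count, minimum, maximum per column) could be produced in several ways. The sketch below is one base R possibility, not necessarily the method used to generate the table above. The function names `summarise_columns` and `range_value` and the toy data are hypothetical.

```r
# Hypothetical helper: min or max of a column as text, or NA if all values are missing.
range_value <- function(col, stat) {
  if (all(is.na(col))) return(NA_character_)
  as.character(stat(col, na.rm = TRUE))
}

# Hypothetical helper: one summary row per column of a data frame.
summarise_columns <- function(df) {
  data.frame(
    class         = vapply(df, function(col) class(col)[1], character(1)),
    missing.count = vapply(df, function(col) sum(is.na(col)), integer(1)),
    min           = vapply(df, range_value, character(1), stat = min),
    max           = vapply(df, range_value, character(1), stat = max),
    row.names     = names(df),
    stringsAsFactors = FALSE
  )
}

# Toy data with made-up values, including a column that is entirely missing.
toy <- data.frame(
  MONTH         = c(1L, 12L, 6L),
  AIRLINE       = c("AA", "WN", "DL"),
  WEATHER_DELAY = c(NA_integer_, NA, NA),
  stringsAsFactors = FALSE
)

res <- summarise_columns(toy)
res
```

Columns that are entirely missing, like the delay columns in the table above, report `NA` for both minimum and maximum.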
The flight data will be explored for the following aspects, along with other trends in the data: