Objective
The project report is intended to perform basic data exploratory analysis on the Flights Cancellation dataset published by US Department of Transportation. The analysis aims at revealing hidden data trends and also facts like most visited Destination, common Cancellation reasons, dependencies among various variables if any, most reliable carrier etc.
Data Source
The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers.The Flights delay and cancellation data was collected and published by the DOT’s Bureau of Transportation Statistics which included summary information on the number of on-time, delayed, canceled, and diverted flights etc.
Data Format
The following times are in the xx:yy - hour:minute format (e.g. 1536 means 3:36pm, 345 means 3:45am, 16 means 00:16am)
The following times are in minutes format (negatives mean actual_time is ahead of scheduled_time for the absolute value of that negative number)
Distance is in miles
For this project, the majority of packages used are the standard ones for collecting, tidying, and analyzing data.
## Load Required Packages ##
library(tidyverse)
library(stringr) ## Numerous functions for text manipulation
library(data.table) # useful for functions like fread
library(tidyverse) ## Tidying data
library(DT) ## Output data in nice format
library(knitr) ## Format tables
library(kableExtra) ## Add extra formatting options to tables
library("wordcloud") ## to create a word cloud
library("RColorBrewer") ## color palletes
library(plotly) ## Used to plot interactive publication quality charts
library(treemap) ## to create a tree map
library(maps) ## to create maps
library(geosphere)
library(highcharter) ## to create interactive treemaps
The process of data preparation involved following steps:
#Importing files into R
Airlines <- fread("airlines.csv")
Airport <- fread("airports.csv")
Flights <- fread("flights.csv")
# removing missing values
Flights_refined <- Flights %>%
filter(!is.na(DEPARTURE_TIME) & !is.na(WHEELS_OFF) & !is.na(AIR_TIME) & !is.na(TAXI_IN) | DIVERTED == 1) %>%
as.data.table()
The structure and summary of the refined set can be found as below:
| Attributes | Class | Missing_Values | Min_Value | Max_Value |
|---|---|---|---|---|
| YEAR | integer | 0 | 2015 | 2015 |
| MONTH | integer | 0 | 1 | 12 |
| DAY | integer | 0 | 1 | 31 |
| DAY_OF_WEEK | integer | 0 | 1 | 7 |
| AIRLINE | character | 0 | AA | WN |
| FLIGHT_NUMBER | integer | 0 | 1 | 9320 |
| TAIL_NUMBER | character | 0 | 7819A | N9EAMQ |
| ORIGIN_AIRPORT | character | 0 | 10135 | YUM |
| DESTINATION_AIRPORT | character | 0 | 10135 | YUM |
| SCHEDULED_DEPARTURE | integer | 0 | 1 | 2359 |
| DEPARTURE_TIME | integer | 0 | 1 | 2400 |
| DEPARTURE_DELAY | integer | 0 | -82 | 1988 |
| TAXI_OUT | integer | 0 | 1 | 225 |
| WHEELS_OFF | integer | 0 | 1 | 2400 |
| SCHEDULED_TIME | integer | 1 | NA | NA |
| ELAPSED_TIME | integer | 15187 | NA | NA |
| AIR_TIME | integer | 15187 | NA | NA |
| DISTANCE | integer | 0 | 31 | 4983 |
| WHEELS_ON | integer | 2629 | NA | NA |
| TAXI_IN | integer | 2629 | NA | NA |
| SCHEDULED_ARRIVAL | integer | 0 | 1 | 2400 |
| ARRIVAL_TIME | integer | 2629 | NA | NA |
| ARRIVAL_DELAY | integer | 15187 | NA | NA |
| DIVERTED | integer | 0 | 0 | 1 |
| CANCELLED | integer | 0 | 0 | 0 |
| CANCELLATION_REASON | character | 0 | ||
| AIR_SYSTEM_DELAY | integer | 4665756 | NA | NA |
| SECURITY_DELAY | integer | 4665756 | NA | NA |
| AIRLINE_DELAY | integer | 4665756 | NA | NA |
| LATE_AIRCRAFT_DELAY | integer | 4665756 | NA | NA |
| WEATHER_DELAY | integer | 4665756 | NA | NA |
Minimum and maximum Values for Character Variables will reflect less meaning
The data for flights was explored for hidden trends and useful insights on the following aspects:
There were a total of 14 unique carriers which had flights operating in US in 2015:
Airline_max_flights <- Flights_refined %>%
group_by(AIRLINE) %>%
summarize(no_flights = n()) %>%
select(AIRLINE,no_flights) %>%
left_join(Airlines,by = c("AIRLINE" = "IATA_CODE")) %>%
select(AIRLINE.y, no_flights)
hchart(Airline_max_flights, "treemap", hcaes(x = AIRLINE.y, value = no_flights, color =no_flights)) %>%
hc_colorAxis(stops = color_stops(n = 10, colors = c("#2471A3", "#D5D8DC", "#FDE725"))) %>%
hc_title(text = "Operating Airlines by no of Flights") %>%
hc_add_theme(hc_theme_google()) %>%
hc_credits(enabled = TRUE, text = "Source : FAO", style = list(fontSize = "10px"))
sample code for each Airline
Flights_network <- Flights_refined %>%
select(AIRLINE,ORIGIN_AIRPORT,DESTINATION_AIRPORT) %>%
left_join(Airlines,by=c("AIRLINE" = "IATA_CODE")) %>%
select(AIRLINE.y,ORIGIN_AIRPORT,DESTINATION_AIRPORT)
names(Flights_network)[1] <- "AIRLINE"
Flights_network<- Flights_network%>%
left_join(Airport, by=c("ORIGIN_AIRPORT"="IATA_CODE")) %>%
filter(!is.na(AIRPORT))
names(Flights_network)[names(Flights_network) == 'LATITUDE'] <- 'ORIGIN_LATITUDE'
names(Flights_network)[names(Flights_network) == 'LONGITUDE'] <- 'ORIGIN_LONGITUDE'
Flights_network <- Flights_network%>%
left_join(Airport, by=c("DESTINATION_AIRPORT"="IATA_CODE"))
names(Flights_network)[names(Flights_network) == 'LATITUDE'] <- 'DESTINATION_LATITUDE'
names(Flights_network)[names(Flights_network) == 'LONGITUDE'] <- 'DESTINATION_LONGITUDE'
Flights_network<- Flights_network%>%
group_by(AIRLINE,CITY.y,CITY.x,ORIGIN_LONGITUDE,ORIGIN_LATITUDE,DESTINATION_LATITUDE,DESTINATION_LONGITUDE) %>%
summarize(no_flights = n()) %>%
filter(!is.na(ORIGIN_LONGITUDE) & !is.na(DESTINATION_LONGITUDE))
xlim <- c(-176.64603, -56.601563)
ylim <- c(12.039321, 71.856229)
map("world", col="#191919", fill=TRUE, bg="#000000", lwd=0.05, xlim=xlim, ylim=ylim)
pal <- colorRampPalette(c("#333333", "white", "#1292db"))
colors <- pal(100)
a<- Flights_network %>%
filter(AIRLINE == "Virgin America")
for (j in 1:nrow(a)) {
inter <- gcIntermediate( c(a[j,]$ORIGIN_LONGITUDE,a[j,]$ORIGIN_LATITUDE), c(a[j,]$DESTINATION_LONGITUDE,a[j,]$DESTINATION_LATITUDE), n=100, addStartEnd=TRUE)
colindex <- round( (a[j,]$no_flights / max(a$no_flights)) * length(colors) )
lines(inter, col=colors[colindex], lwd=0.8)
}
diverted_flights <- Flights_refined %>%
filter(DIVERTED > 0 ) %>%
left_join(Airlines,by=c("AIRLINE" = "IATA_CODE")) %>%
group_by(AIRLINE.y) %>%
summarise(no_flights = n())
top_10_diverted_Flights <- diverted_flights %>%
top_n(n = 10, wt = no_flights) %>%
ggplot() +
geom_bar(mapping = aes(x=reorder(AIRLINE.y, no_flights),
y=no_flights),
stat="identity", fill = "light green") +
coord_flip() +
labs(title="Top 10 Airlines experiencing maximum diversions in flights",
x="Carriers", y="no of Flights diverted") +
theme_minimal()
top_10_diverted_Flights
Flights_Delayed <- Flights_refined %>%
select(AIRLINE,ARRIVAL_DELAY) %>%
mutate(DELAYED = ARRIVAL_DELAY > 10 & !(is.na(ARRIVAL_DELAY))) %>%
group_by(AIRLINE) %>%
summarize(PERCENTAGE_DELAYED = sum(DELAYED)/n()) %>%
left_join(Airlines,by = c("AIRLINE" = "IATA_CODE")) %>%
select(AIRLINE.y, PERCENTAGE_DELAYED)
Flights_Delayed_df<- rbind(Flights_Delayed%>% mutate(group = "Delayed"),Flights_Delayed%>%
mutate(PERCENTAGE_DELAYED = 1 - PERCENTAGE_DELAYED, group = "OnTime") %>%
select( PERCENTAGE_DELAYED, group, AIRLINE.y))
ggplot(data = Flights_Delayed_df, aes(x = Flights_Delayed_df$AIRLINE.y, y = Flights_Delayed_df$PERCENTAGE_DELAYED, fill = Flights_Delayed_df$group)) +
geom_bar(stat = "identity") +
coord_flip() +
xlab("Airlines") +
ylab("Percentage of Delayed Flights") +
labs(fill = '') +
theme_bw()
The Flights data was analyzed with respect to the air-routes on which they flew in 2015 accross US, i.e. Origin and Destination airports. The purpose of this analysis was to uncover trends like most prefered destinations, various kind of delays on origin and destination airports, routes with maximum air traffic etc.
Preferred Destinations
The data was further explored to find the destination airports with maximum number of flights. These were the points with maximum air traffic density.
Preferred_destination<- Flights_refined %>%
left_join(Airport, by=c("DESTINATION_AIRPORT"="IATA_CODE")) %>%
group_by(CITY) %>%
summarize(no_flights =n())
#building the word cloud
wordcloud(words = Preferred_destination$CITY, freq = Preferred_destination$no_flights, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"),main="Most Preferred Destinations")
Weather Delay
The data was then explored to uncover routes with maximum weather delay. This insight could be useful to keep a time buffer while travelling on these routes as delay in arrival is expected on them.
Flights_routes <- select(Flights_refined, ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_TIME,DEPARTURE_DELAY,ARRIVAL_DELAY
, DISTANCE,CANCELLED,DIVERTED,WEATHER_DELAY) %>%
filter(!is.na(WEATHER_DELAY))
Weather_delay<- Flights_routes %>%
group_by(ORIGIN_AIRPORT,DESTINATION_AIRPORT) %>%
summarize(WEATHER_DELAY =sum(WEATHER_DELAY>0)) %>%
left_join(Airport, by=c("ORIGIN_AIRPORT"="IATA_CODE")) %>%
filter(!is.na(AIRPORT)) %>%
left_join(Airport, by=c("DESTINATION_AIRPORT"="IATA_CODE")) %>%
filter(!is.na(AIRPORT.x) & !is.na(AIRPORT.y)) %>%
arrange(desc(WEATHER_DELAY)) %>%
head(n = 25) %>%
mutate(Route = paste(CITY.x,CITY.y, sep = " - "))
hchart(Weather_delay, "treemap", hcaes(x = Route, value = WEATHER_DELAY, color =WEATHER_DELAY)) %>%
hc_colorAxis(stops = color_stops(n = 10, colors = c("#F4D03F", "#229954", "#D35400"))) %>%
hc_title(text = "Routes with maximum weather delay") %>%
hc_add_theme(hc_theme_google()) %>%
hc_credits(enabled = TRUE, text = "Source : FAO", style = list(fontSize = "10px"))
The data was also analyzed in the aspect of variation in flights attributes with respect to time in the year. There were insights such as the distribution of number of flights over all the months in 2015 grouped by carrier.
Monthly flights’ variation
The data was then grouped by Months to visualize the trend of variation in number of flights in 2015. The data was further grouped by carrier within the month so as to observe the trend by Airline carriers.
Weather Delays
The data was further analyzed on variation in number of weather delays accross the months in 2015. It was found that Februray observed maximum number of Weather Delays.
After the above data analysis, the following insights were obtained from the Flight cancellation and delays data for the year 2015:
Also, in addition to the above analysis, Regression techniques like linear regression can be used to identify dependencies between canceled flights, delays and other variables so as to avoid further inconvenience to passangers in future.