This study analyzes a dataset of flight delays and cancellations in the year 2015. Delays and cancellations cause huge losses to the airline industry, so a close analysis of the factors behind them can help improve air services and can also inform the development of new airports. Time is an invaluable resource, so reducing delays and cancellations increases productivity. This paper addresses the issue of delays and cancellations, and helps travellers see which months have the most delays or cancellations and which airlines face these issues most often. This will help them make better choices and have a better experience next time.
The objective of this study is to find which factors result in the most delay. For this detailed dataset we narrow our goals down to three parts:
Part 1: Factors affecting cancellation
Part 2: Factors affecting departure delay
Part 3: Factors affecting arrival delay
To achieve this we do the following:
1. Loading libraries
2. Reading the datasets
3. Cleaning the data
4. Finding factors of cancellation
5. Delay ratios
6. Departure delay analysis
7. Arrival delay analysis
8. Regression models
First we find the overall percentage of cancelled flights from the cancelled column in the dataset, which uses 1 to indicate cancelled flights and 0 to indicate non-cancelled flights. Overall, 1.57% of flights were cancelled. To get a deeper insight we divide the reasons for cancellation into 4 segments:
1) Airline or carrier reasons
2) Weather reasons
3) National Air System reasons
4) Security reasons
On forming a two-way table we find that 1 out of every 2 cancellations is due to weather (i.e. 50%), 28% of cancellations were due to airline issues and 17% were due to National Air System factors. Next we look at which months of the year have the most cancellations in this dataset. Most cancellations occur during January, February and March. This is consistent with the seasons in the United States: heavy snowfall in winter causes many flights to be cancelled in these months. It is also observed that most cancellations take place at the start of the week, i.e. on Monday or Tuesday.
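The cancellation rate and the breakdown by reason described above can be sketched as follows. This is a minimal illustration on made-up rows, not the study's own code: the column names CANCELLED and CANCELLATION_REASON and the reason codes A to D are assumptions about how the dataset is laid out.

```python
from collections import Counter

# Toy rows standing in for the flights table. Column names and the
# reason codes A-D are assumptions, and the values are made up.
flights = [
    {"CANCELLED": 0, "CANCELLATION_REASON": None},
    {"CANCELLED": 1, "CANCELLATION_REASON": "B"},
    {"CANCELLED": 0, "CANCELLATION_REASON": None},
    {"CANCELLED": 1, "CANCELLATION_REASON": "A"},
    {"CANCELLED": 1, "CANCELLATION_REASON": "B"},
    {"CANCELLED": 0, "CANCELLATION_REASON": None},
    {"CANCELLED": 0, "CANCELLATION_REASON": None},
    {"CANCELLED": 0, "CANCELLATION_REASON": None},
]

# Hypothetical mapping from reason code to the four segments in the text.
REASON_LABELS = {
    "A": "Airline/carrier",
    "B": "Weather",
    "C": "National Air System",
    "D": "Security",
}

def cancellation_rate(rows):
    """Percentage of rows whose CANCELLED flag is 1."""
    return 100.0 * sum(r["CANCELLED"] for r in rows) / len(rows)

def reason_shares(rows):
    """Percentage share of each reason among cancelled flights only."""
    counts = Counter(
        r["CANCELLATION_REASON"] for r in rows if r["CANCELLED"] == 1
    )
    total = sum(counts.values())
    return {REASON_LABELS[code]: 100.0 * n / total for code, n in counts.items()}

rate = cancellation_rate(flights)
shares = reason_shares(flights)
print(f"{rate:.2f}% of flights cancelled")
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}% of cancellations")
```

Grouping the same counts by month or by day of week instead of by reason would give the seasonal and weekday views discussed above.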
We add a new column to the dataset called 'status'. This column specifies one of five statuses:
1) Departure delay
2) Arrival delay
3) Arrival and departure delay
4) Cancelled
5) On time
On forming a one-way contingency table we observe:
52% of flights were on time
26% of flights had departure delays
9% of flights had arrival delays
9% of flights had both arrival and departure delays
1.54% of flights were cancelled
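A minimal sketch of how such a status column and its one-way table could be derived is shown below. The rule that any delay above 0 minutes counts as delayed is an assumption (the text does not state the threshold), and the function name flight_status and the sample values are hypothetical.

```python
from collections import Counter

def flight_status(cancelled, dep_delay, arr_delay):
    """Assign one of the five statuses.

    Assumption: a flight counts as delayed when the delay is > 0 minutes.
    """
    if cancelled == 1:
        return "Cancelled"
    dep_late = dep_delay > 0
    arr_late = arr_delay > 0
    if dep_late and arr_late:
        return "Arrival and departure delay"
    if dep_late:
        return "Departure delay"
    if arr_late:
        return "Arrival delay"
    return "On time"

# (cancelled flag, departure delay, arrival delay) in minutes; made-up values.
sample = [
    (0, -3, -10),
    (0, 12, -2),
    (0, -1, 7),
    (0, 25, 30),
    (1, 0, 0),
    (0, 0, -5),
]

statuses = [flight_status(*row) for row in sample]

# One-way contingency table expressed as percentages.
counts = Counter(statuses)
table = {status: 100.0 * n / len(statuses) for status, n in counts.items()}
for status, pct in sorted(table.items()):
    print(f"{status}: {pct:.1f}%")
```

Checking the cancelled flag first keeps cancelled flights out of the delay categories, so the five statuses are mutually exclusive.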
As we saw previously, 26% of flights had departure delays only and 9% had both departure and arrival delays. Hence 35% (26 + 9) of flights were delayed at departure. We form a new dataset containing only the flights with departure delays. We then divide the delays into segments by duration and observe that, of all the delayed flights:
41% were delayed by less than 10 mins
28% were delayed by 10 to 30 mins
14% were delayed by 30 to 60 mins
6% were delayed by 60 to 90 mins
3% were delayed by 90 to 120 mins
3% were delayed by 120 to 180 mins
2% were delayed by 180+ mins
To make the data easier to interpret, we treat a delay of up to 5 mins as on time, categorize a delay of 5 to 15 mins as a small delay and a delay of over 15 mins as a long delay. After doing so we see that almost 50% of these flights were delayed by more than 15 mins and approximately 25% had small delays. The remainder we consider on time, as there was hardly any delay.
Drilling down further, we find that most long delays (i.e. a 50% share of all delays) occur in the months of June, July and August. The largest share of long delays belongs to Southwest Airlines Co. (25%), followed by American Airlines Inc., Delta Air Lines Inc. and United Air Lines Inc. A similar trend is seen for small delays and on-time flights. However, these airline percentages do not take into account the number of flights each airline operates, so it would not be fair to judge which airline has the most delays this way. We therefore look at delay frequency relative to each airline's number of flights and observe a very different result: the airline with the highest delay rate is Frontier Airlines Inc., followed by Spirit Air Lines and Atlantic Southeast Airlines.
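The duration segments and the on time / small delay / long delay categorization described above can be sketched as follows. The handling of boundary values (for example, whether exactly 15 minutes counts as small or long) is an assumption, since the text does not specify it, and the delay values are made up.

```python
# Upper bounds (in minutes) of the duration segments used in the text.
# Anything at or above the last bound falls into the open-ended "180+" segment.
SEGMENTS = [
    (10, "<10"),
    (30, "10-30"),
    (60, "30-60"),
    (90, "60-90"),
    (120, "90-120"),
    (180, "120-180"),
]

def segment(delay):
    """Return the duration segment for a delay in minutes."""
    for upper, label in SEGMENTS:
        if delay < upper:
            return label
    return "180+"

def category(delay):
    """Up to 5 min is on time, up to 15 min is small, beyond that is long.

    Assumption: boundary values (exactly 5 or 15) fall in the lower category.
    """
    if delay <= 5:
        return "On time"
    if delay <= 15:
        return "Small delay"
    return "Long delay"

# Made-up departure delays in minutes.
delays = [3, 8, 12, 15, 16, 45, 95, 200]
segments = [segment(d) for d in delays]
categories = [category(d) for d in delays]

for d, s, c in zip(delays, segments, categories):
    print(f"{d:>4} min  segment={s:<8} category={c}")
```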
As we saw previously, 9% of flights had arrival delays only and 9% had both departure and arrival delays. Hence 18% (9 + 9) of flights were delayed at arrival. We form a new dataset containing only the flights with arrival delays. We then divide the delays into segments by duration and observe that, of all the arrival delays:
39% were delayed by less than 10 mins
29% were delayed by 10 to 30 mins
15% were delayed by 30 to 60 mins
6% were delayed by 60 to 90 mins
3% were delayed by 90 to 120 mins
3% were delayed by 120 to 180 mins
2% were delayed by 180+ mins
To make the data easier to interpret, we treat a delay of up to 5 mins as on time, categorize a delay of 5 to 15 mins as a small delay and a delay of over 15 mins as a long delay. After doing so we see that almost 50% of these flights were delayed by more than 15 mins and approximately 30% had small delays. The remainder we consider on time, as there was hardly any delay.
Drilling down further, we find that most long delays (i.e. a 50% share of all delays) occur in the months of June, July and December. The largest share of long delays belongs to Southwest Airlines Co. (25%), followed by American Airlines Inc., Delta Air Lines Inc. and United Air Lines Inc. A similar trend is seen for small delays and on-time flights. However, these airline percentages do not take into account the number of flights each airline operates, so it would not be fair to judge which airline has the most delays this way. We therefore look at delay frequency relative to each airline's number of flights and observe a very different result: the airline with the highest delay rate is Frontier Airlines Inc., followed by Spirit Air Lines and Atlantic Southeast Airlines.
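The difference between an airline's share of all long delays and its long-delay rate per flight can be illustrated with a small made-up example. The airline names BIG and SMALL and all counts below are hypothetical. They only show how the ranking can flip once the number of flights is taken into account.

```python
from collections import Counter

# (airline, delay category) pairs. "BIG" operates many flights, "SMALL" few.
records = (
    [("BIG", "Long delay")] * 3
    + [("BIG", "On time")] * 7
    + [("SMALL", "Long delay")] * 2
)

flights_per_airline = Counter(airline for airline, _ in records)
long_per_airline = Counter(
    airline for airline, cat in records if cat == "Long delay"
)
total_long = sum(long_per_airline.values())

# Share of all long delays: favours airlines that simply fly more.
share_of_long = {
    a: 100.0 * n / total_long for a, n in long_per_airline.items()
}

# Long-delay rate: long delays relative to each airline's own flight count.
long_rate = {
    a: 100.0 * long_per_airline[a] / flights_per_airline[a]
    for a in flights_per_airline
}

for a in flights_per_airline:
    print(f"{a}: share of long delays {share_of_long.get(a, 0.0):.0f}%, "
          f"long-delay rate {long_rate[a]:.0f}%")
```

Here BIG accounts for most long delays in absolute terms, while SMALL has the higher rate per flight, which mirrors the contrast between the two rankings reported above.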
Independent variables: AIR_SYSTEM_DELAY, WEATHER_DELAY, LATE_AIRCRAFT_DELAY, AIRLINE_DELAY, SECURITY_DELAY
Dependent variable: Departure delay
Hence we can predict departure delay from the other delay components. This model is quite good at predicting delays, as its adjusted R-squared is 0.9402.
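As a rough sketch of what such a model involves, the example below fits an ordinary least-squares regression with an intercept on the five delay components and reports R-squared and adjusted R-squared. It uses plain Python and a handful of made-up rows, so the numbers it prints are unrelated to the 0.9402 reported above. The helper names solve and fit_ols are hypothetical, and the study itself presumably used a statistics package rather than hand-written code.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_ols(X, y):
    """Least-squares fit with intercept via the normal equations.

    Returns (coefficients with intercept first, R2, adjusted R2).
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    n, p = len(y), k - 1
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return beta, r2, adj_r2

# Made-up rows: AIR_SYSTEM, WEATHER, LATE_AIRCRAFT, AIRLINE, SECURITY delays (min).
X = [
    (10, 0, 0, 5, 0), (0, 20, 0, 0, 0), (0, 0, 30, 10, 0), (5, 0, 15, 0, 0),
    (0, 0, 0, 25, 3), (12, 8, 0, 0, 0), (0, 0, 40, 0, 6), (7, 0, 0, 18, 0),
]
# Made-up departure delays: roughly the sum of the components plus noise.
y = [16.0, 18.0, 40.0, 22.0, 27.0, 21.0, 45.0, 25.0]

beta, r2, adj_r2 = fit_ols(X, y)
print("coefficients:", [round(b, 3) for b in beta])
print(f"R-squared={r2:.4f}, adjusted R-squared={adj_r2:.4f}")
```

Adjusted R-squared penalizes the plain R-squared for the number of predictors, which is why it is the figure usually quoted for a model with several independent variables.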
Independent variables: AIR_SYSTEM_DELAY, WEATHER_DELAY, LATE_AIRCRAFT_DELAY, AIRLINE_DELAY, SECURITY_DELAY
Dependent variable: Arrival delay
Hence we can predict arrival delay from the other delay components. This model is quite good at predicting delays, as its adjusted R-squared is 0.9402.
Independent variable: Departure delay
Dependent variable: Arrival delay
Hence arrival delay can be predicted from departure delay, with a multiple R-squared of 0.8924.
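With a single predictor, the fit reduces to closed-form expressions for the slope, intercept and R-squared. The sketch below shows this on made-up delay pairs. The function name simple_regression is hypothetical, and the printed values bear no relation to the 0.8924 reported above.

```python
def simple_regression(x, y):
    """Closed-form least-squares line y = intercept + slope * x.

    Returns (slope, intercept, R2).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With one predictor, R2 equals the squared correlation of x and y.
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2

# Made-up departure and arrival delays in minutes.
dep_delay = [0, 10, 20, 30, 40]
arr_delay = [-5, 6, 14, 27, 33]

slope, intercept, r2 = simple_regression(dep_delay, arr_delay)
predicted = intercept + slope * 25  # predicted arrival delay for a 25 min departure delay
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R-squared={r2:.4f}")
print(f"predicted arrival delay for 25 min departure delay: {predicted:.2f} min")
```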