Inferences and Predictions in Airline Delay

Problem Description

The dataset consists of information of US flights, related to routes taken along with different airline metrics. The task at hand is to:

  • Predict the probability of airline delay.
  • Binary Classification of flights as on time/delayed.

Team Information

Team Name: Golden Dawn
Team Number: 7
Team Members

NAME PGPID
ABHISHEK VERMA PGP34301
MRADUL VERMA PGP34026
ADITYA KUMAR PGP34251
RAJESWARA SATISH KOMMOJU PGP34336
MESHRAM JETKUMAR BABARAO PGP34325
PRASHANTH KONDURU PGP34372
ABHIJEET HANSDA PGP34249

Data Wrangling

Remove unnessesary columns

  • X1 & V1: Not comprehensible.
  • Southwest, American, Delta, United, Alaska, Jetblue, Skywest, Others: Can be inferred indirectly from airline variable.
  • Weekend: Can be inferred from daywindow variable.

Dataset Description

Name Description
FlightDate Date of Flight(yyyymmdd)
Reporting_Airline Unique Carrier Code
Origin Origin Airport
Dest Destination Airport
CRSDepTime CRS Departure Time (local time: hhmm)
DepTime Actual Departure Time (local time: hhmm)
DepDelay Difference in delay between scheduled & actual
DepDelayMinutes Difference in minutes between scheduled and actual departure time.
DepDel15 Departure Delay Indicator, 15 Minutes or More (1=Yes)
DepartureDelayGroups Departure Delay Groups
DepTimeBlk Departure Time Block
CRSArrTime CRS Arrival Time (local time: hhmm)
ArrTime Actual Arrival Time (local time: hhmm)
ArrDelay Difference in minutes between scheduled and actual arrival
ArrDelayMinutes Difference in minutes between scheduled and actual arrival time.
ArrDel15 Arrival Delay Indicator, 15 Minutes or More (1=Yes)
ArrivalDelayGroups Arrival Delay Groups
ArrTimeBlk Arrival time block
CRSElapsedTime CRS Elapsed Time of Flight, in Minutes
ActualElapsedTime Elapsed Time of Flight, in minutes
AirTime Airtime of Flight, in minutes
Distance Distance travelled by Flight along the route
DepStatus Status of flight departure
ArrStatus Status of Flight arrival
airline Airline operating the Flight
timewindowdep Time window of departure
timewindowarr Time window of arrival
daywindow Weekday/Weekend indicator

The chosen target variable is ArrStatus

Exploratory Data Analysis

Overall Info

Dimensions of the dataframe

## [1] 1774818      28

Column names

##  [1] "FlightDate"           "Reporting_Airline"    "Origin"              
##  [4] "Dest"                 "CRSDepTime"           "DepTime"             
##  [7] "DepDelay"             "DepDelayMinutes"      "DepDel15"            
## [10] "DepartureDelayGroups" "DepTimeBlk"           "CRSArrTime"          
## [13] "ArrTime"              "ArrDelay"             "ArrDelayMinutes"     
## [16] "ArrDel15"             "ArrivalDelayGroups"   "ArrTimeBlk"          
## [19] "CRSElapsedTime"       "ActualElapsedTime"    "AirTime"             
## [22] "Distance"             "DepStatus"            "ArrStatus"           
## [25] "airline"              "timewindowdep"        "timewindowarr"       
## [28] "daywindow"

Structure of the dataframe

## Classes 'tbl_df', 'tbl' and 'data.frame':    1774818 obs. of  28 variables:
##  $ FlightDate          : Date, format: "2018-10-21" "2018-10-22" ...
##  $ Reporting_Airline   : Factor w/ 17 levels "9E","AA","AS",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Origin              : Factor w/ 350 levels "ABE","ABI","ABQ",..: 183 183 183 183 183 183 183 215 215 215 ...
##  $ Dest                : Factor w/ 350 levels "ABE","ABI","ABQ",..: 71 71 71 71 71 71 71 71 71 71 ...
##  $ CRSDepTime          : num  1123 1123 1123 1123 1123 ...
##  $ DepTime             : num  1124 1117 1358 1125 1248 ...
##  $ DepDelay            : num  1 -6 155 2 85 48 10 -5 0 -2 ...
##  $ DepDelayMinutes     : num  1 0 155 2 85 48 10 0 0 0 ...
##  $ DepDel15            : Factor w/ 2 levels "0","1": 1 1 2 1 2 2 1 1 1 1 ...
##  $ DepartureDelayGroups: Factor w/ 15 levels "-2","-1","0",..: 3 2 13 3 8 6 3 2 3 2 ...
##  $ DepTimeBlk          : Factor w/ 19 levels "0001-0559","0600-0659",..: 7 7 7 7 7 7 7 4 4 4 ...
##  $ CRSArrTime          : num  1910 1910 1910 1910 1910 ...
##  $ ArrTime             : num  1919 1927 2133 1922 2009 ...
##  $ ArrDelay            : num  9 17 143 12 59 30 3 1 9 28 ...
##  $ ArrDelayMinutes     : num  9 17 143 12 59 30 3 1 9 28 ...
##  $ ArrDel15            : Factor w/ 2 levels "0","1": 1 2 2 1 2 2 1 1 1 2 ...
##  $ ArrivalDelayGroups  : Factor w/ 15 levels "-2","-1","0",..: 3 4 12 3 6 5 3 3 3 4 ...
##  $ ArrTimeBlk          : Factor w/ 19 levels "0001-0559","0600-0659",..: 15 15 15 15 15 15 15 6 6 6 ...
##  $ CRSElapsedTime      : num  287 287 287 287 287 287 287 127 124 124 ...
##  $ ActualElapsedTime   : num  295 310 275 297 261 269 280 133 133 154 ...
##  $ AirTime             : num  250 244 246 257 234 241 256 96 97 100 ...
##  $ Distance            : num  2125 2125 2125 2125 2125 ...
##  $ DepStatus           : Ord.factor w/ 2 levels "OntimeDep"<"DelayedonDeparture": 2 1 2 2 2 2 2 1 1 1 ...
##  $ ArrStatus           : Ord.factor w/ 2 levels "OntimeArr"<"DelayedonArrival": 2 2 2 2 2 2 2 2 2 2 ...
##  $ airline             : Factor w/ 11 levels "Alaska","American",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ timewindowdep       : Factor w/ 2 levels "AM","PM": 1 1 1 1 1 1 1 1 1 1 ...
##  $ timewindowarr       : Factor w/ 2 levels "AM","PM": 2 2 2 2 2 2 2 1 1 1 ...
##  $ daywindow           : Factor w/ 2 levels "Weekday","Weekend": 2 1 1 1 2 1 1 1 1 1 ...

Structure of the dataframe

##    FlightDate         Reporting_Airline     Origin       
##  Min.   :2018-10-01   WN     :336886    ATL    :  95314  
##  1st Qu.:2018-10-23   DL     :232804    ORD    :  83053  
##  Median :2018-11-15   AA     :226578    DFW    :  67514  
##  Mean   :2018-11-15   OO     :190815    DEN    :  58193  
##  3rd Qu.:2018-12-08   UA     :157853    CLT    :  56584  
##  Max.   :2018-12-31   YX     : 76791    LAX    :  54071  
##                       (Other):553091    (Other):1360089  
##       Dest           CRSDepTime      DepTime        DepDelay       
##  ATL    :  95513   Min.   :   1   Min.   :   1   Min.   :-122.000  
##  ORD    :  82963   1st Qu.: 915   1st Qu.: 918   1st Qu.:  -6.000  
##  DFW    :  67265   Median :1317   Median :1325   Median :  -2.000  
##  DEN    :  58236   Mean   :1326   Mean   :1331   Mean   :   8.398  
##  CLT    :  56523   3rd Qu.:1730   3rd Qu.:1738   3rd Qu.:   5.000  
##  LAX    :  54114   Max.   :2359   Max.   :2400   Max.   :2109.000  
##  (Other):1360204                                                   
##  DepDelayMinutes   DepDel15    DepartureDelayGroups     DepTimeBlk     
##  Min.   :   0.00   0:1474211   -1     :1082733      0600-0659: 127374  
##  1st Qu.:   0.00   1: 300607   0      : 380323      0700-0759: 120254  
##  Median :   0.00               1      : 112659      0800-0859: 117440  
##  Mean   :  11.69               2      :  57943      1700-1759: 114802  
##  3rd Qu.:   5.00               3      :  35380      1200-1259: 113553  
##  Max.   :2109.00               4      :  23190      1100-1159: 111446  
##                                (Other):  82590      (Other)  :1069949  
##    CRSArrTime      ArrTime        ArrDelay        ArrDelayMinutes  
##  Min.   :   1   Min.   :   1   Min.   :-120.000   Min.   :   0.00  
##  1st Qu.:1105   1st Qu.:1056   1st Qu.: -14.000   1st Qu.:   0.00  
##  Median :1516   Median :1510   Median :  -6.000   Median :   0.00  
##  Mean   :1491   Mean   :1476   Mean   :   3.791   Mean   :  12.14  
##  3rd Qu.:1919   3rd Qu.:1916   3rd Qu.:   7.000   3rd Qu.:   7.00  
##  Max.   :2400   Max.   :2400   Max.   :2153.000   Max.   :2153.00  
##                                                                    
##  ArrDel15    ArrivalDelayGroups     ArrTimeBlk      CRSElapsedTime 
##  0:1452702   -1     :719858     1600-1659: 111087   Min.   :-99.0  
##  1: 322116   -2     :389134     1000-1059: 110887   1st Qu.: 88.0  
##              0      :343710     1300-1359: 110013   Median :122.0  
##              1      :128576     1900-1959: 109082   Mean   :140.9  
##              2      : 62021     1800-1859: 108528   3rd Qu.:170.0  
##              3      : 36161     1700-1759: 108492   Max.   :703.0  
##              (Other): 95358     (Other)  :1116729                  
##  ActualElapsedTime    AirTime         Distance     
##  Min.   : 14.0     Min.   :  7.0   Min.   :  31.0  
##  1st Qu.: 84.0     1st Qu.: 60.0   1st Qu.: 363.0  
##  Median :118.0     Median : 92.0   Median : 631.0  
##  Mean   :136.2     Mean   :111.3   Mean   : 797.1  
##  3rd Qu.:166.0     3rd Qu.:141.0   3rd Qu.:1028.0  
##  Max.   :723.0     Max.   :695.0   Max.   :4983.0  
##                                                    
##               DepStatus                  ArrStatus      
##  OntimeDep         :1186200   OntimeArr       :1145617  
##  DelayedonDeparture: 588618   DelayedonArrival: 629201  
##                                                         
##                                                         
##                                                         
##                                                         
##                                                         
##       airline       timewindowdep timewindowarr   daywindow      
##  Others   :399193   AM: 740818    AM: 540856    Weekday:1302690  
##  Southwest:336886   PM:1034000    PM:1233962    Weekend: 472128  
##  Delta    :232804                                                
##  American :226578                                                
##  Skywest  :190815                                                
##  United   :157853                                                
##  (Other)  :230689

Discrete Data Analysis

Proportion of On Time/Delayed flights grouped by airline

##            ArrStatus
## airline     OntimeArr DelayedonArrival
##   Alaska         2.26             1.24
##   American       7.81             4.95
##   Delta          9.58             3.54
##   Frontier       0.94             0.75
##   Hawaiian       0.84             0.35
##   Jetblue        2.47             1.72
##   Others        14.36             8.13
##   Skywest        7.02             3.73
##   Southwest     11.94             7.04
##   Spirit         1.71             0.71
##   United         5.60             3.29

Proportion of On Time/Delayed flights grouped by Weekday/Weekend status

##          ArrStatus
## daywindow OntimeArr DelayedonArrival
##   Weekday     72.43            75.17
##   Weekend     27.57            24.83

Descriptive Statistics of important continuous variables

Average airtime & flight distance for On Time/Delayed flights

Proportion of On Time/Delayed flights grouped by reporting airline

##                  ArrStatus
## Reporting_Airline OntimeArr DelayedonArrival
##                9E      0.02             0.01
##                AA      0.08             0.05
##                AS      0.02             0.01
##                B6      0.02             0.02
##                DL      0.10             0.04
##                EV      0.01             0.01
##                F9      0.01             0.01
##                G4      0.01             0.00
##                HA      0.01             0.00
##                MQ      0.03             0.02
##                NK      0.02             0.01
##                OH      0.03             0.01
##                OO      0.07             0.04
##                UA      0.06             0.03
##                WN      0.12             0.07
##                YV      0.02             0.01
##                YX      0.03             0.01

Proportion of On Time/Delayed flights grouped by time window arrival

##              ArrStatus
## timewindowarr OntimeArr DelayedonArrival
##            AM      0.70             0.30
##            PM      0.62             0.38

Proportion of On Time/Delayed flights grouped by time window departure

##              ArrStatus
## timewindowdep OntimeArr DelayedonArrival
##            AM      0.69             0.31
##            PM      0.61             0.39

Mean plot of On Time/Delayed flights, split by distance

Mean plot of On Time/Delayed flights, split by airtime

Route Analysis

Worst routes (Delayed more than 85% of the time)

## # A tibble: 11,546 x 3
## # Groups:   route [5,812]
##    route   ArrStatus        count
##    <chr>   <ord>            <int>
##  1 ABE-ATL OntimeArr          143
##  2 ABE-ATL DelayedonArrival    99
##  3 ABE-CLT OntimeArr          143
##  4 ABE-CLT DelayedonArrival   100
##  5 ABE-DTW OntimeArr          174
##  6 ABE-DTW DelayedonArrival    83
##  7 ABE-FLL OntimeArr            1
##  8 ABE-FLL DelayedonArrival     7
##  9 ABE-MYR OntimeArr            6
## 10 ABE-MYR DelayedonArrival     3
## # ... with 11,536 more rows
## # A tibble: 5,812 x 3
## # Groups:   route [5,812]
##    route   OntimeArr DelayedonArrival
##    <chr>       <int>            <int>
##  1 ABE-ATL       143               99
##  2 ABE-CLT       143              100
##  3 ABE-DTW       174               83
##  4 ABE-FLL         1                7
##  5 ABE-MYR         6                3
##  6 ABE-ORD        72               39
##  7 ABE-PGD        27               10
##  8 ABE-PHL        30               10
##  9 ABE-PIE        29                8
## 10 ABE-SFB        72               29
## # ... with 5,802 more rows

Date Analysis

Total flights On Time/Delayed grouped by month

Total flights On Time/Delayed grouped by day of month

Correlation

Correlation matrix between important continuous variables

##                       AirTime      Distance ActualElapsedTime
## AirTime           1.000000000  0.9812425334        0.98788870
## Distance          0.981242533  1.0000000000        0.96674226
## ActualElapsedTime 0.987888697  0.9667422574        1.00000000
## ArrDelayMinutes   0.010204164 -0.0005893142        0.03899857
## DepDelayMinutes   0.004144378  0.0045386600        0.01216335
##                   ArrDelayMinutes DepDelayMinutes
## AirTime              0.0102041639     0.004144378
## Distance            -0.0005893142     0.004538660
## ActualElapsedTime    0.0389985688     0.012163348
## ArrDelayMinutes      1.0000000000     0.974960662
## DepDelayMinutes      0.9749606618     1.000000000

Correlation chart between important continous variables

Analysis Summary

  • Around 35% of all flights are delayed which is an important issue to be analyzed further.
  • The delay is dependent on the airline, where Southwest & Other airlines are delayed the most. Sprint, Alaska, American are less delayed than others.
  • Flights are delayed more on Weekdays than on weekends.
  • Flight delay doesn’t quite depend on the airtime or flight distance.
  • Flights are delayed more when they Arrive post 12 PM & Depart post 12 PM.
  • Departure delay groups of -1,0,1,2,3 contribute to more number of delays.
  • Flights are delayed more during the departure time block of 1600 - 2000 HRS.
  • Some reporting airlines(WN, AA, DL) contribute to more delay than others.
  • For 15 routes the delay of the flights is more than 85%.
  • For 4 routes the delay of the flights is less than 5%.
  • There is no significant difference in flight delays based on month of the year.
  • 31st of the month is the day when least number of delays occur.
  • Departure Delay & Arrival Delay are linearly related.
  • Arrival Delays are reduced with increase in Departure Delays.
  • AirTime, Distance, ActualElapsedTime are highly correlated. And same is with ArrDelayMinutes & DepartureDelayMinutes.

Team: #7 (Golden Dawn)

7/30/2019