Inferences and Predictions in Airline Delay
Machine Learning In Marketing Final Project
Inferences and Predictions in Airline Delay
- Problem Description
- Team Information
- Initialization
- Data Wrangling
- Dataset Description
- Exploratory Data Analysis
- Overall Info
- Discrete Data Analysis
- Proportion of On Time/Delayed Flights
- Proportion of On Time/Delayed flights grouped by airline
- Proportion of On Time/Delayed flights grouped by Weekday/Weekend status
- Descriptive Statistics of important continuous variables
- Average airtime & flight distance for On Time/Delayed flights
- Proportion of On Time/Delayed flights grouped by reporting airline
- Proportion of On Time/Delayed flights grouped by time window arrival
- Proportion of On Time/Delayed flights grouped by time window departure
- Box plot of On Time/Delayed flights, split by distance
- Box plot of On Time/Delayed flights, split by airtime
- Mean plot of On Time/Delayed flights, split by distance
- Mean plot of On Time/Delayed flights, split by airtime
- Plot for flight arrival status grouped by departure delay groups
- Plot for number of flights delayed grouped by departure time block
- Plot for number of flights delayed grouped by reporting airline
- Route Analysis
- Date Analysis
- Continous Data Analysis
- Correlation
- Analysis Summary
Problem Description
The dataset consists of information of US flights, related to routes taken along with different airline metrics. The task at hand is to:
- Predict the probability of airline delay.
- Binary Classification of flights as on time/delayed.
Team Information
Team Name: Golden Dawn
Team Number: 7
Team Members
| NAME | PGPID |
|---|---|
| ABHISHEK VERMA | PGP34301 |
| MRADUL VERMA | PGP34026 |
| ADITYA KUMAR | PGP34251 |
| RAJESWARA SATISH KOMMOJU | PGP34336 |
| MESHRAM JETKUMAR BABARAO | PGP34325 |
| PRASHANTH KONDURU | PGP34372 |
| ABHIJEET HANSDA | PGP34249 |
Initialization
Setup Packages
Read dataset
Setup optional parameters
Data Wrangling
Remove unnessesary columns
- X1 & V1: Not comprehensible.
- Southwest, American, Delta, United, Alaska, Jetblue, Skywest, Others: Can be inferred indirectly from airline variable.
- Weekend: Can be inferred from daywindow variable.
Convert to factor variables wherever necessary
Dataset Description
| Name | Description |
|---|---|
| FlightDate | Date of Flight(yyyymmdd) |
| Reporting_Airline | Unique Carrier Code |
| Origin | Origin Airport |
| Dest | Destination Airport |
| CRSDepTime | CRS Departure Time (local time: hhmm) |
| DepTime | Actual Departure Time (local time: hhmm) |
| DepDelay | Difference in delay between scheduled & actual |
| DepDelayMinutes | Difference in minutes between scheduled and actual departure time. |
| DepDel15 | Departure Delay Indicator, 15 Minutes or More (1=Yes) |
| DepartureDelayGroups | Departure Delay Groups |
| DepTimeBlk | Departure Time Block |
| CRSArrTime | CRS Arrival Time (local time: hhmm) |
| ArrTime | Actual Arrival Time (local time: hhmm) |
| ArrDelay | Difference in minutes between scheduled and actual arrival |
| ArrDelayMinutes | Difference in minutes between scheduled and actual arrival time. |
| ArrDel15 | Arrival Delay Indicator, 15 Minutes or More (1=Yes) |
| ArrivalDelayGroups | Arrival Delay Groups |
| ArrTimeBlk | Arrival time block |
| CRSElapsedTime | CRS Elapsed Time of Flight, in Minutes |
| ActualElapsedTime | Elapsed Time of Flight, in minutes |
| AirTime | Airtime of Flight, in minutes |
| Distance | Distance travelled by Flight along the route |
| DepStatus | Status of flight departure |
| ArrStatus | Status of Flight arrival |
| airline | Airline operating the Flight |
| timewindowdep | Time window of departure |
| timewindowarr | Time window of arrival |
| daywindow | Weekday/Weekend indicator |
The chosen target variable is ArrStatus
Exploratory Data Analysis
Overall Info
Column names
## [1] "FlightDate" "Reporting_Airline" "Origin"
## [4] "Dest" "CRSDepTime" "DepTime"
## [7] "DepDelay" "DepDelayMinutes" "DepDel15"
## [10] "DepartureDelayGroups" "DepTimeBlk" "CRSArrTime"
## [13] "ArrTime" "ArrDelay" "ArrDelayMinutes"
## [16] "ArrDel15" "ArrivalDelayGroups" "ArrTimeBlk"
## [19] "CRSElapsedTime" "ActualElapsedTime" "AirTime"
## [22] "Distance" "DepStatus" "ArrStatus"
## [25] "airline" "timewindowdep" "timewindowarr"
## [28] "daywindow"
Structure of the dataframe
## Classes 'tbl_df', 'tbl' and 'data.frame': 1774818 obs. of 28 variables:
## $ FlightDate : Date, format: "2018-10-21" "2018-10-22" ...
## $ Reporting_Airline : Factor w/ 17 levels "9E","AA","AS",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Origin : Factor w/ 350 levels "ABE","ABI","ABQ",..: 183 183 183 183 183 183 183 215 215 215 ...
## $ Dest : Factor w/ 350 levels "ABE","ABI","ABQ",..: 71 71 71 71 71 71 71 71 71 71 ...
## $ CRSDepTime : num 1123 1123 1123 1123 1123 ...
## $ DepTime : num 1124 1117 1358 1125 1248 ...
## $ DepDelay : num 1 -6 155 2 85 48 10 -5 0 -2 ...
## $ DepDelayMinutes : num 1 0 155 2 85 48 10 0 0 0 ...
## $ DepDel15 : Factor w/ 2 levels "0","1": 1 1 2 1 2 2 1 1 1 1 ...
## $ DepartureDelayGroups: Factor w/ 15 levels "-2","-1","0",..: 3 2 13 3 8 6 3 2 3 2 ...
## $ DepTimeBlk : Factor w/ 19 levels "0001-0559","0600-0659",..: 7 7 7 7 7 7 7 4 4 4 ...
## $ CRSArrTime : num 1910 1910 1910 1910 1910 ...
## $ ArrTime : num 1919 1927 2133 1922 2009 ...
## $ ArrDelay : num 9 17 143 12 59 30 3 1 9 28 ...
## $ ArrDelayMinutes : num 9 17 143 12 59 30 3 1 9 28 ...
## $ ArrDel15 : Factor w/ 2 levels "0","1": 1 2 2 1 2 2 1 1 1 2 ...
## $ ArrivalDelayGroups : Factor w/ 15 levels "-2","-1","0",..: 3 4 12 3 6 5 3 3 3 4 ...
## $ ArrTimeBlk : Factor w/ 19 levels "0001-0559","0600-0659",..: 15 15 15 15 15 15 15 6 6 6 ...
## $ CRSElapsedTime : num 287 287 287 287 287 287 287 127 124 124 ...
## $ ActualElapsedTime : num 295 310 275 297 261 269 280 133 133 154 ...
## $ AirTime : num 250 244 246 257 234 241 256 96 97 100 ...
## $ Distance : num 2125 2125 2125 2125 2125 ...
## $ DepStatus : Ord.factor w/ 2 levels "OntimeDep"<"DelayedonDeparture": 2 1 2 2 2 2 2 1 1 1 ...
## $ ArrStatus : Ord.factor w/ 2 levels "OntimeArr"<"DelayedonArrival": 2 2 2 2 2 2 2 2 2 2 ...
## $ airline : Factor w/ 11 levels "Alaska","American",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ timewindowdep : Factor w/ 2 levels "AM","PM": 1 1 1 1 1 1 1 1 1 1 ...
## $ timewindowarr : Factor w/ 2 levels "AM","PM": 2 2 2 2 2 2 2 1 1 1 ...
## $ daywindow : Factor w/ 2 levels "Weekday","Weekend": 2 1 1 1 2 1 1 1 1 1 ...
Structure of the dataframe
## FlightDate Reporting_Airline Origin
## Min. :2018-10-01 WN :336886 ATL : 95314
## 1st Qu.:2018-10-23 DL :232804 ORD : 83053
## Median :2018-11-15 AA :226578 DFW : 67514
## Mean :2018-11-15 OO :190815 DEN : 58193
## 3rd Qu.:2018-12-08 UA :157853 CLT : 56584
## Max. :2018-12-31 YX : 76791 LAX : 54071
## (Other):553091 (Other):1360089
## Dest CRSDepTime DepTime DepDelay
## ATL : 95513 Min. : 1 Min. : 1 Min. :-122.000
## ORD : 82963 1st Qu.: 915 1st Qu.: 918 1st Qu.: -6.000
## DFW : 67265 Median :1317 Median :1325 Median : -2.000
## DEN : 58236 Mean :1326 Mean :1331 Mean : 8.398
## CLT : 56523 3rd Qu.:1730 3rd Qu.:1738 3rd Qu.: 5.000
## LAX : 54114 Max. :2359 Max. :2400 Max. :2109.000
## (Other):1360204
## DepDelayMinutes DepDel15 DepartureDelayGroups DepTimeBlk
## Min. : 0.00 0:1474211 -1 :1082733 0600-0659: 127374
## 1st Qu.: 0.00 1: 300607 0 : 380323 0700-0759: 120254
## Median : 0.00 1 : 112659 0800-0859: 117440
## Mean : 11.69 2 : 57943 1700-1759: 114802
## 3rd Qu.: 5.00 3 : 35380 1200-1259: 113553
## Max. :2109.00 4 : 23190 1100-1159: 111446
## (Other): 82590 (Other) :1069949
## CRSArrTime ArrTime ArrDelay ArrDelayMinutes
## Min. : 1 Min. : 1 Min. :-120.000 Min. : 0.00
## 1st Qu.:1105 1st Qu.:1056 1st Qu.: -14.000 1st Qu.: 0.00
## Median :1516 Median :1510 Median : -6.000 Median : 0.00
## Mean :1491 Mean :1476 Mean : 3.791 Mean : 12.14
## 3rd Qu.:1919 3rd Qu.:1916 3rd Qu.: 7.000 3rd Qu.: 7.00
## Max. :2400 Max. :2400 Max. :2153.000 Max. :2153.00
##
## ArrDel15 ArrivalDelayGroups ArrTimeBlk CRSElapsedTime
## 0:1452702 -1 :719858 1600-1659: 111087 Min. :-99.0
## 1: 322116 -2 :389134 1000-1059: 110887 1st Qu.: 88.0
## 0 :343710 1300-1359: 110013 Median :122.0
## 1 :128576 1900-1959: 109082 Mean :140.9
## 2 : 62021 1800-1859: 108528 3rd Qu.:170.0
## 3 : 36161 1700-1759: 108492 Max. :703.0
## (Other): 95358 (Other) :1116729
## ActualElapsedTime AirTime Distance
## Min. : 14.0 Min. : 7.0 Min. : 31.0
## 1st Qu.: 84.0 1st Qu.: 60.0 1st Qu.: 363.0
## Median :118.0 Median : 92.0 Median : 631.0
## Mean :136.2 Mean :111.3 Mean : 797.1
## 3rd Qu.:166.0 3rd Qu.:141.0 3rd Qu.:1028.0
## Max. :723.0 Max. :695.0 Max. :4983.0
##
## DepStatus ArrStatus
## OntimeDep :1186200 OntimeArr :1145617
## DelayedonDeparture: 588618 DelayedonArrival: 629201
##
##
##
##
##
## airline timewindowdep timewindowarr daywindow
## Others :399193 AM: 740818 AM: 540856 Weekday:1302690
## Southwest:336886 PM:1034000 PM:1233962 Weekend: 472128
## Delta :232804
## American :226578
## Skywest :190815
## United :157853
## (Other) :230689
Discrete Data Analysis
Proportion of On Time/Delayed Flights
## ArrStatus
## OntimeArr DelayedonArrival
## 64.55 35.45
## Visualization
df %>% ggplot(aes(x=ArrStatus, fill=ArrStatus)) + geom_histogram(stat="count") +
scale_fill_manual(values=c("#9F2042","#AEA200"))Proportion of On Time/Delayed flights grouped by airline
## ArrStatus
## airline OntimeArr DelayedonArrival
## Alaska 2.26 1.24
## American 7.81 4.95
## Delta 9.58 3.54
## Frontier 0.94 0.75
## Hawaiian 0.84 0.35
## Jetblue 2.47 1.72
## Others 14.36 8.13
## Skywest 7.02 3.73
## Southwest 11.94 7.04
## Spirit 1.71 0.71
## United 5.60 3.29
## Visualization
df %>% ggplot(aes(x=ArrStatus,fill=airline)) + geom_histogram(stat="count", position = 'dodge')Proportion of On Time/Delayed flights grouped by Weekday/Weekend status
## ArrStatus
## daywindow OntimeArr DelayedonArrival
## Weekday 72.43 75.17
## Weekend 27.57 24.83
## Visualization
df %>% ggplot(aes(x=daywindow, fill=ArrStatus)) + geom_histogram(stat = "count", position="dodge")+
scale_fill_manual(values=c("#9F2042","#AEA200"))Average airtime & flight distance for On Time/Delayed flights
Proportion of On Time/Delayed flights grouped by reporting airline
## ArrStatus
## Reporting_Airline OntimeArr DelayedonArrival
## 9E 0.02 0.01
## AA 0.08 0.05
## AS 0.02 0.01
## B6 0.02 0.02
## DL 0.10 0.04
## EV 0.01 0.01
## F9 0.01 0.01
## G4 0.01 0.00
## HA 0.01 0.00
## MQ 0.03 0.02
## NK 0.02 0.01
## OH 0.03 0.01
## OO 0.07 0.04
## UA 0.06 0.03
## WN 0.12 0.07
## YV 0.02 0.01
## YX 0.03 0.01
Proportion of On Time/Delayed flights grouped by time window arrival
## ArrStatus
## timewindowarr OntimeArr DelayedonArrival
## AM 0.70 0.30
## PM 0.62 0.38
## Visualization
df %>% ggplot(aes(x=timewindowarr, fill=ArrStatus)) + geom_histogram(stat = "count", position="dodge") +
scale_fill_manual(values=c("#9F2042","#AEA200"))Proportion of On Time/Delayed flights grouped by time window departure
## ArrStatus
## timewindowdep OntimeArr DelayedonArrival
## AM 0.69 0.31
## PM 0.61 0.39
## Visualization
df %>% ggplot(aes(x=timewindowdep, fill=ArrStatus)) + geom_histogram(stat = "count", position="dodge") +
scale_fill_manual(values=c("#9F2042","#AEA200"))Box plot of On Time/Delayed flights, split by distance
ggplot(data=df) +
geom_boxplot(mapping = aes(x=ArrStatus, y=Distance,fill = ArrStatus)) +
xlab("Flight Arrival Status") +
ylab("Flight Distance") +
scale_fill_manual(values=c("#9F2042","#AEA200"))Box plot of On Time/Delayed flights, split by airtime
ggplot(data=df) +
geom_boxplot(mapping = aes(x=ArrStatus, y=AirTime,fill = ArrStatus)) +
xlab("Flight Arrival Status") +
ylab("Flight Airtime") +
scale_fill_manual(values=c("#9F2042","#AEA200"))Mean plot of On Time/Delayed flights, split by distance
Mean plot of On Time/Delayed flights, split by airtime
Plot for flight arrival status grouped by departure delay groups
ggplot(data =df) +
geom_bar(mapping = aes(x = DepartureDelayGroups, fill=ArrStatus)) +
xlab("Departure Delay Groups") +
ylab("Number of Flights") +
labs("Flight Arrival Status") +
scale_fill_manual(values=c("#9F2042","#AEA200"))Plot for number of flights delayed grouped by departure time block
df %>% filter(ArrStatus == "DelayedonArrival") %>% ggplot() +
geom_bar(mapping = aes(x = DepTimeBlk, fill = DepTimeBlk),position = "dodge") +
xlab("Departure Time Block") +
ylab("Number of Flights Delayed") +
scale_fill_discrete(guide=guide_legend(title="Departure time block",ncol = 2,keywidth = 2)) +
theme(axis.text.x = element_text(angle=90))Plot for number of flights delayed grouped by reporting airline
df %>% filter(ArrStatus == "DelayedonArrival") %>% ggplot() +
geom_bar(mapping = aes(x = Reporting_Airline, fill = Reporting_Airline),position = "dodge") +
xlab("Reporting Airline") +
ylab("Number of Flights Delayed") +
scale_fill_discrete(guide=guide_legend(title="Reporting Airline",ncol = 2,keywidth = 2))Route Analysis
Worst routes (Delayed more than 85% of the time)
df$route <- paste(Origin, Dest, sep="-")
routes_delay <- df %>% group_by(route, ArrStatus) %>% summarise(count=n())
print(routes_delay)## # A tibble: 11,546 x 3
## # Groups: route [5,812]
## route ArrStatus count
## <chr> <ord> <int>
## 1 ABE-ATL OntimeArr 143
## 2 ABE-ATL DelayedonArrival 99
## 3 ABE-CLT OntimeArr 143
## 4 ABE-CLT DelayedonArrival 100
## 5 ABE-DTW OntimeArr 174
## 6 ABE-DTW DelayedonArrival 83
## 7 ABE-FLL OntimeArr 1
## 8 ABE-FLL DelayedonArrival 7
## 9 ABE-MYR OntimeArr 6
## 10 ABE-MYR DelayedonArrival 3
## # ... with 11,536 more rows
## # A tibble: 5,812 x 3
## # Groups: route [5,812]
## route OntimeArr DelayedonArrival
## <chr> <int> <int>
## 1 ABE-ATL 143 99
## 2 ABE-CLT 143 100
## 3 ABE-DTW 174 83
## 4 ABE-FLL 1 7
## 5 ABE-MYR 6 3
## 6 ABE-ORD 72 39
## 7 ABE-PGD 27 10
## 8 ABE-PHL 30 10
## 9 ABE-PIE 29 8
## 10 ABE-SFB 72 29
## # ... with 5,802 more rows
routes_delay <- routes_delay %>% mutate(per_delay=DelayedonArrival*100/(DelayedonArrival+OntimeArr), per_ontime=100-per_delay)
## Routes where flights are delayed more than 85% of the time
routes_delay %>% filter(per_delay>85)## Visualization
routes_delay %>% filter(per_delay>=85) %>% ggplot(aes(x=route, y=per_delay, fill=route)) + geom_bar(stat="identity") + theme_minimal() + theme(axis.text.x = element_text(angle=45)) + ylab("percentage delay")Date Analysis
Total flights On Time/Delayed grouped by month
df %>% mutate(month = format(FlightDate, "%m")) %>%
group_by(month, ArrStatus) %>%
summarise(total = n())Total flights On Time/Delayed grouped by day of month
df %>% mutate(day = format(FlightDate, "%d")) %>%
group_by(day, ArrStatus) %>%
summarise(total = n()) %>%
mutate(flight_prop_for_day = total / sum(total))Continous Data Analysis
Plot for flight arrival status depending on departure delay
df %>% ggplot(aes(x=ArrTime, fill=ArrStatus)) + geom_histogram(position = "dodge", binwidth = 50) + theme_minimal() +
scale_fill_manual(values=c("#9F2042","#AEA200"))Relation between DepartureDelay & ArrDelay
ggplot(data =df, mapping = aes(x = DepDelayMinutes, y=ArrDelayMinutes, color=ArrStatus)) +
geom_point() +
xlab("Departure Delay In Minutes") +
ylab("Arrival Delay in Minutes") +
scale_color_manual(values=c("#9F2042","#AEA200"))Plot for flight arrival status depending on departure delay
ggplot(data =df, mapping = aes(x = DepDelayMinutes, fill=ArrStatus)) +
geom_histogram(boundary = 0, binwidth = 50) +
xlim(c(0, 500)) +
xlab("Departure Delay In Minutes") +
ylab("Number of Flights") +
scale_fill_manual(values=c("#9F2042","#AEA200"))Relation between Distance & ArrDelay
ggplot(data =df, mapping = aes(x=Distance, y = ArrDelayMinutes, color=ArrStatus)) +
geom_point() +
xlab("Flight Distance") +
ylab("Arrival Delay in minutes") +
scale_color_manual(values=c("#9F2042","#AEA200"))Plot for number of flights delayed depending on Airtime
options(scipen=999)
ggplot(data = df) +
geom_histogram(mapping = aes(x = AirTime, fill = ArrStatus), origin = 0, binwidth = 50, position = "dodge") +
xlab("Airtime(minutes)") +
ylab("Number of Flights Delayed") +
xlim(c(0,500)) +
scale_fill_manual(values=c("#9F2042","#AEA200"),guide=guide_legend(title="Arrival Status"))Correlation
Correlation matrix between important continuous variables
## AirTime Distance ActualElapsedTime
## AirTime 1.000000000 0.9812425334 0.98788870
## Distance 0.981242533 1.0000000000 0.96674226
## ActualElapsedTime 0.987888697 0.9667422574 1.00000000
## ArrDelayMinutes 0.010204164 -0.0005893142 0.03899857
## DepDelayMinutes 0.004144378 0.0045386600 0.01216335
## ArrDelayMinutes DepDelayMinutes
## AirTime 0.0102041639 0.004144378
## Distance -0.0005893142 0.004538660
## ActualElapsedTime 0.0389985688 0.012163348
## ArrDelayMinutes 1.0000000000 0.974960662
## DepDelayMinutes 0.9749606618 1.000000000
Analysis Summary
- Around 35% of all flights are delayed which is an important issue to be analyzed further.
- The delay is dependent on the airline, where Southwest & Other airlines are delayed the most. Sprint, Alaska, American are less delayed than others.
- Flights are delayed more on Weekdays than on weekends.
- Flight delay doesn’t quite depend on the airtime or flight distance.
- Flights are delayed more when they Arrive post 12 PM & Depart post 12 PM.
- Departure delay groups of -1,0,1,2,3 contribute to more number of delays.
- Flights are delayed more during the departure time block of 1600 - 2000 HRS.
- Some reporting airlines(WN, AA, DL) contribute to more delay than others.
- For 15 routes the delay of the flights is more than 85%.
- For 4 routes the delay of the flights is less than 5%.
- There is no significant difference in flight delays based on month of the year.
- 31st of the month is the day when least number of delays occur.
- Departure Delay & Arrival Delay are linearly related.
- Arrival Delays are reduced with increase in Departure Delays.
- AirTime, Distance, ActualElapsedTime are highly correlated. And same is with ArrDelayMinutes & DepartureDelayMinutes.