In this assignment, I will try to represent the flight for the top busiest airport. I will also try to look at different airport and decide which one is the best in term of departing,connection that means the one that have less delay recorded delay on departure and arrival.
Here I will focus on importing the datase and I will perform some manipulation in order for the data to be ready for use for our analysis.
## FlightDate Carrier TailNum FlightNum
## 1/3/2016 : 15878 WN :104154 : 4244 Min. : 1
## 1/4/2016 : 15570 AA : 75580 N489HA : 358 1st Qu.: 702
## 1/15/2016: 15308 DL : 69711 N488HA : 354 Median :1594
## 1/14/2016: 15295 OO : 47619 N490HA : 340 Mean :2079
## 1/29/2016: 15293 EV : 41970 N479HA : 337 3rd Qu.:2763
## 1/22/2016: 15290 UA : 39761 N484HA : 335 Max. :7438
## (Other) :353193 (Other): 67032 (Other):439859
## Origin OriginCityName OriginState
## ATL : 29870 Atlanta, GA : 29870 CA : 56245
## ORD : 18610 Chicago, IL : 25289 TX : 48181
## DEN : 17519 Denver, CO : 17519 FL : 40042
## DFW : 16565 Dallas/Fort Worth, TX: 16565 GA : 30927
## LAX : 16427 Los Angeles, CA : 16427 IL : 25987
## SFO : 13207 Houston, TX : 16200 NY : 20535
## (Other):333629 (Other) :323957 (Other):223910
## Dest DestCityName DestState
## ATL : 29854 Atlanta, GA : 29854 CA : 56225
## ORD : 18605 Chicago, IL : 25281 TX : 48190
## DEN : 17527 Denver, CO : 17527 FL : 40000
## DFW : 16572 Dallas/Fort Worth, TX: 16572 GA : 30912
## LAX : 16406 Los Angeles, CA : 16406 IL : 25984
## SFO : 13206 Houston, TX : 16188 NY : 20531
## (Other):333657 (Other) :323999 (Other):223985
## CRSDepTime DepTime WheelsOff WheelsOn
## Min. : 1 Min. : 1 Min. : 1 Min. : 1
## 1st Qu.: 920 1st Qu.: 924 1st Qu.: 939 1st Qu.:1104
## Median :1325 Median :1331 Median :1344 Median :1519
## Mean :1330 Mean :1334 Mean :1357 Mean :1483
## 3rd Qu.:1730 3rd Qu.:1737 3rd Qu.:1750 3rd Qu.:1914
## Max. :2359 Max. :2400 Max. :2400 Max. :2400
## NA's :11473 NA's :11600 NA's :11907
## CRSArrTime ArrTime Cancelled Diverted
## Min. : 1 Min. : 1 Min. :0.00000 Min. :0.000000
## 1st Qu.:1118 1st Qu.:1108 1st Qu.:0.00000 1st Qu.:0.000000
## Median :1527 Median :1522 Median :0.00000 Median :0.000000
## Mean :1503 Mean :1488 Mean :0.02616 Mean :0.001938
## 3rd Qu.:1920 3rd Qu.:1919 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :2359 Max. :2400 Max. :1.00000 Max. :1.000000
## NA's :11907
## CRSElapsedTime ActualElapsedTime Distance
## Min. : 21.0 Min. : 15.0 Min. : 31.0
## 1st Qu.: 90.0 1st Qu.: 85.0 1st Qu.: 391.0
## Median :128.0 Median :122.0 Median : 679.0
## Mean :146.5 Mean :140.1 Mean : 844.2
## 3rd Qu.:180.0 3rd Qu.:173.0 3rd Qu.:1086.0
## Max. :705.0 Max. :721.0 Max. :4983.0
## NA's :12529
Here we will select the dataset needed, as we are focusing only on top 15 busiest aiport, I will subset the dataset to only those airport. For the busiest aiport I selected Atlanta(ATL), Chicago(ORD), Los Angeles(LAX), Dallas(DFW),New York(JFK), Denver(DEN),San Francisco(SFO),Las Vegas(LAS), Phoenix(PHX), Houston(IAH), Charlotte(CLT), Miami(MIA), Orlando(MCO),Newark(EWR), Seattle(SEA)
Here I will try to check if there is any missing values on this dataset and probably get their number and clean it so we can have a dataset ready to work on.
## 'data.frame': 201465 obs. of 21 variables:
## $ FlightDate : Factor w/ 31 levels "1/1/2016","1/10/2016",..: 28 29 30 31 2 3 4 5 6 7 ...
## $ Carrier : Factor w/ 12 levels "AA","AS","B6",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ TailNum : Factor w/ 4239 levels "","7819A","7820L",..: 1826 1553 1972 1759 1571 1665 1826 2075 1657 1839 ...
## $ FlightNum : int 43 43 43 43 43 43 43 43 43 43 ...
## $ Origin : Factor w/ 294 levels "ABE","ABQ","ABR",..: 77 77 77 77 77 77 77 77 77 77 ...
## $ OriginCityName : Factor w/ 290 levels "Aberdeen, SD",..: 70 70 70 70 70 70 70 70 70 70 ...
## $ OriginState : Factor w/ 52 levels "AK","AL","AR",..: 44 44 44 44 44 44 44 44 44 44 ...
## $ Dest : Factor w/ 294 levels "ABE","ABQ","ABR",..: 82 82 82 82 82 82 82 82 82 82 ...
## $ DestCityName : Factor w/ 290 levels "Aberdeen, SD",..: 76 76 76 76 76 76 76 76 76 76 ...
## $ DestState : Factor w/ 52 levels "AK","AL","AR",..: 21 21 21 21 21 21 21 21 21 21 ...
## $ CRSDepTime : int 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 ...
## $ DepTime : int 1057 1056 1055 1102 1240 1107 1059 1055 1058 1056 ...
## $ WheelsOff : int 1112 1110 1116 1115 1300 1118 1113 1107 1110 1110 ...
## $ WheelsOn : int 1424 1416 1431 1424 1617 1426 1429 1419 1420 1423 ...
## $ CRSArrTime : int 1438 1438 1438 1438 1438 1438 1438 1438 1438 1438 ...
## $ ArrTime : int 1432 1426 1445 1433 1631 1435 1438 1431 1428 1434 ...
## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CRSElapsedTime : int 158 158 158 158 158 158 158 158 158 158 ...
## $ ActualElapsedTime: int 155 150 170 151 171 148 159 156 150 158 ...
## $ Distance : int 986 986 986 986 986 986 986 986 986 986 ...
We will check the number of flights cancelled and delete it as we will not use it.
##
##
## +----------------+--------+
## | Cancelled == 1 | n |
## +================+========+
## | FALSE | 196535 |
## +----------------+--------+
## | TRUE | 4930 |
## +----------------+--------+
Here we will graph the flight cancelled by carier for our dataset and we can technically see that AA(American Airline) recorded more cancellation and the carrier with less cancellation for this data is HA(Hawaiian Airlines). It looks like American Airline has recorded the highest number of cancelation.
We will finish the data calculation as in the lecture notes with some calculation and we will print out different values as needed. Here are the calculation we will perform:
* DepDelay = new_DepTime - new_CRSDepTime
* DepDelayMinutes = ifelse(DepDelay < 0, 0, DepDelay)
* DepDel30 = ifelse(DepDelay >= 30, 1, 0)
* DepDel60 = ifelse(DepDelay >= 60, 1, 0)
* TaxiOut = new_WheelsOff - new_DepTime
* TaxiIn = new_ArrTime - new_WheelsOn
* ArrDelay = new_ArrTime - new_ArrDepTime
* ArrDelayMinutes = ifelse(ArrDelay < 0, 0, ArrDelay)
* ArrDel30 = ifelse(ArrDelay >= 30, 1, 0)
* ArrDel60 = ifelse(ArrDelay >= 60, 1, 0)
* FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime
* AirTime = ActualElapsedTime = TaxiOut - TaxiIn
* AirSpeed = Distance / (Airtime / 60)
Here we want to have the average delay by carrier and then we will focus on calculating the carrier recording the most delay on all destination. We will classify the delay by less then 30 minutes, less than 1 hour and greater then 1 hour. We will determine which carrier is the worse in term on delay and which one is the best based on this data. We will focus also to find the day of the week where there is mostly delay from different flights then we will find which airport is most likely having the highest delay.
## # A tibble: 12 x 3
## Carrier DepDelay Number_Flight
## <fct> <dbl> <int>
## 1 F9 38.8 1151
## 2 OO 38.5 5840
## 3 HA 37.2 83
## 4 NK 36.1 2193
## 5 VX 35.5 1666
## 6 B6 34.0 3480
## 7 EV 32.8 4735
## 8 AA 29.6 16115
## 9 UA 29.5 11633
## 10 DL 26.0 11533
## 11 AS 24.2 1750
## 12 WN 22.7 12404
## # A tibble: 12 x 3
## Carrier DepDelay30 Number_Flight
## <fct> <dbl> <int>
## 1 HA 232. 11
## 2 F9 93.7 390
## 3 UA 83.9 3140
## 4 DL 82.5 2589
## 5 VX 82.1 586
## 6 OO 80.3 2307
## 7 AA 80.0 4527
## 8 EV 77.7 1555
## 9 B6 77.4 1189
## 10 NK 76.6 813
## 11 AS 73.8 410
## 12 WN 68.4 2718
## # A tibble: 12 x 3
## Carrier DepDelay60 Number_Flight
## <fct> <dbl> <int>
## 1 HA 388. 6
## 2 DL 131. 1167
## 3 F9 130. 225
## 4 UA 120. 1655
## 5 AA 119. 2191
## 6 AS 116. 175
## 7 EV 115. 742
## 8 B6 114. 577
## 9 VX 113. 321
## 10 OO 111. 1253
## 11 NK 109. 412
## 12 WN 104. 1146
## # A tibble: 15 x 3
## Origin_Airport DepDelay_Airport Number_Flight
## <fct> <dbl> <int>
## 1 SFO 41.4 277
## 2 SEA 40.0 268
## 3 LAS 36.6 312
## 4 EWR 36.4 217
## 5 JFK 36.0 128
## 6 ORD 30.9 287
## 7 DFW 30.0 225
## 8 MIA 29.9 118
## 9 MCO 29.6 244
## 10 CLT 28.2 175
## 11 PHX 28.1 237
## 12 DEN 27.9 249
## 13 ATL 27.6 249
## 14 LAX 27.3 298
## 15 IAH 26.9 209
## # A tibble: 15 x 3
## Origin_Airport DepDelay30_Airport Number_Flight
## <fct> <dbl> <int>
## 1 SEA 135. 182
## 2 SFO 97.1 246
## 3 EWR 91.9 160
## 4 LAS 88.7 225
## 5 DEN 83.7 195
## 6 JFK 83.1 114
## 7 MCO 80.5 205
## 8 DFW 78.0 155
## 9 CLT 77.9 98
## 10 ORD 76.4 236
## 11 PHX 76.1 163
## 12 LAX 75.6 248
## 13 MIA 73.8 94
## 14 ATL 73.7 203
## 15 IAH 72.5 148
## # A tibble: 15 x 3
## Origin_Airport DepDelay60_Airport Number_Flight
## <fct> <dbl> <int>
## 1 SEA 194. 139
## 2 SFO 144. 208
## 3 EWR 133. 124
## 4 DEN 124. 162
## 5 MCO 123. 165
## 6 LAS 120. 185
## 7 JFK 118. 100
## 8 CLT 117. 73
## 9 DFW 116. 112
## 10 LAX 116. 202
## 11 ORD 113. 194
## 12 PHX 112. 130
## 13 ATL 112. 156
## 14 MIA 110. 77
## 15 IAH 105. 116
## # A tibble: 7 x 4
## Day Max_Delay Average_Delay Number_Flight
## <ord> <dbl> <dbl> <int>
## 1 Sun 520 26.5 493
## 2 Mon 1433 37.7 433
## 3 Tue 375 29.7 358
## 4 Wed 1435 42.4 309
## 5 Thu 265 21.3 311
## 6 Fri 1435 25.5 489
## 7 Sat 1435 42.1 504
Based on the result, Friday, Saturday and Sunday are days with highest number of flight but Saturday also recorded an average of 42 minutes of delay. Monday and Wednesday have a highest average delay 37 and 42 minutes respectively. We can conclude that Thursday is a good day to travel on this Airport with an average of delays less than 22 minutes, but Saturday, Monday and Wednesday are the worst day to travel based on their average delay and they are from F9 (Frontier Airline) and they are ocuring mostly at SEA Airport.