Introduction

In this assignment, I will try to represent the flight for the top busiest airport. I will also try to look at different airport and decide which one is the best in term of departing,connection that means the one that have less delay recorded delay on departure and arrival.

Dataset

Here I will focus on importing the datase and I will perform some manipulation in order for the data to be ready for use for our analysis.

##      FlightDate        Carrier          TailNum         FlightNum   
##  1/3/2016 : 15878   WN     :104154          :  4244   Min.   :   1  
##  1/4/2016 : 15570   AA     : 75580   N489HA :   358   1st Qu.: 702  
##  1/15/2016: 15308   DL     : 69711   N488HA :   354   Median :1594  
##  1/14/2016: 15295   OO     : 47619   N490HA :   340   Mean   :2079  
##  1/29/2016: 15293   EV     : 41970   N479HA :   337   3rd Qu.:2763  
##  1/22/2016: 15290   UA     : 39761   N484HA :   335   Max.   :7438  
##  (Other)  :353193   (Other): 67032   (Other):439859                 
##      Origin                     OriginCityName    OriginState    
##  ATL    : 29870   Atlanta, GA          : 29870   CA     : 56245  
##  ORD    : 18610   Chicago, IL          : 25289   TX     : 48181  
##  DEN    : 17519   Denver, CO           : 17519   FL     : 40042  
##  DFW    : 16565   Dallas/Fort Worth, TX: 16565   GA     : 30927  
##  LAX    : 16427   Los Angeles, CA      : 16427   IL     : 25987  
##  SFO    : 13207   Houston, TX          : 16200   NY     : 20535  
##  (Other):333629   (Other)              :323957   (Other):223910  
##       Dest                       DestCityName      DestState     
##  ATL    : 29854   Atlanta, GA          : 29854   CA     : 56225  
##  ORD    : 18605   Chicago, IL          : 25281   TX     : 48190  
##  DEN    : 17527   Denver, CO           : 17527   FL     : 40000  
##  DFW    : 16572   Dallas/Fort Worth, TX: 16572   GA     : 30912  
##  LAX    : 16406   Los Angeles, CA      : 16406   IL     : 25984  
##  SFO    : 13206   Houston, TX          : 16188   NY     : 20531  
##  (Other):333657   (Other)              :323999   (Other):223985  
##    CRSDepTime      DepTime        WheelsOff        WheelsOn    
##  Min.   :   1   Min.   :   1    Min.   :   1    Min.   :   1   
##  1st Qu.: 920   1st Qu.: 924    1st Qu.: 939    1st Qu.:1104   
##  Median :1325   Median :1331    Median :1344    Median :1519   
##  Mean   :1330   Mean   :1334    Mean   :1357    Mean   :1483   
##  3rd Qu.:1730   3rd Qu.:1737    3rd Qu.:1750    3rd Qu.:1914   
##  Max.   :2359   Max.   :2400    Max.   :2400    Max.   :2400   
##                 NA's   :11473   NA's   :11600   NA's   :11907  
##    CRSArrTime      ArrTime        Cancelled          Diverted       
##  Min.   :   1   Min.   :   1    Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:1118   1st Qu.:1108    1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :1527   Median :1522    Median :0.00000   Median :0.000000  
##  Mean   :1503   Mean   :1488    Mean   :0.02616   Mean   :0.001938  
##  3rd Qu.:1920   3rd Qu.:1919    3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :2359   Max.   :2400    Max.   :1.00000   Max.   :1.000000  
##                 NA's   :11907                                       
##  CRSElapsedTime  ActualElapsedTime    Distance     
##  Min.   : 21.0   Min.   : 15.0     Min.   :  31.0  
##  1st Qu.: 90.0   1st Qu.: 85.0     1st Qu.: 391.0  
##  Median :128.0   Median :122.0     Median : 679.0  
##  Mean   :146.5   Mean   :140.1     Mean   : 844.2  
##  3rd Qu.:180.0   3rd Qu.:173.0     3rd Qu.:1086.0  
##  Max.   :705.0   Max.   :721.0     Max.   :4983.0  
##                  NA's   :12529

Reducing Dataset

Here we will select the dataset needed, as we are focusing only on top 15 busiest aiport, I will subset the dataset to only those airport. For the busiest aiport I selected Atlanta(ATL), Chicago(ORD), Los Angeles(LAX), Dallas(DFW),New York(JFK), Denver(DEN),San Francisco(SFO),Las Vegas(LAS), Phoenix(PHX), Houston(IAH), Charlotte(CLT), Miami(MIA), Orlando(MCO),Newark(EWR), Seattle(SEA)

Checking Missing Values

Here I will try to check if there is any missing values on this dataset and probably get their number and clean it so we can have a dataset ready to work on.

## 'data.frame':    201465 obs. of  21 variables:
##  $ FlightDate       : Factor w/ 31 levels "1/1/2016","1/10/2016",..: 28 29 30 31 2 3 4 5 6 7 ...
##  $ Carrier          : Factor w/ 12 levels "AA","AS","B6",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ TailNum          : Factor w/ 4239 levels "","7819A","7820L",..: 1826 1553 1972 1759 1571 1665 1826 2075 1657 1839 ...
##  $ FlightNum        : int  43 43 43 43 43 43 43 43 43 43 ...
##  $ Origin           : Factor w/ 294 levels "ABE","ABQ","ABR",..: 77 77 77 77 77 77 77 77 77 77 ...
##  $ OriginCityName   : Factor w/ 290 levels "Aberdeen, SD",..: 70 70 70 70 70 70 70 70 70 70 ...
##  $ OriginState      : Factor w/ 52 levels "AK","AL","AR",..: 44 44 44 44 44 44 44 44 44 44 ...
##  $ Dest             : Factor w/ 294 levels "ABE","ABQ","ABR",..: 82 82 82 82 82 82 82 82 82 82 ...
##  $ DestCityName     : Factor w/ 290 levels "Aberdeen, SD",..: 76 76 76 76 76 76 76 76 76 76 ...
##  $ DestState        : Factor w/ 52 levels "AK","AL","AR",..: 21 21 21 21 21 21 21 21 21 21 ...
##  $ CRSDepTime       : int  1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 ...
##  $ DepTime          : int  1057 1056 1055 1102 1240 1107 1059 1055 1058 1056 ...
##  $ WheelsOff        : int  1112 1110 1116 1115 1300 1118 1113 1107 1110 1110 ...
##  $ WheelsOn         : int  1424 1416 1431 1424 1617 1426 1429 1419 1420 1423 ...
##  $ CRSArrTime       : int  1438 1438 1438 1438 1438 1438 1438 1438 1438 1438 ...
##  $ ArrTime          : int  1432 1426 1445 1433 1631 1435 1438 1431 1428 1434 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CRSElapsedTime   : int  158 158 158 158 158 158 158 158 158 158 ...
##  $ ActualElapsedTime: int  155 150 170 151 171 148 159 156 150 158 ...
##  $ Distance         : int  986 986 986 986 986 986 986 986 986 986 ...

Flight Cancelled

Number of Flight Cancelled

We will check the number of flights cancelled and delete it as we will not use it.

## 
## 
## +----------------+--------+
## | Cancelled == 1 |   n    |
## +================+========+
## |     FALSE      | 196535 |
## +----------------+--------+
## |      TRUE      |  4930  |
## +----------------+--------+

Graph Flight Cancelled

Here we will graph the flight cancelled by carier for our dataset and we can technically see that AA(American Airline) recorded more cancellation and the carrier with less cancellation for this data is HA(Hawaiian Airlines). It looks like American Airline has recorded the highest number of cancelation.

Data Transformation and Calculations

Here we will make sure that the data are ready for manipulation, converting the format date.
Here we are converting the CRSDeptTime to a format date.
Using the lecture note code we will convert here to a date-time format:CRSArrTime, DepTime, WheelsOff, WheelsOn, ArrTime.

Additional Calculations

We will finish the data calculation as in the lecture notes with some calculation and we will print out different values as needed. Here are the calculation we will perform:
* DepDelay = new_DepTime - new_CRSDepTime
* DepDelayMinutes = ifelse(DepDelay < 0, 0, DepDelay)
* DepDel30 = ifelse(DepDelay >= 30, 1, 0)
* DepDel60 = ifelse(DepDelay >= 60, 1, 0)
* TaxiOut = new_WheelsOff - new_DepTime
* TaxiIn = new_ArrTime - new_WheelsOn
* ArrDelay = new_ArrTime - new_ArrDepTime
* ArrDelayMinutes = ifelse(ArrDelay < 0, 0, ArrDelay)
* ArrDel30 = ifelse(ArrDelay >= 30, 1, 0)
* ArrDel60 = ifelse(ArrDelay >= 60, 1, 0)
* FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime
* AirTime = ActualElapsedTime = TaxiOut - TaxiIn
* AirSpeed = Distance / (Airtime / 60)

Delay

Here we want to have the average delay by carrier and then we will focus on calculating the carrier recording the most delay on all destination. We will classify the delay by less then 30 minutes, less than 1 hour and greater then 1 hour. We will determine which carrier is the worse in term on delay and which one is the best based on this data. We will focus also to find the day of the week where there is mostly delay from different flights then we will find which airport is most likely having the highest delay.

Carrier Delay (Average)

## # A tibble: 12 x 3
##    Carrier DepDelay Number_Flight
##    <fct>      <dbl>         <int>
##  1 F9          38.8          1151
##  2 OO          38.5          5840
##  3 HA          37.2            83
##  4 NK          36.1          2193
##  5 VX          35.5          1666
##  6 B6          34.0          3480
##  7 EV          32.8          4735
##  8 AA          29.6         16115
##  9 UA          29.5         11633
## 10 DL          26.0         11533
## 11 AS          24.2          1750
## 12 WN          22.7         12404

Carrier Delay greater and equal to 30 minutes

## # A tibble: 12 x 3
##    Carrier DepDelay30 Number_Flight
##    <fct>        <dbl>         <int>
##  1 HA           232.             11
##  2 F9            93.7           390
##  3 UA            83.9          3140
##  4 DL            82.5          2589
##  5 VX            82.1           586
##  6 OO            80.3          2307
##  7 AA            80.0          4527
##  8 EV            77.7          1555
##  9 B6            77.4          1189
## 10 NK            76.6           813
## 11 AS            73.8           410
## 12 WN            68.4          2718

Carrier Delay greater and equal to 60 minutes

## # A tibble: 12 x 3
##    Carrier DepDelay60 Number_Flight
##    <fct>        <dbl>         <int>
##  1 HA            388.             6
##  2 DL            131.          1167
##  3 F9            130.           225
##  4 UA            120.          1655
##  5 AA            119.          2191
##  6 AS            116.           175
##  7 EV            115.           742
##  8 B6            114.           577
##  9 VX            113.           321
## 10 OO            111.          1253
## 11 NK            109.           412
## 12 WN            104.          1146

Airport Delay (Average)

## # A tibble: 15 x 3
##    Origin_Airport DepDelay_Airport Number_Flight
##    <fct>                     <dbl>         <int>
##  1 SFO                        41.4           277
##  2 SEA                        40.0           268
##  3 LAS                        36.6           312
##  4 EWR                        36.4           217
##  5 JFK                        36.0           128
##  6 ORD                        30.9           287
##  7 DFW                        30.0           225
##  8 MIA                        29.9           118
##  9 MCO                        29.6           244
## 10 CLT                        28.2           175
## 11 PHX                        28.1           237
## 12 DEN                        27.9           249
## 13 ATL                        27.6           249
## 14 LAX                        27.3           298
## 15 IAH                        26.9           209

Airport Delay greater and equal to 30 minutes

## # A tibble: 15 x 3
##    Origin_Airport DepDelay30_Airport Number_Flight
##    <fct>                       <dbl>         <int>
##  1 SEA                         135.            182
##  2 SFO                          97.1           246
##  3 EWR                          91.9           160
##  4 LAS                          88.7           225
##  5 DEN                          83.7           195
##  6 JFK                          83.1           114
##  7 MCO                          80.5           205
##  8 DFW                          78.0           155
##  9 CLT                          77.9            98
## 10 ORD                          76.4           236
## 11 PHX                          76.1           163
## 12 LAX                          75.6           248
## 13 MIA                          73.8            94
## 14 ATL                          73.7           203
## 15 IAH                          72.5           148

Airport Delay greater and equal to 60 minutes

## # A tibble: 15 x 3
##    Origin_Airport DepDelay60_Airport Number_Flight
##    <fct>                       <dbl>         <int>
##  1 SEA                          194.           139
##  2 SFO                          144.           208
##  3 EWR                          133.           124
##  4 DEN                          124.           162
##  5 MCO                          123.           165
##  6 LAS                          120.           185
##  7 JFK                          118.           100
##  8 CLT                          117.            73
##  9 DFW                          116.           112
## 10 LAX                          116.           202
## 11 ORD                          113.           194
## 12 PHX                          112.           130
## 13 ATL                          112.           156
## 14 MIA                          110.            77
## 15 IAH                          105.           116

Carrier Delay by Day

Conclusion

## # A tibble: 7 x 4
##   Day   Max_Delay Average_Delay Number_Flight
##   <ord>     <dbl>         <dbl>         <int>
## 1 Sun         520          26.5           493
## 2 Mon        1433          37.7           433
## 3 Tue         375          29.7           358
## 4 Wed        1435          42.4           309
## 5 Thu         265          21.3           311
## 6 Fri        1435          25.5           489
## 7 Sat        1435          42.1           504

Based on the result, Friday, Saturday and Sunday are days with highest number of flight but Saturday also recorded an average of 42 minutes of delay. Monday and Wednesday have a highest average delay 37 and 42 minutes respectively. We can conclude that Thursday is a good day to travel on this Airport with an average of delays less than 22 minutes, but Saturday, Monday and Wednesday are the worst day to travel based on their average delay and they are from F9 (Frontier Airline) and they are ocuring mostly at SEA Airport.