1. INTRODUCTION

US Airways (formerly known as USAir) was a major American airline that ceased to operate independently when the Federal Aviation Administration granted a single operating certificate (SOC) for US Airways and American Airlines on April 8, 2015. Publicly, the two carriers appeared to merge when their reservations systems and booking processes were combined on October 17, 2015; however, other systems were still separate at that time. The airline had an extensive international and domestic network, with 193 destinations in 24 countries in North America, South America, Europe and the Middle East. The airline was a member of the Star Alliance before becoming an affiliate member of Oneworld in March 2014. US Airways operated a fleet of 343 mainline jet aircraft, as well as 278 regional jet and turbo-prop aircraft operated by contract and subsidiary airlines under the name US Airways Express via code-sharing agreements. This paper addresses flight delays: we evaluate the factors that may be responsible for them.

2. OVERVIEW OF THE STUDY

Our field study concerns the factors that cause US Airways flights to be delayed. The dataset comprises variables such as departure time, departure delay, arrival delay and elapsed time, together with delay causes: weather, security, late aircraft, carrier and NAS (National Aviation System) delays. We empirically study how these factors influence flight delays. Our regression analysis reveals that departure delay is strongly associated with arrival delay, NAS delay, departure time and the elapsed time of the flight. Our analysis of the US Airways flights indicates significant delays.

2.1. DATA DESCRIPTION

For this study, the data collected has 33 variables, including the flight number; the date, day and month of each flight; and the scheduled and actual departure and arrival times. The data also include variables that record possible causes of delay, such as weather, security, NAS and carrier delays. The data consist of 700 rows. Airlines report the causes of delay in broad categories that were created by the Air Carrier On-Time Reporting Advisory Committee. The categories are Air Carrier, National Aviation System, Weather, Late-Arriving Aircraft and Security. The causes of cancellation are the same, except there is no late-arriving aircraft category.

Air Carrier: The cause of the cancellation or delay was due to circumstances within the airline’s control (e.g. maintenance or crew problems, aircraft cleaning, baggage loading, fueling, etc.).

Extreme Weather: Significant meteorological conditions (actual or forecasted) that, in the judgment of the carrier, delay or prevent the operation of a flight, such as a tornado, blizzard or hurricane.

National Aviation System (NAS): Delays and cancellations attributable to the national aviation system that refer to a broad set of conditions, such as non-extreme weather conditions, airport operations, heavy traffic volume, and air traffic control.

Late-arriving aircraft: A previous flight with the same aircraft arrived late, causing the present flight to depart late.

Security: Delays or cancellations caused by evacuation of a terminal or concourse, re-boarding of aircraft because of a security breach, inoperative screening equipment and/or long lines in excess of 29 minutes at screening areas.

3. MODEL ANALYSIS

HYPOTHESIS H1: THE DEPARTURE DELAY INCREASES AS THE ARRIVAL DELAY INCREASES.

In order to test the above hypothesis, we proposed the following model:

DepartureDelay = a0 + a1·ArrivalDelay + a2·NasDelay + a3·WeatherDelay + a4·ActualElapsedTime + a5·AirTime + a6·Distance + a7·ArrivalTime + a8·DepartureTime.

3.1. RESULTS

We set out to establish which factors are associated with flight delays. We estimated the model using ordinary least squares. If H1 holds, we expected the coefficient of ArrivalDelay to be positive.

We found empirical support for H1: the departure delay of a flight increases with its arrival delay.

THE COEFFICIENTS OF “ActualElapsedTime”, “ArrivalDelay”, “AirTime” AND “Distance” ARE STATISTICALLY SIGNIFICANT PREDICTORS OF THE DEPARTURE DELAY OF THE FLIGHTS (p < 0.001).

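The intercept (21.55) and the regression coefficient of ArrivalDelay (0.976) are both significantly different from zero (p < 0.001).

Holding the other variables constant, the expected departure delay increases by about 9.8 min for every 10 min increase in the arrival delay of a flight.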

THE MULTIPLE R-SQUARED (0.9578) INDICATES THAT THE MODEL ACCOUNTS FOR 95.78% OF THE VARIANCE IN THE DEPARTURE DELAYS.

THE ADJUSTED R-SQUARED (0.9557) INDICATES THAT THE MODEL STILL EXPLAINS 95.57% OF THE VARIANCE AFTER ADJUSTING FOR THE NUMBER OF PREDICTORS.

THE RESIDUAL STANDARD ERROR (9.296 MIN) CAN BE THOUGHT OF AS THE TYPICAL ERROR IN PREDICTING THE DEPARTURE DELAY OF A FLIGHT USING THIS MODEL.

THE F-STATISTIC (451.6 ON 8 AND 159 DEGREES OF FREEDOM) INDICATES THAT THE MODEL AS A WHOLE IS HIGHLY SIGNIFICANT, WITH A P-VALUE BELOW 2.2e-16 (p < 0.001).
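As a cross-check, the adjusted R-squared and the overall p-value can be recomputed from the figures quoted above. This is a minimal sketch in R, not part of the original analysis: the sample size n = 168 (700 rows minus the 532 dropped for missing values) and the predictor count p = 8 are read off the regression output in Section 9.2 of the appendix, and the variable names are illustrative.

```r
# Figures taken from the model summary in Section 9.2 of the appendix
r2     <- 0.9578   # multiple R-squared
n      <- 168      # rows used in the fit: 700 minus 532 with missing values
p      <- 8        # number of predictors (intercept excluded)
f_stat <- 451.6    # F-statistic on p and n - p - 1 degrees of freedom

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))

# Upper-tail probability of the F distribution gives the overall p-value
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
print(p_value < 2.2e-16)
```

The first value rounds to 0.9557, which agrees with the reported adjusted R-squared, and the comparison in the last line evaluates to TRUE.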

4. CONCLUSION

This project is motivated by the need for research that could improve our understanding of how various factors influence flight delays at US Airways. We found that a delay in the arrival time of a flight is strongly associated with a delay in its departure time.

5. REFERENCES

[1] IMS PRO SCHOOL CASESTUDY

[2] https://old.datahub.io/dataset/us-airline-on-time-performance

[3] https://www.kaggle.com/giovamata/airlinedelaycauses

[4] https://en.wikipedia.org/wiki/US_Airways

[5] https://www.rita.dot.gov/bts/help/aviation/html/understanding.html

6. APPENDIX (SOURCE CODE)

6.1 PRELIMINARY WORK

6.1.1 LIBRARIES

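library(psych)     # descriptive statistics
library(lattice)   # histogram()
library(gplots)
library(car)       # scatterplot(), scatterplotMatrix()
library(corrplot)
library(leaps)     # regsubsets()
library(corrgram)  # corrgram()
library(Hmisc)     # describe(), rcorr()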

6.1.2 Reading the raw data into a dataframe

setwd("C:/Users/Bagga/Desktop/Internship 2018/Project")

start.time = Sys.time()
flight  = read.csv("Airlines.csv", header = TRUE)
end.time  = Sys.time()   # To check the time required to read the CSV file
end.time - start.time
## Time difference of 0.015625 secs
describe(flight)
## flight 
## 
##  33  Variables      700  Observations
## ---------------------------------------------------------------------------
## X.3 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      700        0      700        1     3550     2320    350.9    708.9 
##      .25      .50      .75      .90      .95 
##   1781.8   3676.5   5296.0   6244.0   6618.1 
## 
## lowest :   21   25   42   46   56, highest: 6893 6896 6917 6934 6942
## ---------------------------------------------------------------------------
## X.2 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      700        0      700        1    34199    23192     4244     7143 
##      .25      .50      .75      .90      .95 
##    16393    34017    51941    62765    66105 
## 
## lowest :    85   166   228   435   501, highest: 69771 69809 69833 70026 70050
## ---------------------------------------------------------------------------
## X.1 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      700        0      700        1   353264   236796    37528    65595 
##      .25      .50      .75      .90      .95 
##   175505   356573   528212   631514   666984 
## 
## lowest :     67   2267   2268   2728   2985, highest: 698042 699022 699629 699775 700332
## ---------------------------------------------------------------------------
## X 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      700        0      700        1  3430628  2324322   295527   662373 
##      .25      .50      .75      .90      .95 
##  1734461  3415473  5115834  6266653  6599109 
## 
## lowest :    2610   13514   17400   44951   45971
## highest: 6924792 6935361 6952119 6975314 6980407
## ---------------------------------------------------------------------------
## Year 
##        n  missing distinct     Info     Mean      Gmd 
##      700        0        1        0     2008        0 
##                
## Value      2008
## Frequency   700
## Proportion    1
## ---------------------------------------------------------------------------
## Month 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      700        0       12    0.993    6.243    3.896        1        2 
##      .25      .50      .75      .90      .95 
##        3        6        9       11       12 
##                                                                       
## Value          1     2     3     4     5     6     7     8     9    10
## Frequency     64    51    70    69    61    51    72    62    46    54
## Proportion 0.091 0.073 0.100 0.099 0.087 0.073 0.103 0.089 0.066 0.077
##                       
## Value         11    12
## Frequency     47    53
## Proportion 0.067 0.076
## ---------------------------------------------------------------------------
## DayofMonth 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      700        0       31    0.999    15.37    10.46        2        3 
##      .25      .50      .75      .90      .95 
##        7       16       23       28       29 
## 
## lowest :  1  2  3  4  5, highest: 27 28 29 30 31
## ---------------------------------------------------------------------------
## DayOfWeek 
##        n  missing distinct     Info     Mean      Gmd 
##      700        0        7    0.979    3.896    2.208 
##                                                     
## Value          1     2     3     4     5     6     7
## Frequency     96    99   121   112   102    78    92
## Proportion 0.137 0.141 0.173 0.160 0.146 0.111 0.131
## ---------------------------------------------------------------------------
## DepTime 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      687       13      494        1     1374    533.6    640.3    736.2 
##      .25      .50      .75      .90      .95 
##   1010.0   1355.0   1730.5   2003.8   2121.7 
## 
## lowest :    2  530  538  541  542, highest: 2310 2314 2321 2340 2355
## ---------------------------------------------------------------------------
## CRSDepTime 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      700        0      317        1     1362    515.3      645      740 
##      .25      .50      .75      .90      .95 
##     1014     1342     1719     1955     2105 
## 
## lowest :  535  540  545  557  600, highest: 2240 2254 2300 2315 2355
## ---------------------------------------------------------------------------
## ArrTime 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      687       13      501        1     1515    556.5    749.6    853.0 
##      .25      .50      .75      .90      .95 
##   1152.0   1543.0   1911.0   2138.4   2233.0 
## 
## lowest :    3    4   17   23   34, highest: 2343 2346 2350 2351 2356
## ---------------------------------------------------------------------------
## CRSArrTime 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      700        0      448        1     1534    527.8    819.9    912.7 
##      .25      .50      .75      .90      .95 
##   1209.0   1550.5   1910.0   2145.8   2235.2 
## 
## lowest :    4    5   20   40  137, highest: 2350 2354 2355 2358 2359
## ---------------------------------------------------------------------------
## UniqueCarrier 
##        n  missing distinct 
##      700        0       20 
##                                                                       
## Value         9E    AA    AQ    AS    B6    CO    DL    EV    F9    FL
## Frequency     23    49     1    22    32    28    37    21    10    30
## Proportion 0.033 0.070 0.001 0.031 0.046 0.040 0.053 0.030 0.014 0.043
##                                                                       
## Value         HA    MQ    NW    OH    OO    UA    US    WN    XE    YV
## Frequency      8    60    39    17    59    42    47   118    28    29
## Proportion 0.011 0.086 0.056 0.024 0.084 0.060 0.067 0.169 0.040 0.041
## ---------------------------------------------------------------------------
## FlightNum 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      700        0      642        1     2156     2112    111.0    231.9 
##      .25      .50      .75      .90      .95 
##    542.8   1470.0   3433.5   5368.0   5920.6 
## 
## lowest :    1    3    8    9   10, highest: 7275 7307 7311 7755 7773
## ---------------------------------------------------------------------------
## TailNum 
##        n  missing distinct 
##      700        0      641 
## 
## lowest :        80139E 80359E 80419E 83909E, highest: N986CA N989AT N989CA N992DL N995AT
## ---------------------------------------------------------------------------
## ActualElapsedTime 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      686       14      227        1      130    75.75     53.0     61.0 
##      .25      .50      .75      .90      .95 
##     79.0    111.0    161.0    227.5    284.5 
## 
## lowest :  32  35  36  38  39, highest: 403 407 414 422 576
## ---------------------------------------------------------------------------
## CRSElapsedTime 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      700        0      213        1    131.1     74.8    52.95    65.00 
##      .25      .50      .75      .90      .95 
##    80.00   111.00   162.00   229.00   279.00 
## 
## lowest :  30  34  35  37  40, highest: 389 406 410 415 575
## ---------------------------------------------------------------------------
## AirTime 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      686       14      223        1    106.4       72     34.0     42.5 
##      .25      .50      .75      .90      .95 
##     57.0     88.0    135.0    197.5    249.5 
## 
## lowest :  15  19  20  21  22, highest: 363 374 378 396 554
## ---------------------------------------------------------------------------
## ArrDelay 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      686       14      126        1     8.57    30.28      -23      -17 
##      .25      .50      .75      .90      .95 
##       -9       -1       14       41       71 
## 
## lowest : -62 -60 -38 -37 -34, highest: 169 177 180 234 264
## ---------------------------------------------------------------------------
## DepDelay 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      687       13      106    0.998    10.17    23.85     -9.0     -7.0 
##      .25      .50      .75      .90      .95 
##     -4.0      0.0     11.0     37.4     72.0 
## 
## lowest : -26 -21 -15 -14 -13, highest: 171 173 175 214 262
## ---------------------------------------------------------------------------
## Origin 
##        n  missing distinct 
##      700        0      134 
## 
## lowest : ABQ ACT ACV ALB ANC, highest: TUL TUS TYS VPS XNA
## ---------------------------------------------------------------------------
## Dest 
##        n  missing distinct 
##      700        0      140 
## 
## lowest : ABI ABQ ALB AMA ANC, highest: TUL TUS TYS WRG XNA
## ---------------------------------------------------------------------------
## Distance 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      700        0      448        1    743.1    594.5    156.0    214.0 
##      .25      .50      .75      .90      .95 
##    330.5    589.0    967.0   1522.0   1979.5 
## 
## lowest :   49   67   74   82   86, highest: 2689 2762 2936 2979 4962
## ---------------------------------------------------------------------------
## TaxiIn 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      687       13       31    0.986    6.844    4.546        2        3 
##      .25      .50      .75      .90      .95 
##        4        6        8       12       15 
## 
## lowest :  1  2  3  4  5, highest: 34 35 39 44 77
## ---------------------------------------------------------------------------
## TaxiOut 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      687       13       56    0.997    16.96    10.49        7        8 
##      .25      .50      .75      .90      .95 
##       10       14       19       28       35 
## 
## lowest :   4   5   6   7   8, highest:  81  88 102 137 152
## ---------------------------------------------------------------------------
## Cancelled 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      700        0        2    0.055       13  0.01857  0.03651 
## 
## ---------------------------------------------------------------------------
## CancellationCode 
##        n  missing distinct 
##      700        0        4 
##                                   
## Value                A     B     C
## Frequency    687     1     8     4
## Proportion 0.981 0.001 0.011 0.006
## ---------------------------------------------------------------------------
## Diverted 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      700        0        2    0.004        1 0.001429 0.002857 
## 
## ---------------------------------------------------------------------------
## CarrierDelay 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      168      532       38    0.756    8.649    14.32      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.0      8.0     30.0     43.6 
## 
## lowest :   0   1   2   3   4, highest:  76  80  94  96 108
## ---------------------------------------------------------------------------
## WeatherDelay 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      168      532       11    0.184    2.065    4.002     0.00     0.00 
##      .25      .50      .75      .90      .95 
##     0.00     0.00     0.00     0.00     7.65 
##                                                                       
## Value          0     2     7     8    13    20    30    35    50    72
## Frequency    157     1     1     1     1     1     2     1     1     1
## Proportion 0.935 0.006 0.006 0.006 0.006 0.006 0.012 0.006 0.006 0.006
##                 
## Value         80
## Frequency      1
## Proportion 0.006
## ---------------------------------------------------------------------------
## NASDelay 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      168      532       55    0.947    18.38     26.2     0.00     0.00 
##      .25      .50      .75      .90      .95 
##     0.00     5.50    23.25    51.30    80.95 
## 
## lowest :   0   1   2   3   4, highest:  97 146 154 157 167
## ---------------------------------------------------------------------------
## SecurityDelay 
##        n  missing distinct     Info     Mean      Gmd 
##      168      532        1        0        0        0 
##               
## Value        0
## Frequency  168
## Proportion   1
## ---------------------------------------------------------------------------
## LateAircraftDelay 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      168      532       57      0.9    22.62    33.46      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      4.5     28.0     65.2    101.6 
## 
## lowest :   0   1   3   4   5, highest: 151 173 174 209 242
## ---------------------------------------------------------------------------

6.2 Inspecting the data types and converting the data type of some columns

6.2.1 Converting day of week

flight$DayOfWeek[flight$DayOfWeek == 1] = 'Mon'
flight$DayOfWeek[flight$DayOfWeek == 2] = 'Tue'
flight$DayOfWeek[flight$DayOfWeek == 3] = 'Wed'
flight$DayOfWeek[flight$DayOfWeek == 4] = 'Thu'
flight$DayOfWeek[flight$DayOfWeek == 5] = 'Fri'
flight$DayOfWeek[flight$DayOfWeek == 6] = 'Sat'
flight$DayOfWeek[flight$DayOfWeek == 7] = 'Sun'
flight$DayOfWeek <- factor(flight$DayOfWeek)
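The recoding above can also be written as a single factor() call. The sketch below is an alternative, not the code used for the results in this report; it runs on a small made-up vector of day codes and assumes the same 1 = Monday to 7 = Sunday coding. Unlike the default alphabetical ordering, it keeps the levels in calendar order.

```r
# Hypothetical sample of day codes (1 = Monday ... 7 = Sunday)
day_codes <- c(1, 3, 7, 5, 3)

day_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# levels fixes the code order, labels attaches the names in that order
day_factor <- factor(day_codes, levels = 1:7, labels = day_labels)

print(day_factor)
print(levels(day_factor))
```

Keeping the levels in calendar order matters mainly for plots and tables, where alphabetical order would put Friday first.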

6.2.2 Converting month

flight$Month[flight$Month == 1]  = 'Jan'
flight$Month[flight$Month == 2]  = 'Feb'
flight$Month[flight$Month == 3]  = 'Mar'
flight$Month[flight$Month == 4]  = 'Apr'
flight$Month[flight$Month == 5]  = 'May'
flight$Month[flight$Month == 6]  = 'Jun'
flight$Month[flight$Month == 7]  = 'Jul'
flight$Month[flight$Month == 8]  = 'Aug'
flight$Month[flight$Month == 9]  = 'Sep'
flight$Month[flight$Month == 10] = 'Oct'
flight$Month[flight$Month == 11] = 'Nov'
flight$Month[flight$Month == 12] = 'Dec'
flight$Month <- factor(flight$Month)
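For months, base R already provides the abbreviations in the built-in constant month.abb, so the twelve assignments can be replaced by a lookup. This is a sketch on a made-up vector of month numbers, offered as an alternative rather than as the code behind the reported output.

```r
# Hypothetical sample of month numbers
month_codes <- c(5, 9, 12, 4, 1)

# month.abb is a constant built into base R: "Jan", "Feb", ..., "Dec"
month_factor <- factor(month.abb[month_codes], levels = month.abb)

print(month_factor)
print(nlevels(month_factor))
```

Passing levels = month.abb keeps all twelve months as levels, in calendar order, even when some months do not occur in the data.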

6.2.3 Converting cancelled flag from 0 and 1 to ‘N’ and ‘Y’ respectively

flight$Cancelled[flight$Cancelled == 0] = "N"
flight$Cancelled[flight$Cancelled == 1] = "Y"
flight$Cancelled <- factor(flight$Cancelled)

6.3 Creating different data frames for different cancellation reasons

carrier_cancel  = flight[flight$Cancelled == 'Y' & flight$CancellationCode == 'A',]
weather_cancel  = flight[flight$Cancelled == 'Y' & flight$CancellationCode == 'B',]
nas_cancel      = flight[flight$Cancelled == 'Y' & flight$CancellationCode == 'C',]
security_cancel = flight[flight$Cancelled == 'Y' & flight$CancellationCode == 'D',]
flight$CancellationCode <- factor(flight$CancellationCode)
str(flight)
## 'data.frame':    700 obs. of  33 variables:
##  $ X.3              : int  5877 6347 2623 6490 1690 6022 1005 4282 6130 3096 ...
##  $ X.2              : int  12233 45278 1914 45003 35084 60316 3584 26851 10118 23446 ...
##  $ X.1              : int  419707 383820 502241 544300 261196 304594 684985 164963 610162 369467 ...
##  $ X                : int  529794 2402000 5238992 826229 858053 3439654 5887084 281445 5737438 6217356 ...
##  $ Year             : int  2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
##  $ Month            : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 9 12 4 4 7 11 5 11 10 ...
##  $ DayofMonth       : int  28 7 10 21 9 18 2 21 23 25 ...
##  $ DayOfWeek        : Factor w/ 7 levels "Fri","Mon","Sat",..: 2 7 7 5 3 7 5 2 5 6 ...
##  $ DepTime          : int  613 1209 722 942 1243 1432 744 1459 1855 742 ...
##  $ CRSDepTime       : int  615 1200 725 905 1249 1430 755 1440 1857 745 ...
##  $ ArrTime          : int  706 1303 855 1244 1507 1649 1011 1701 1940 955 ...
##  $ CRSArrTime       : int  706 1255 858 1217 1518 1621 1029 1652 1940 1012 ...
##  $ UniqueCarrier    : Factor w/ 20 levels "9E","AA","AQ",..: 3 18 13 16 17 13 5 17 11 7 ...
##  $ FlightNum        : int  42 24 866 528 206 955 401 1410 308 1112 ...
##  $ TailNum          : Factor w/ 641 levels "","80139E","80359E",..: 520 395 482 524 378 149 71 608 264 175 ...
##  $ ActualElapsedTime: int  53 54 93 122 84 257 207 122 45 73 ...
##  $ CRSElapsedTime   : int  51 55 93 132 89 231 214 132 43 87 ...
##  $ AirTime          : int  41 42 62 104 65 214 189 101 29 58 ...
##  $ ArrDelay         : int  0 8 -3 27 -11 28 -18 9 0 -17 ...
##  $ DepDelay         : int  -2 9 -3 37 -6 2 -11 19 -2 -3 ...
##  $ Origin           : Factor w/ 134 levels "ABQ","ACT","ACV",..: 55 56 92 99 22 84 16 7 55 70 ...
##  $ Dest             : Factor w/ 140 levels "ABI","ABQ","ALB",..: 61 32 133 17 108 69 93 107 66 129 ...
##  $ Distance         : int  216 239 449 867 369 1619 1367 665 163 368 ...
##  $ TaxiIn           : int  5 3 14 6 7 10 3 3 4 4 ...
##  $ TaxiOut          : int  7 9 17 12 12 33 15 18 12 11 ...
##  $ Cancelled        : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ CancellationCode : Factor w/ 4 levels "","A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CarrierDelay     : int  NA NA NA 0 NA 0 NA NA NA NA ...
##  $ WeatherDelay     : int  NA NA NA 0 NA 0 NA NA NA NA ...
##  $ NASDelay         : int  NA NA NA 0 NA 28 NA NA NA NA ...
##  $ SecurityDelay    : int  NA NA NA 0 NA 0 NA NA NA NA ...
##  $ LateAircraftDelay: int  NA NA NA 27 NA 0 NA NA NA NA ...
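The four per-reason subsets created in 6.3 can also be produced in one step with split(). The sketch below uses a hypothetical miniature table (the name toy and its values are made up) with only the two columns involved; codes that do not occur in the data, such as 'D' here, simply yield no list element.

```r
# Hypothetical miniature of the flight table; only the two columns used here
toy <- data.frame(
  Cancelled        = c("N", "Y", "Y", "N", "Y"),
  CancellationCode = c("",  "A", "B", "",  "B"),
  stringsAsFactors = FALSE
)

cancelled <- toy[toy$Cancelled == "Y", ]

# split() returns a named list with one data frame per cancellation code
by_code <- split(cancelled, cancelled$CancellationCode)

print(names(by_code))
print(sapply(by_code, nrow))
```

Individual subsets are then available as by_code$A, by_code$B and so on.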

6.4 Contingency table showing the number of flights cancelled (yes/no)

allcancelled <- table(flight$Cancelled)
allcancelled
## 
##   N   Y 
## 687  13

6.4.1 Contingency table showing the number of flights cancelled as per the cancellation code.

allcancelled1 <- table(flight$CancellationCode)
allcancelled1
## 
##       A   B   C 
## 687   1   8   4

6.5 Creating different data frames for different flight timings

not_cancelled      = flight[!(is.na(flight$DepDelay) | flight$DepDelay == ""), ]
delayed_flight     = not_cancelled[not_cancelled$DepDelay >  0, ]
on_time_flight     = not_cancelled[not_cancelled$DepDelay == 0, ]
before_time_flight = not_cancelled[not_cancelled$DepDelay <  0, ]
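Instead of three separate data frames, the same classification can be stored as one status column. This is a sketch on made-up delay values, and the label names are illustrative rather than taken from the report.

```r
# Hypothetical departure delays in minutes; NA marks a cancelled flight
dep_delay <- c(-5, 0, 12, NA, 37, -1, 0)

# sign() maps negative, zero and positive delays to -1, 0 and 1
status <- factor(sign(dep_delay),
                 levels = c(-1, 0, 1),
                 labels = c("early", "on_time", "delayed"))

# table() leaves out the NA entry by default
print(table(status))
```

With this sample, each of the three categories contains two flights and the cancelled flight stays NA.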

6.6 Counting the total number of cancellations due to all causes for each carrier

all_cancelled_table = table(flight$UniqueCarrier, flight$Cancelled)
write.csv(all_cancelled_table, "all_cancelled_count.csv")
all_cancelled_count = read.csv("all_cancelled_count.csv")
names(all_cancelled_count)[names(all_cancelled_count) == 'X'] = 'UniqueCarrier'
names(all_cancelled_count)[names(all_cancelled_count) == 'N'] = 'not_cancelled'
names(all_cancelled_count)[names(all_cancelled_count) == 'Y'] = 'total_cancelled'
rm(all_cancelled_table)
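The round trip through a CSV file is one way to turn the contingency table into a data frame. The conversion can also be done in memory, as in the sketch below, which uses a hypothetical six-row table (toy) in place of the flight data and reuses the column names from above.

```r
# Hypothetical miniature of the carrier and cancellation columns
toy <- data.frame(
  UniqueCarrier = c("AA", "AA", "US", "US", "US", "WN"),
  Cancelled     = c("N",  "Y",  "N",  "N",  "Y",  "N")
)

counts <- table(toy$UniqueCarrier, toy$Cancelled)

# as.data.frame.matrix keeps the wide layout: one row per carrier
wide <- as.data.frame.matrix(counts)
all_cancelled_count <- data.frame(
  UniqueCarrier   = rownames(wide),
  not_cancelled   = wide$N,
  total_cancelled = wide$Y
)

print(all_cancelled_count)
```

This avoids writing an intermediate file to the working directory and gives the same three columns.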

6.6.1 Contingency table specifying the number of flights cancelled per unique carrier

#my_table <- xtabs(~ UniqueCarrier + total_cancelled, data = all_cancelled_count)
#addmargins(my_table)

6.7 VISUALIZATIONS

6.7.1 Distribution of Carrier Delay

histogram(~ CarrierDelay, data = flight ,
          main = "Distribution of Carrier Delay",
          xlab = "Carrier Delay",col = "grey")

6.7.2 Distribution of Weather Delay

histogram(~ WeatherDelay, data = flight ,
          main = "Distribution of Weather Delay",
          xlab = "Weather Delay",col = "grey")

6.7.3 Distribution of NAS Delay

histogram(~ NASDelay, data = flight ,
          main = "Distribution of NAS Delay",
          xlab = "NAS Delay",col = "grey")

6.7.4 Distribution of Security Delay

histogram(~ SecurityDelay, data = flight ,
          main = "Distribution of Security Delay",
          xlab = "Security Delay",col = "grey")

6.7.5 Distribution of delayed flights

scatterplot(DepDelay ~ DepTime, data = delayed_flight,
            main = "Scatterplot of Delayed Flights vs their Departure Time")

6.8 Analysing delayed flights against time-related variables

scatterplotMatrix(~ ArrDelay + DepDelay + DepTime + ArrTime + ActualElapsedTime, data = delayed_flight,
                  main = "Delaying of flights on various Time Factors")

7. T-tests

7.1 Performing T-tests on Departure and Arrival Delay

#t.test(flight$DepDelay, flight$ArrDelay)
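A comparison of this kind can be run by passing the two columns to t.test() as numeric vectors. The sketch below uses the first six DepDelay and ArrDelay values shown in the str() output above; treating the test as paired is an assumption made here, on the grounds that both delays are measured on the same flight.

```r
# First six values of DepDelay and ArrDelay from the str() output (minutes)
dep_delay <- c(-2, 9, -3, 37, -6, 2)
arr_delay <- c(0, 8, -3, 27, -11, 28)

# paired = TRUE compares the two delays flight by flight
result <- t.test(dep_delay, arr_delay, paired = TRUE)

print(result$estimate)   # mean of the pairwise differences
print(result$p.value)
```

With only six flights the result is purely illustrative; on the full data the same call would use flight$DepDelay and flight$ArrDelay.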

7.2 Performing T-tests on Departure Delay and Departure Time

#t.test(flight$DepDelay, flight$DepTime)

7.3 Performing T-tests on Distance and Departure Delay

#t.test(flight$Distance, flight$DepDelay)

7.3.1 Visualising the effect of distance on delayed flights

scatterplot(DepDelay ~ Distance, data = delayed_flight,
            main = "Scatterplot of Delayed Flights vs their Distance")

8. Applying Correlation on Various time factors

8.1 Correlation Matrix

colflights <- c("ArrDelay","DepDelay","DepTime","ArrTime","ActualElapsedTime","AirTime")
corMatrix <- rcorr(as.matrix(flight[,colflights]))
corMatrix
##                   ArrDelay DepDelay DepTime ArrTime ActualElapsedTime
## ArrDelay              1.00     0.91    0.25    0.02              0.09
## DepDelay              0.91     1.00    0.29    0.00              0.02
## DepTime               0.25     0.29    1.00    0.67             -0.05
## ArrTime               0.02     0.00    0.67    1.00             -0.02
## ActualElapsedTime     0.09     0.02   -0.05   -0.02              1.00
## AirTime               0.02     0.00   -0.06   -0.03              0.98
##                   AirTime
## ArrDelay             0.02
## DepDelay             0.00
## DepTime             -0.06
## ArrTime             -0.03
## ActualElapsedTime    0.98
## AirTime              1.00
## 
## n
##                   ArrDelay DepDelay DepTime ArrTime ActualElapsedTime
## ArrDelay               686      686     686     686               686
## DepDelay               686      687     687     687               686
## DepTime                686      687     687     687               686
## ArrTime                686      687     687     687               686
## ActualElapsedTime      686      686     686     686               686
## AirTime                686      686     686     686               686
##                   AirTime
## ArrDelay              686
## DepDelay              686
## DepTime               686
## ArrTime               686
## ActualElapsedTime     686
## AirTime               686
## 
## P
##                   ArrDelay DepDelay DepTime ArrTime ActualElapsedTime
## ArrDelay                   0.0000   0.0000  0.5863  0.0243           
## DepDelay          0.0000            0.0000  0.9965  0.6508           
## DepTime           0.0000   0.0000           0.0000  0.2208           
## ArrTime           0.5863   0.9965   0.0000          0.5561           
## ActualElapsedTime 0.0243   0.6508   0.2208  0.5561                   
## AirTime           0.6013   0.9960   0.1203  0.3696  0.0000           
##                   AirTime
## ArrDelay          0.6013 
## DepDelay          0.9960 
## DepTime           0.1203 
## ArrTime           0.3696 
## ActualElapsedTime 0.0000 
## AirTime
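A single entry of the matrix above, together with its p-value, can be reproduced in base R without Hmisc. This is a sketch on made-up delay values with a few missing entries, not a re-run of the analysis.

```r
# Made-up delays in minutes; NA marks a flight with no recorded value
dep_delay <- c(-2, 9, -3, 37, -6, 2, NA, 19)
arr_delay <- c(0, 8, -3, 27, -11, 28, 5, NA)

# Pearson correlation using only the rows where both values are present
r <- cor(dep_delay, arr_delay, use = "pairwise.complete.obs")

# cor.test() drops incomplete pairs itself and reports the p-value
ct <- cor.test(dep_delay, arr_delay)

print(round(r, 4))
print(ct$p.value)
```

The pairwise handling of missing values explains why the n block above shows 686 for some pairs and 687 for others.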

8.1.1 Visualising Correlation Matrix by Corrgram

corrgram(flight[,colflights])

9. REGRESSION

Formulating a multiple linear regression model for departure delay, followed by variable selection

9.1 Proposed Model 1

Independent Variables: {“ArrDelay”,“DepTime”,“ArrTime”,“ActualElapsedTime”,“AirTime”,“CarrierDelay”,“NASDelay”,“WeatherDelay”, “Distance”}

Dependent Variable : {“DepDelay”}

lm_model <- DepDelay ~ ArrTime + DepTime + ActualElapsedTime + ArrDelay +  CarrierDelay + NASDelay + WeatherDelay  + AirTime + Distance 
fit <- lm(lm_model,data = flight)
summary(fit)
## 
## Call:
## lm(formula = lm_model, data = flight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.2315  -5.3353  -0.9217   5.4270  27.2667 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       21.423013   4.282595   5.002 1.49e-06 ***
## ArrTime           -0.001450   0.001509  -0.961  0.33811    
## DepTime            0.005012   0.001996   2.510  0.01307 *  
## ActualElapsedTime -0.756171   0.043730 -17.292  < 2e-16 ***
## ArrDelay           0.978137   0.022125  44.209  < 2e-16 ***
## CarrierDelay      -0.018158   0.041519  -0.437  0.66247    
## NASDelay          -0.099523   0.033412  -2.979  0.00335 ** 
## WeatherDelay       0.119811   0.074927   1.599  0.11181    
## AirTime            0.367126   0.058026   6.327 2.46e-09 ***
## Distance           0.045913   0.006459   7.109 3.80e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.32 on 158 degrees of freedom
##   (532 observations deleted due to missingness)
## Multiple R-squared:  0.9579, Adjusted R-squared:  0.9555 
## F-statistic: 399.4 on 9 and 158 DF,  p-value: < 2.2e-16

9.1.1 Selecting the best variables for model 1

leap <- regsubsets(lm_model,data=flight, nbest = 1)
plot(leap, scale = "adjr2")

9.2 Formulating model 2

Independent Variables: {“ArrDelay”,“DepTime”,“ArrTime”,“ActualElapsedTime”,“AirTime”,“NASDelay”,“WeatherDelay”, “Distance”}

Dependent Variable : {“DepDelay”}

lm_model1 <- DepDelay ~ ArrTime + DepTime + ActualElapsedTime + ArrDelay + NASDelay + WeatherDelay + AirTime + Distance
fit1 <- lm(lm_model1,data = flight)
summary(fit1)
## 
## Call:
## lm(formula = lm_model1, data = flight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.3450  -5.2742  -0.8317   5.1974  27.3391 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       21.548052   4.262160   5.056 1.17e-06 ***
## ArrTime           -0.001588   0.001471  -1.080  0.28191    
## DepTime            0.005025   0.001991   2.524  0.01259 *  
## ActualElapsedTime -0.755842   0.043612 -17.331  < 2e-16 ***
## ArrDelay           0.975834   0.021435  45.526  < 2e-16 ***
## NASDelay          -0.095942   0.032310  -2.969  0.00345 ** 
## WeatherDelay       0.123855   0.074164   1.670  0.09688 .  
## AirTime            0.363753   0.057364   6.341 2.26e-09 ***
## Distance           0.046306   0.006380   7.258 1.64e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.296 on 159 degrees of freedom
##   (532 observations deleted due to missingness)
## Multiple R-squared:  0.9578, Adjusted R-squared:  0.9557 
## F-statistic: 451.6 on 8 and 159 DF,  p-value: < 2.2e-16
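Model 2 differs from Model 1 only in dropping CarrierDelay. One way to check a reduction of this kind is a partial F-test on the two nested fits, alongside their adjusted R-squared values. The sketch below runs on simulated data, so every variable and coefficient in it is hypothetical and it does not reproduce the results above.

```r
set.seed(1)  # synthetic data, for reproducibility only

n <- 200
arr_delay     <- rnorm(n, mean = 10, sd = 30)
nas_delay     <- rexp(n, rate = 1 / 15)
carrier_delay <- rexp(n, rate = 1 / 10)  # plays no role in the simulated response

# Simulated response: depends on arr_delay and nas_delay only
dep_delay <- 5 + 0.9 * arr_delay - 0.1 * nas_delay + rnorm(n, sd = 9)

full    <- lm(dep_delay ~ arr_delay + nas_delay + carrier_delay)
reduced <- lm(dep_delay ~ arr_delay + nas_delay)

# Partial F-test: does adding carrier_delay reduce the residual sum of squares?
comparison <- anova(reduced, full)
print(comparison)

# Adjusted R-squared of both fits, the criterion plotted with scale = "adjr2"
print(c(full    = summary(full)$adj.r.squared,
        reduced = summary(reduced)$adj.r.squared))
```

On the flight data the equivalent call would be anova(fit1, fit), which is valid here because both models are fitted to the same 168 complete rows.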

9.2.1 Selecting the best variables for model 2

leap <- regsubsets(lm_model1,data=flight, nbest = 1)
plot(leap, scale = "adjr2")

9.2.2 The Beta Coefficients Plot

library(coefplot)
coefplot(fit1, intercept = FALSE, outerCI = 1.96, coefficients = c("ArrTime","DepTime","ActualElapsedTime", "ArrDelay","NASDelay", "WeatherDelay","AirTime","Distance"))
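The numbers behind a coefficient plot, that is the point estimates and their 95% confidence limits, can be tabulated with confint() from base R. The sketch below fits a small model to simulated data, so fit_toy and its variables are hypothetical stand-ins for fit1.

```r
set.seed(2)  # synthetic data, for reproducibility only

n <- 150
arr_delay <- rnorm(n, mean = 10, sd = 30)
distance  <- runif(n, min = 50, max = 2500)
dep_delay <- 4 + 0.95 * arr_delay + 0.002 * distance + rnorm(n, sd = 9)

fit_toy <- lm(dep_delay ~ arr_delay + distance)

# Point estimates with 95% confidence limits, intercept row dropped
ci <- cbind(estimate = coef(fit_toy), confint(fit_toy, level = 0.95))
print(ci[-1, ])
```

A coefficient whose interval excludes zero corresponds to a predictor that is significant at the 5% level in the summary table.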