1. INTRODUCTION
US Airways (formerly known as USAir) was a major American airline that ceased to operate independently when the Federal Aviation Administration granted a single operating certificate (SOC) for US Airways and American Airlines on April 8, 2015. Publicly, the two carriers appeared to merge when their reservations systems and booking processes were combined on October 17, 2015; however, other systems were still separate at that time. The airline had an extensive international and domestic network, with 193 destinations in 24 countries in North America, South America, Europe and the Middle East. It was a member of the Star Alliance before becoming an affiliate member of Oneworld in March 2014. US Airways operated a fleet of 343 mainline jet aircraft, as well as 278 regional jet and turboprop aircraft operated by contract and subsidiary airlines under the name US Airways Express via code-sharing agreements. This paper addresses flight delays: we evaluate the factors that may be responsible for flights departing late.
2. OVERVIEW OF THE STUDY
Our field study concerns the factors that cause US Airways flights to be delayed. The dataset comprises variables such as departure time, departure delay, arrival delay and elapsed time, together with the reported delay causes: weather, security, late aircraft, carrier and NAS (National Aviation System) delays. We empirically study how these factors influence departure delay. Our regression analysis indicates that departure delay is most strongly associated with arrival delay, NAS delay, departure time and the elapsed time of the flight. Our analysis of the US Airways flights indicates a significant “delaying of flights”.
2.1. DATA DESCRIPTION
For this study, the data collected has 33 variables, including the flight number, the date, day and month of each flight, and the scheduled and actual departure and arrival times. The data also contain variables that record possible causes of delay, such as weather, security, NAS and carrier delays. The data consist of 700 rows. The airlines report the causes of delay in broad categories that were created by the Air Carrier On-Time Reporting Advisory Committee. The categories are Air Carrier, National Aviation System, Weather, Late-Arriving Aircraft and Security. The causes of cancellation are the same, except there is no late-arriving aircraft category.
Air Carrier: The cause of the cancellation or delay was due to circumstances within the airline’s control (e.g. maintenance or crew problems, aircraft cleaning, baggage loading, fueling, etc.).
Extreme Weather: Significant meteorological conditions (actual or forecasted) that, in the judgment of the carrier, delay or prevent the operation of a flight, such as a tornado, blizzard or hurricane.
National Aviation System (NAS): Delays and cancellations attributable to the national aviation system that refer to a broad set of conditions, such as non-extreme weather conditions, airport operations, heavy traffic volume, and air traffic control.
Late-arriving aircraft: A previous flight with the same aircraft arrived late, causing the present flight to depart late.
Security: Delays or cancellations caused by evacuation of a terminal or concourse, re-boarding of aircraft because of a security breach, inoperative screening equipment and/or long lines in excess of 29 minutes at screening areas.
3. MODEL ANALYSIS
HYPOTHESIS H1: THE DEPARTURE DELAY INCREASES AS THE ARRIVAL DELAY INCREASES.
In order to test the above hypothesis, we proposed the following model:
DepartureDelay = a0 + a1*ArrivalDelay + a2*NASDelay + a3*WeatherDelay + a4*ActualElapsedTime + a5*AirTime + a6*Distance + a7*ArrivalTime + a8*DepartureTime
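A model of this form can be estimated in R with lm(). The sketch below is only an illustration: it uses simulated data (not the flight data), a reduced set of predictors, and made-up true coefficients, so the numbers it produces say nothing about real flights.

```r
# Simulated stand-in data; all names and values here are assumptions
set.seed(1)
n <- 200
sim <- data.frame(
  ArrDelay = rnorm(n, mean = 10, sd = 30),
  NASDelay = rexp(n, rate = 1 / 15),
  Distance = runif(n, min = 50, max = 2500)
)

# Departure delay generated with a known ArrDelay effect of 0.9
sim$DepDelay <- 5 + 0.9 * sim$ArrDelay - 0.1 * sim$NASDelay + rnorm(n, sd = 5)

# Ordinary least squares fit of the linear model
fit_sim <- lm(DepDelay ~ ArrDelay + NASDelay + Distance, data = sim)

# Estimated coefficient a1 for ArrDelay
a1 <- coef(fit_sim)["ArrDelay"]
print(a1)
```

A positive estimate for ArrDelay is the pattern that H1 predicts.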
3.1. RESULTS
We identified the factors associated with flight delays. We estimated the model using linear least squares. If H1 holds, we expect the coefficient of ArrivalDelay to be positive.
We found empirical support for H1. The departure delay of a flight increases with its arrival delay.
THE COEFFICIENTS OF “ActualElapsedTime”, “ArrivalDelay”, “AirTime” AND “Distance” ARE STATISTICALLY SIGNIFICANT (p < 0.001) PREDICTORS OF THE DEPARTURE DELAY OF THE FLIGHTS.
The intercept (21.55) is significantly different from zero (p < 0.001).
The coefficient of ArrDelay (0.976, p < 0.001) means that every 10 min of arrival delay is associated with an expected departure delay of about 9.8 min, holding the other predictors fixed.
THE MULTIPLE R-SQUARED (0.9578) INDICATES THAT THE MODEL ACCOUNTS FOR 95.78% OF THE VARIANCE IN THE DEPARTURE DELAYS.
THE ADJUSTED R-SQUARED (0.9557) INDICATES THAT THE MODEL STILL ACCOUNTS FOR 95.57% OF THE VARIANCE AFTER ADJUSTING FOR THE NUMBER OF PREDICTORS.
THE RESIDUAL STANDARD ERROR (9.296) CAN BE THOUGHT OF AS THE AVERAGE ERROR, IN MINUTES, IN PREDICTING THE DEPARTURE DELAY OF FLIGHTS USING THIS MODEL.
THE F-STATISTIC (451.6 ON 8 AND 159 DEGREES OF FREEDOM) INDICATES THAT THE MODEL IS HIGHLY SIGNIFICANT, WITH A P-VALUE BELOW 2.2e-16 (p < 0.001).
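The adjusted R-squared and the F-statistic can both be recovered from the multiple R-squared, the number of observations used in the fit (168, i.e. 700 minus the 532 rows dropped for missing values) and the number of predictors (8). A short check, using the rounded values printed in the model summary, so the results agree only up to rounding:

```r
r2 <- 0.9578   # multiple R-squared reported for the model
n  <- 168      # observations used in the fit (700 - 532 with missing values)
p  <- 8        # number of predictors

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic on p and n - p - 1 degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 4))
print(round(f_stat, 1))
```

The adjusted R-squared comes out at about 0.9557 and the F-statistic at about 451, in line with the reported 0.9557 and 451.6; the small gap in the F-statistic comes from starting with a rounded R-squared.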
4. CONCLUSION
This project is motivated by the need for research that could improve our understanding of how various factors influence the delay of US Airways flights. We found that the arrival delay of a flight is strongly associated with its departure delay.
5. REFERENCES
[1] IMS PRO SCHOOL CASESTUDY
6. APPENDIX (SOURCE CODE)
6.1 PRELIMINARY WORK
6.1.1 LIBRARIES
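library(psych)     # descriptive statistics
library(lattice)   # histogram()
library(gplots)    # plotting helpers
library(car)       # scatterplot(), scatterplotMatrix()
library(corrplot)  # correlation plots
library(leaps)     # regsubsets()
library(corrgram)  # corrgram()
library(Hmisc)     # describe(), rcorr(); loaded last, so its describe() masks the one in psych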
6.1.2 Reading the raw data into a dataframe
setwd("C:/Users/Bagga/Desktop/Internship 2018/Project")
start.time = Sys.time()
flight = read.csv("Airlines.csv", header = TRUE)
end.time = Sys.time() # To check the time required to read the data
end.time - start.time
## Time difference of 0.015625 secs
describe(flight)
## flight
##
## 33 Variables 700 Observations
## ---------------------------------------------------------------------------
## X.3
## n missing distinct Info Mean Gmd .05 .10
## 700 0 700 1 3550 2320 350.9 708.9
## .25 .50 .75 .90 .95
## 1781.8 3676.5 5296.0 6244.0 6618.1
##
## lowest : 21 25 42 46 56, highest: 6893 6896 6917 6934 6942
## ---------------------------------------------------------------------------
## X.2
## n missing distinct Info Mean Gmd .05 .10
## 700 0 700 1 34199 23192 4244 7143
## .25 .50 .75 .90 .95
## 16393 34017 51941 62765 66105
##
## lowest : 85 166 228 435 501, highest: 69771 69809 69833 70026 70050
## ---------------------------------------------------------------------------
## X.1
## n missing distinct Info Mean Gmd .05 .10
## 700 0 700 1 353264 236796 37528 65595
## .25 .50 .75 .90 .95
## 175505 356573 528212 631514 666984
##
## lowest : 67 2267 2268 2728 2985, highest: 698042 699022 699629 699775 700332
## ---------------------------------------------------------------------------
## X
## n missing distinct Info Mean Gmd .05 .10
## 700 0 700 1 3430628 2324322 295527 662373
## .25 .50 .75 .90 .95
## 1734461 3415473 5115834 6266653 6599109
##
## lowest : 2610 13514 17400 44951 45971
## highest: 6924792 6935361 6952119 6975314 6980407
## ---------------------------------------------------------------------------
## Year
## n missing distinct Info Mean Gmd
## 700 0 1 0 2008 0
##
## Value 2008
## Frequency 700
## Proportion 1
## ---------------------------------------------------------------------------
## Month
## n missing distinct Info Mean Gmd .05 .10
## 700 0 12 0.993 6.243 3.896 1 2
## .25 .50 .75 .90 .95
## 3 6 9 11 12
##
## Value 1 2 3 4 5 6 7 8 9 10
## Frequency 64 51 70 69 61 51 72 62 46 54
## Proportion 0.091 0.073 0.100 0.099 0.087 0.073 0.103 0.089 0.066 0.077
##
## Value 11 12
## Frequency 47 53
## Proportion 0.067 0.076
## ---------------------------------------------------------------------------
## DayofMonth
## n missing distinct Info Mean Gmd .05 .10
## 700 0 31 0.999 15.37 10.46 2 3
## .25 .50 .75 .90 .95
## 7 16 23 28 29
##
## lowest : 1 2 3 4 5, highest: 27 28 29 30 31
## ---------------------------------------------------------------------------
## DayOfWeek
## n missing distinct Info Mean Gmd
## 700 0 7 0.979 3.896 2.208
##
## Value 1 2 3 4 5 6 7
## Frequency 96 99 121 112 102 78 92
## Proportion 0.137 0.141 0.173 0.160 0.146 0.111 0.131
## ---------------------------------------------------------------------------
## DepTime
## n missing distinct Info Mean Gmd .05 .10
## 687 13 494 1 1374 533.6 640.3 736.2
## .25 .50 .75 .90 .95
## 1010.0 1355.0 1730.5 2003.8 2121.7
##
## lowest : 2 530 538 541 542, highest: 2310 2314 2321 2340 2355
## ---------------------------------------------------------------------------
## CRSDepTime
## n missing distinct Info Mean Gmd .05 .10
## 700 0 317 1 1362 515.3 645 740
## .25 .50 .75 .90 .95
## 1014 1342 1719 1955 2105
##
## lowest : 535 540 545 557 600, highest: 2240 2254 2300 2315 2355
## ---------------------------------------------------------------------------
## ArrTime
## n missing distinct Info Mean Gmd .05 .10
## 687 13 501 1 1515 556.5 749.6 853.0
## .25 .50 .75 .90 .95
## 1152.0 1543.0 1911.0 2138.4 2233.0
##
## lowest : 3 4 17 23 34, highest: 2343 2346 2350 2351 2356
## ---------------------------------------------------------------------------
## CRSArrTime
## n missing distinct Info Mean Gmd .05 .10
## 700 0 448 1 1534 527.8 819.9 912.7
## .25 .50 .75 .90 .95
## 1209.0 1550.5 1910.0 2145.8 2235.2
##
## lowest : 4 5 20 40 137, highest: 2350 2354 2355 2358 2359
## ---------------------------------------------------------------------------
## UniqueCarrier
## n missing distinct
## 700 0 20
##
## Value 9E AA AQ AS B6 CO DL EV F9 FL
## Frequency 23 49 1 22 32 28 37 21 10 30
## Proportion 0.033 0.070 0.001 0.031 0.046 0.040 0.053 0.030 0.014 0.043
##
## Value HA MQ NW OH OO UA US WN XE YV
## Frequency 8 60 39 17 59 42 47 118 28 29
## Proportion 0.011 0.086 0.056 0.024 0.084 0.060 0.067 0.169 0.040 0.041
## ---------------------------------------------------------------------------
## FlightNum
## n missing distinct Info Mean Gmd .05 .10
## 700 0 642 1 2156 2112 111.0 231.9
## .25 .50 .75 .90 .95
## 542.8 1470.0 3433.5 5368.0 5920.6
##
## lowest : 1 3 8 9 10, highest: 7275 7307 7311 7755 7773
## ---------------------------------------------------------------------------
## TailNum
## n missing distinct
## 700 0 641
##
## lowest : 80139E 80359E 80419E 83909E, highest: N986CA N989AT N989CA N992DL N995AT
## ---------------------------------------------------------------------------
## ActualElapsedTime
## n missing distinct Info Mean Gmd .05 .10
## 686 14 227 1 130 75.75 53.0 61.0
## .25 .50 .75 .90 .95
## 79.0 111.0 161.0 227.5 284.5
##
## lowest : 32 35 36 38 39, highest: 403 407 414 422 576
## ---------------------------------------------------------------------------
## CRSElapsedTime
## n missing distinct Info Mean Gmd .05 .10
## 700 0 213 1 131.1 74.8 52.95 65.00
## .25 .50 .75 .90 .95
## 80.00 111.00 162.00 229.00 279.00
##
## lowest : 30 34 35 37 40, highest: 389 406 410 415 575
## ---------------------------------------------------------------------------
## AirTime
## n missing distinct Info Mean Gmd .05 .10
## 686 14 223 1 106.4 72 34.0 42.5
## .25 .50 .75 .90 .95
## 57.0 88.0 135.0 197.5 249.5
##
## lowest : 15 19 20 21 22, highest: 363 374 378 396 554
## ---------------------------------------------------------------------------
## ArrDelay
## n missing distinct Info Mean Gmd .05 .10
## 686 14 126 1 8.57 30.28 -23 -17
## .25 .50 .75 .90 .95
## -9 -1 14 41 71
##
## lowest : -62 -60 -38 -37 -34, highest: 169 177 180 234 264
## ---------------------------------------------------------------------------
## DepDelay
## n missing distinct Info Mean Gmd .05 .10
## 687 13 106 0.998 10.17 23.85 -9.0 -7.0
## .25 .50 .75 .90 .95
## -4.0 0.0 11.0 37.4 72.0
##
## lowest : -26 -21 -15 -14 -13, highest: 171 173 175 214 262
## ---------------------------------------------------------------------------
## Origin
## n missing distinct
## 700 0 134
##
## lowest : ABQ ACT ACV ALB ANC, highest: TUL TUS TYS VPS XNA
## ---------------------------------------------------------------------------
## Dest
## n missing distinct
## 700 0 140
##
## lowest : ABI ABQ ALB AMA ANC, highest: TUL TUS TYS WRG XNA
## ---------------------------------------------------------------------------
## Distance
## n missing distinct Info Mean Gmd .05 .10
## 700 0 448 1 743.1 594.5 156.0 214.0
## .25 .50 .75 .90 .95
## 330.5 589.0 967.0 1522.0 1979.5
##
## lowest : 49 67 74 82 86, highest: 2689 2762 2936 2979 4962
## ---------------------------------------------------------------------------
## TaxiIn
## n missing distinct Info Mean Gmd .05 .10
## 687 13 31 0.986 6.844 4.546 2 3
## .25 .50 .75 .90 .95
## 4 6 8 12 15
##
## lowest : 1 2 3 4 5, highest: 34 35 39 44 77
## ---------------------------------------------------------------------------
## TaxiOut
## n missing distinct Info Mean Gmd .05 .10
## 687 13 56 0.997 16.96 10.49 7 8
## .25 .50 .75 .90 .95
## 10 14 19 28 35
##
## lowest : 4 5 6 7 8, highest: 81 88 102 137 152
## ---------------------------------------------------------------------------
## Cancelled
## n missing distinct Info Sum Mean Gmd
## 700 0 2 0.055 13 0.01857 0.03651
##
## ---------------------------------------------------------------------------
## CancellationCode
## n missing distinct
## 700 0 4
##
## Value A B C
## Frequency 687 1 8 4
## Proportion 0.981 0.001 0.011 0.006
## ---------------------------------------------------------------------------
## Diverted
## n missing distinct Info Sum Mean Gmd
## 700 0 2 0.004 1 0.001429 0.002857
##
## ---------------------------------------------------------------------------
## CarrierDelay
## n missing distinct Info Mean Gmd .05 .10
## 168 532 38 0.756 8.649 14.32 0.0 0.0
## .25 .50 .75 .90 .95
## 0.0 0.0 8.0 30.0 43.6
##
## lowest : 0 1 2 3 4, highest: 76 80 94 96 108
## ---------------------------------------------------------------------------
## WeatherDelay
## n missing distinct Info Mean Gmd .05 .10
## 168 532 11 0.184 2.065 4.002 0.00 0.00
## .25 .50 .75 .90 .95
## 0.00 0.00 0.00 0.00 7.65
##
## Value 0 2 7 8 13 20 30 35 50 72
## Frequency 157 1 1 1 1 1 2 1 1 1
## Proportion 0.935 0.006 0.006 0.006 0.006 0.006 0.012 0.006 0.006 0.006
##
## Value 80
## Frequency 1
## Proportion 0.006
## ---------------------------------------------------------------------------
## NASDelay
## n missing distinct Info Mean Gmd .05 .10
## 168 532 55 0.947 18.38 26.2 0.00 0.00
## .25 .50 .75 .90 .95
## 0.00 5.50 23.25 51.30 80.95
##
## lowest : 0 1 2 3 4, highest: 97 146 154 157 167
## ---------------------------------------------------------------------------
## SecurityDelay
## n missing distinct Info Mean Gmd
## 168 532 1 0 0 0
##
## Value 0
## Frequency 168
## Proportion 1
## ---------------------------------------------------------------------------
## LateAircraftDelay
## n missing distinct Info Mean Gmd .05 .10
## 168 532 57 0.9 22.62 33.46 0.0 0.0
## .25 .50 .75 .90 .95
## 0.0 4.5 28.0 65.2 101.6
##
## lowest : 0 1 3 4 5, highest: 151 173 174 209 242
## ---------------------------------------------------------------------------
6.2 Inspecting the data types and converting the type of some columns
6.2.1 Converting day of week
flight$DayOfWeek[flight$DayOfWeek == 1] = 'Mon'
flight$DayOfWeek[flight$DayOfWeek == 2] = 'Tue'
flight$DayOfWeek[flight$DayOfWeek == 3] = 'Wed'
flight$DayOfWeek[flight$DayOfWeek == 4] = 'Thu'
flight$DayOfWeek[flight$DayOfWeek == 5] = 'Fri'
flight$DayOfWeek[flight$DayOfWeek == 6] = 'Sat'
flight$DayOfWeek[flight$DayOfWeek == 7] = 'Sun'
flight$DayOfWeek <- factor(flight$DayOfWeek)
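The same recoding can be written in one step by passing levels and labels to factor(). This is an alternative sketch, not the code used to produce the output in this appendix; it runs on a small made-up vector, and it keeps the levels in calendar order rather than the alphabetical order that the step-by-step version produces.

```r
# Hypothetical example vector of day codes (1 = Monday ... 7 = Sunday)
day_code <- c(1, 7, 7, 4, 6, 2)

day_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Map codes 1..7 to labels; the level order follows `levels`, not the alphabet
day_of_week <- factor(day_code, levels = 1:7, labels = day_labels)

print(day_of_week)
print(table(day_of_week))
```

The month column could be handled the same way with levels 1:12 and the twelve month abbreviations.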
6.2.2 Converting month
flight$Month[flight$Month == 1] = 'Jan'
flight$Month[flight$Month == 2] = 'Feb'
flight$Month[flight$Month == 3] = 'Mar'
flight$Month[flight$Month == 4] = 'Apr'
flight$Month[flight$Month == 5] = 'May'
flight$Month[flight$Month == 6] = 'Jun'
flight$Month[flight$Month == 7] = 'Jul'
flight$Month[flight$Month == 8] = 'Aug'
flight$Month[flight$Month == 9] = 'Sep'
flight$Month[flight$Month == 10] = 'Oct'
flight$Month[flight$Month == 11] = 'Nov'
flight$Month[flight$Month == 12] = 'Dec'
flight$Month <- factor(flight$Month)
6.2.3 Converting cancelled flag from 0 and 1 to ‘N’ and ‘Y’ respectively
flight$Cancelled[flight$Cancelled == 0] = "N"
flight$Cancelled[flight$Cancelled == 1] = "Y"
flight$Cancelled <- factor(flight$Cancelled)
6.3 Creating different data frames for different cancellation reasons
carrier_cancel = flight[flight$Cancelled == 'Y' & flight$CancellationCode == 'A',]
weather_cancel = flight[flight$Cancelled == 'Y' & flight$CancellationCode == 'B',]
nas_cancel = flight[flight$Cancelled == 'Y' & flight$CancellationCode == 'C',]
security_cancel = flight[flight$Cancelled == 'Y' & flight$CancellationCode == 'D',]
flight$CancellationCode <- factor(flight$CancellationCode)
str(flight)
## 'data.frame': 700 obs. of 33 variables:
## $ X.3 : int 5877 6347 2623 6490 1690 6022 1005 4282 6130 3096 ...
## $ X.2 : int 12233 45278 1914 45003 35084 60316 3584 26851 10118 23446 ...
## $ X.1 : int 419707 383820 502241 544300 261196 304594 684985 164963 610162 369467 ...
## $ X : int 529794 2402000 5238992 826229 858053 3439654 5887084 281445 5737438 6217356 ...
## $ Year : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
## $ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 9 12 4 4 7 11 5 11 10 ...
## $ DayofMonth : int 28 7 10 21 9 18 2 21 23 25 ...
## $ DayOfWeek : Factor w/ 7 levels "Fri","Mon","Sat",..: 2 7 7 5 3 7 5 2 5 6 ...
## $ DepTime : int 613 1209 722 942 1243 1432 744 1459 1855 742 ...
## $ CRSDepTime : int 615 1200 725 905 1249 1430 755 1440 1857 745 ...
## $ ArrTime : int 706 1303 855 1244 1507 1649 1011 1701 1940 955 ...
## $ CRSArrTime : int 706 1255 858 1217 1518 1621 1029 1652 1940 1012 ...
## $ UniqueCarrier : Factor w/ 20 levels "9E","AA","AQ",..: 3 18 13 16 17 13 5 17 11 7 ...
## $ FlightNum : int 42 24 866 528 206 955 401 1410 308 1112 ...
## $ TailNum : Factor w/ 641 levels "","80139E","80359E",..: 520 395 482 524 378 149 71 608 264 175 ...
## $ ActualElapsedTime: int 53 54 93 122 84 257 207 122 45 73 ...
## $ CRSElapsedTime : int 51 55 93 132 89 231 214 132 43 87 ...
## $ AirTime : int 41 42 62 104 65 214 189 101 29 58 ...
## $ ArrDelay : int 0 8 -3 27 -11 28 -18 9 0 -17 ...
## $ DepDelay : int -2 9 -3 37 -6 2 -11 19 -2 -3 ...
## $ Origin : Factor w/ 134 levels "ABQ","ACT","ACV",..: 55 56 92 99 22 84 16 7 55 70 ...
## $ Dest : Factor w/ 140 levels "ABI","ABQ","ALB",..: 61 32 133 17 108 69 93 107 66 129 ...
## $ Distance : int 216 239 449 867 369 1619 1367 665 163 368 ...
## $ TaxiIn : int 5 3 14 6 7 10 3 3 4 4 ...
## $ TaxiOut : int 7 9 17 12 12 33 15 18 12 11 ...
## $ Cancelled : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ CancellationCode : Factor w/ 4 levels "","A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CarrierDelay : int NA NA NA 0 NA 0 NA NA NA NA ...
## $ WeatherDelay : int NA NA NA 0 NA 0 NA NA NA NA ...
## $ NASDelay : int NA NA NA 0 NA 28 NA NA NA NA ...
## $ SecurityDelay : int NA NA NA 0 NA 0 NA NA NA NA ...
## $ LateAircraftDelay: int NA NA NA 27 NA 0 NA NA NA NA ...
6.4 Contingency table showing the number of flights cancelled (yes/no)
allcancelled <- table(flight$Cancelled)
allcancelled
##
## N Y
## 687 13
6.4.1 Contingency table showing the number of flights cancelled as per the cancellation code.
allcancelled1 <- table(flight$CancellationCode)
allcancelled1
##
## A B C
## 687 1 8 4
6.5 Creating different data frames for different flight timings
not_cancelled = flight[!(is.na(flight$DepDelay) | flight$DepDelay == ""), ]
delayed_flight = not_cancelled[not_cancelled$DepDelay > 0, ]
on_time_flight = not_cancelled[not_cancelled$DepDelay == 0, ]
before_time_flight = not_cancelled[not_cancelled$DepDelay < 0, ]
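The three subsets can also be described by a single label per flight. A minimal sketch, assuming a made-up vector of departure delays in minutes (the name dep_delay and its values are hypothetical):

```r
# Hypothetical departure delays in minutes; NA stands for a cancelled flight
dep_delay <- c(-2, 9, 0, 37, NA, -6)

# sign() gives -1, 0 or 1; NA stays NA
status <- factor(sign(dep_delay),
                 levels = c(-1, 0, 1),
                 labels = c("before_time", "on_time", "delayed"))

print(table(status, useNA = "ifany"))
```

A label column like this keeps all flights in one data frame, which can be convenient for grouped summaries or plots.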
6.6 Counting the total number of cancellations due to all causes for each carrier
all_cancelled_table = table(flight$UniqueCarrier, flight$Cancelled)
write.csv(all_cancelled_table, "all_cancelled_count.csv")
all_cancelled_count = read.csv("all_cancelled_count.csv")
names(all_cancelled_count)[names(all_cancelled_count) == 'X'] = 'UniqueCarrier'
names(all_cancelled_count)[names(all_cancelled_count) == 'N'] = 'not_cancelled'
names(all_cancelled_count)[names(all_cancelled_count) == 'Y'] = 'total_cancelled'
rm(all_cancelled_table)
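Writing the table to disk and reading it back is one way to turn it into a data frame. A sketch of an in-memory alternative, using as.data.frame.matrix() on a small made-up example (the carrier codes and flags below are illustrative, not taken from the data):

```r
# Hypothetical carriers and cancellation flags
carrier   <- c("AA", "AA", "US", "WN", "US", "WN", "WN")
cancelled <- factor(c("N", "Y", "N", "N", "N", "Y", "N"), levels = c("N", "Y"))

# Two-way contingency table: one row per carrier, columns N and Y
tab <- table(carrier, cancelled)

# Convert the table to a wide data frame without touching the disk
wide <- as.data.frame.matrix(tab)
counts <- data.frame(
  UniqueCarrier   = rownames(wide),
  not_cancelled   = wide$N,
  total_cancelled = wide$Y,
  stringsAsFactors = FALSE
)

print(counts)
```

This avoids leaving an intermediate CSV file in the working directory.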
6.6.1 Contingency table specifying the number of flights cancelled per unique carrier
#my_table <- xtabs(~ UniqueCarrier + total_cancelled, data = all_cancelled_count)
#addmargins(my_table)
6.7 VISUALIZATIONS
6.7.1 Distribution of Carrier Delay
histogram(~ CarrierDelay, data = flight ,
main = "Distribution of Carrier Delay",
xlab = "Carrier Delay",col = "grey")

6.7.2 Distribution of Weather Delay
histogram(~ WeatherDelay, data = flight ,
main = "Distribution of weather Delay",
xlab = "Weather Delay",col = "grey")

6.7.3 Distribution of NAS Delay
histogram(~ NASDelay, data = flight ,
main = "Distribution of NAS Delay",
xlab = "NAS Delay",col = "grey")

6.7.4 Distribution of Security Delay
histogram(~ SecurityDelay, data = flight ,
main = "Distribution of Security Delay",
xlab = "Security Delay",col = "grey")

6.7.5 Distribution of delayed flights
scatterplot(DepDelay ~ DepTime, data = delayed_flight,
main = "Scatterplot of Delayed Flights vs their Departure Time")

6.8 Analysing delayed flights against time variables
scatterplotMatrix(~ ArrDelay + DepDelay + DepTime + ArrTime + ActualElapsedTime, data = delayed_flight,
main = "Delaying of flights on various Time Factors")

7. T-tests
7.3.1 Visualising the effect of distance on delayed flights
scatterplot(DepDelay ~ Distance, data = delayed_flight,
main = "Scatterplot of Delayed Flights vs their Distance")

8. Correlation between various time factors
8.1 Correlation Matrix
colflights <- c("ArrDelay","DepDelay","DepTime","ArrTime","ActualElapsedTime","AirTime")
corMatrix <- rcorr(as.matrix(flight[,colflights]))
corMatrix
## ArrDelay DepDelay DepTime ArrTime ActualElapsedTime
## ArrDelay 1.00 0.91 0.25 0.02 0.09
## DepDelay 0.91 1.00 0.29 0.00 0.02
## DepTime 0.25 0.29 1.00 0.67 -0.05
## ArrTime 0.02 0.00 0.67 1.00 -0.02
## ActualElapsedTime 0.09 0.02 -0.05 -0.02 1.00
## AirTime 0.02 0.00 -0.06 -0.03 0.98
## AirTime
## ArrDelay 0.02
## DepDelay 0.00
## DepTime -0.06
## ArrTime -0.03
## ActualElapsedTime 0.98
## AirTime 1.00
##
## n
## ArrDelay DepDelay DepTime ArrTime ActualElapsedTime
## ArrDelay 686 686 686 686 686
## DepDelay 686 687 687 687 686
## DepTime 686 687 687 687 686
## ArrTime 686 687 687 687 686
## ActualElapsedTime 686 686 686 686 686
## AirTime 686 686 686 686 686
## AirTime
## ArrDelay 686
## DepDelay 686
## DepTime 686
## ArrTime 686
## ActualElapsedTime 686
## AirTime 686
##
## P
## ArrDelay DepDelay DepTime ArrTime ActualElapsedTime
## ArrDelay 0.0000 0.0000 0.5863 0.0243
## DepDelay 0.0000 0.0000 0.9965 0.6508
## DepTime 0.0000 0.0000 0.0000 0.2208
## ArrTime 0.5863 0.9965 0.0000 0.5561
## ActualElapsedTime 0.0243 0.6508 0.2208 0.5561
## AirTime 0.6013 0.9960 0.1203 0.3696 0.0000
## AirTime
## ArrDelay 0.6013
## DepDelay 0.9960
## DepTime 0.1203
## ArrTime 0.3696
## ActualElapsedTime 0.0000
## AirTime
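rcorr() reports three matrices: the correlations, the number of complete pairs, and the p-values. For a single pair of variables, comparable quantities can be obtained with base R. A minimal sketch on simulated values (the variables x and y are hypothetical stand-ins, not the flight data):

```r
# Simulated pair of correlated variables with a few missing values
set.seed(42)
x <- rnorm(100)
y <- 0.9 * x + rnorm(100, sd = 0.4)
y[c(3, 50)] <- NA                      # mimic missing values

ok <- complete.cases(x, y)             # pairs with both values present
r  <- cor(x[ok], y[ok])                # Pearson correlation
n  <- sum(ok)                          # number of pairs used
p  <- cor.test(x[ok], y[ok])$p.value   # test of zero correlation

print(c(r = r, n = n, p = p))
```

Reading the matrices this way makes clear why the n matrix shows 686 or 687 rather than 700: pairs with a missing value are left out.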
8.1.1 Visualising Correlation Matrix by Corrgram
corrgram(flight[,colflights])

9. REGRESSION
Formulating a multiple linear regression model for departure delay, followed by variable selection
9.1 Proposed Model 1
Independent Variables: {“ArrDelay”,“DepTime”,“ArrTime”,“ActualElapsedTime”,“AirTime”,“CarrierDelay”,“NASDelay”,“WeatherDelay”, “Distance”}
Dependent Variable : {“DepDelay”}
lm_model <- DepDelay ~ ArrTime + DepTime + ActualElapsedTime + ArrDelay + CarrierDelay + NASDelay + WeatherDelay + AirTime + Distance
fit <- lm(lm_model,data = flight)
summary(fit)
##
## Call:
## lm(formula = lm_model, data = flight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.2315 -5.3353 -0.9217 5.4270 27.2667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.423013 4.282595 5.002 1.49e-06 ***
## ArrTime -0.001450 0.001509 -0.961 0.33811
## DepTime 0.005012 0.001996 2.510 0.01307 *
## ActualElapsedTime -0.756171 0.043730 -17.292 < 2e-16 ***
## ArrDelay 0.978137 0.022125 44.209 < 2e-16 ***
## CarrierDelay -0.018158 0.041519 -0.437 0.66247
## NASDelay -0.099523 0.033412 -2.979 0.00335 **
## WeatherDelay 0.119811 0.074927 1.599 0.11181
## AirTime 0.367126 0.058026 6.327 2.46e-09 ***
## Distance 0.045913 0.006459 7.109 3.80e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.32 on 158 degrees of freedom
## (532 observations deleted due to missingness)
## Multiple R-squared: 0.9579, Adjusted R-squared: 0.9555
## F-statistic: 399.4 on 9 and 158 DF, p-value: < 2.2e-16
9.1.1 Selecting the best variables for model 1
leap <- regsubsets(lm_model,data=flight, nbest = 1)
plot(leap, scale = "adjr2")
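regsubsets() searches for the best subset of predictors of each size, and the plot ranks those subsets by adjusted R-squared. The idea can be sketched with a brute-force search in base R. The example uses the built-in mtcars data and an arbitrary choice of four predictors purely for illustration; it is not the leaps algorithm itself and is not run on the flight data.

```r
# Illustration on the built-in mtcars data (NOT the flight data)
response   <- "mpg"
predictors <- c("wt", "hp", "disp", "qsec")

best_by_size <- lapply(seq_along(predictors), function(k) {
  # All subsets of k predictors
  subsets <- combn(predictors, k, simplify = FALSE)
  fits <- lapply(subsets, function(vars) {
    lm(reformulate(vars, response = response), data = mtcars)
  })
  # Best subset of this size = smallest residual sum of squares
  rss  <- vapply(fits, function(f) sum(residuals(f)^2), numeric(1))
  best <- which.min(rss)
  data.frame(
    size      = k,
    variables = paste(subsets[[best]], collapse = " + "),
    adjr2     = summary(fits[[best]])$adj.r.squared
  )
})
best_by_size <- do.call(rbind, best_by_size)

# Rank the per-size winners by adjusted R-squared, highest first
print(best_by_size[order(-best_by_size$adjr2), ])
```

With many predictors a brute-force search becomes slow, which is one reason to rely on a dedicated routine such as regsubsets().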

9.2 Formulating model 2
Independent Variables: {“ArrDelay”,“DepTime”,“ArrTime”,“ActualElapsedTime”,“AirTime”,“NASDelay”,“WeatherDelay”, “Distance”}
Dependent Variable : {“DepDelay”}
lm_model1 <- DepDelay ~ ArrTime + DepTime + ActualElapsedTime + ArrDelay + NASDelay + WeatherDelay + AirTime + Distance
fit1 <- lm(lm_model1,data = flight)
summary(fit1)
##
## Call:
## lm(formula = lm_model1, data = flight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.3450 -5.2742 -0.8317 5.1974 27.3391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.548052 4.262160 5.056 1.17e-06 ***
## ArrTime -0.001588 0.001471 -1.080 0.28191
## DepTime 0.005025 0.001991 2.524 0.01259 *
## ActualElapsedTime -0.755842 0.043612 -17.331 < 2e-16 ***
## ArrDelay 0.975834 0.021435 45.526 < 2e-16 ***
## NASDelay -0.095942 0.032310 -2.969 0.00345 **
## WeatherDelay 0.123855 0.074164 1.670 0.09688 .
## AirTime 0.363753 0.057364 6.341 2.26e-09 ***
## Distance 0.046306 0.006380 7.258 1.64e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.296 on 159 degrees of freedom
## (532 observations deleted due to missingness)
## Multiple R-squared: 0.9578, Adjusted R-squared: 0.9557
## F-statistic: 451.6 on 8 and 159 DF, p-value: < 2.2e-16
9.2.1 Selecting the best variables for model 2
leap <- regsubsets(lm_model1,data=flight, nbest = 1)
plot(leap, scale = "adjr2")

9.3.2 The Beta Coefficients Plot
library(coefplot)
coefplot(fit1, intercept = FALSE, outerCI = 1.96, coefficients = c("ArrTime","DepTime","ActualElapsedTime", "ArrDelay","NASDelay", "WeatherDelay","AirTime","Distance"))
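With outerCI = 1.96, the outer interval in the plot spans 1.96 standard errors on each side of an estimate. A sketch of the same interval computed by hand for the ArrDelay row, using the estimate and standard error printed in the model 2 summary:

```r
estimate  <- 0.975834   # ArrDelay estimate from the model 2 summary
std_error <- 0.021435   # its standard error from the same summary

# Interval of +/- 1.96 standard errors around the estimate
ci <- estimate + c(-1, 1) * 1.96 * std_error
print(round(ci, 3))
```

The interval stays well above zero, which is consistent with the small p-value reported for this coefficient.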
