MATH1324 Introduction to Statistics Assignment 3 - Final Project

Flight Arrival and Departure Delay in USA

ALI HAIDER (s3622366) and Venkata Mallikarjun (s3642400)

October 28, 2018

Introduction

In the contemporary world, air travel has become a regular feature of human life.

-Air travel can be for personal or business purposes, and a number of domestic and international airlines have emerged in the recent past.

-It is very important for a country’s immigration and airport departments to know whether flights are delayed at arrival or at departure.

Problem Statement

The world has seen an ever-increasing number of flights every day, and keeping arrivals and departures on time is hard.

-The USA has thousands of flights departing and arriving every day, and some of these flights are delayed.

-This project investigates where the delays actually occur at the airport, i.e. whether they are arrival delays or departure delays.

-The need of the hour is to properly apply statistical analysis to the dataset and carry out hypothesis testing to pinpoint the problem area, and then to work out a way to reduce or overcome the delays.

Data Collection and sampling

As the USA is a world superpower and accounts for a large share of the global economy, there is a huge influx of people traveling to the USA.

-With this in mind, we decided to obtain data for airports in different states of the USA and to record the arrival and departure delay of each flight.

-Normally it is assumed that the delay occurs at the arrival of a flight and not at its departure.

-As airports have different procedures in the arrival and departure lounges, once we do statistical analysis on the given data we can come to a logical conclusion about where the issue is and how it can be sorted out.

airlines <- read.csv("C:/Users/ALI HAIDER/Desktop/stats/flight-delays/airlines.csv")
airports <- read.csv("C:/Users/ALI HAIDER/Desktop/stats/airports.csv")
flights <- read.csv("C:/Users/ALI HAIDER/Desktop/stats/flights.csv")

#Summary stats

# This section shows each dataset's columns (class, unique values, missing values) along with the in-memory size of the data frame in Mb.

df.info <- function(x) {
  dat  <- deparse(substitute(x))               ## data frame name as written in the call
  size <- format(object.size(x), units = "Mb") ## in-memory size of the data frame in Mb

  ## per-column information (first class only, so multi-class columns still give one value)
  column.info <- data.frame( column        = names(x),
                             class         = sapply(x, function(y) class(y)[1]),
                             unique.values = sapply(x, function(y) length(unique(y))),
                             missing.count = colSums(is.na(x)),
                             missing.pct   = round(colSums(is.na(x)) / nrow(x) * 100, 2))

  row.names(column.info) <- seq_len(nrow(column.info))

  list(data.frame     = data.frame(name = dat, size = size),
       dimensions     = data.frame(rows = nrow(x), columns = ncol(x)),
       column.details = column.info)
}

df.info(flights)
## $data.frame
##      name     size
## 1 flights 688.5 Mb
## 
## $dimensions
##      rows columns
## 1 5819079      31
## 
## $column.details
##                 column   class unique.values missing.count missing.pct
## 1                 YEAR integer             1             0        0.00
## 2                MONTH integer            12             0        0.00
## 3                  DAY integer            31             0        0.00
## 4          DAY_OF_WEEK integer             7             0        0.00
## 5              AIRLINE  factor            14             0        0.00
## 6        FLIGHT_NUMBER integer          6952             0        0.00
## 7          TAIL_NUMBER  factor          4898             0        0.00
## 8       ORIGIN_AIRPORT  factor           628             0        0.00
## 9  DESTINATION_AIRPORT  factor           629             0        0.00
## 10 SCHEDULED_DEPARTURE integer          1321             0        0.00
## 11      DEPARTURE_TIME integer          1441         86153        1.48
## 12     DEPARTURE_DELAY integer          1218         86153        1.48
## 13            TAXI_OUT integer           185         89047        1.53
## 14          WHEELS_OFF integer          1441         89047        1.53
## 15      SCHEDULED_TIME integer           551             6        0.00
## 16        ELAPSED_TIME integer           713        105071        1.81
## 17            AIR_TIME integer           676        105071        1.81
## 18            DISTANCE integer          1363             0        0.00
## 19           WHEELS_ON integer          1441         92513        1.59
## 20             TAXI_IN integer           186         92513        1.59
## 21   SCHEDULED_ARRIVAL integer          1435             0        0.00
## 22        ARRIVAL_TIME integer          1441         92513        1.59
## 23       ARRIVAL_DELAY integer          1241        105071        1.81
## 24            DIVERTED integer             2             0        0.00
## 25           CANCELLED integer             2             0        0.00
## 26 CANCELLATION_REASON  factor             5             0        0.00
## 27    AIR_SYSTEM_DELAY integer           571       4755640       81.72
## 28      SECURITY_DELAY integer           155       4755640       81.72
## 29       AIRLINE_DELAY integer          1068       4755640       81.72
## 30 LATE_AIRCRAFT_DELAY integer           696       4755640       81.72
## 31       WEATHER_DELAY integer           633       4755640       81.72
df.info(airports)
## $data.frame
##       name   size
## 1 airports 0.1 Mb
## 
## $dimensions
##   rows columns
## 1  322       7
## 
## $column.details
##      column   class unique.values missing.count missing.pct
## 1 IATA_CODE  factor           322             0        0.00
## 2   AIRPORT  factor           322             0        0.00
## 3      CITY  factor           308             0        0.00
## 4     STATE  factor            54             0        0.00
## 5   COUNTRY  factor             1             0        0.00
## 6  LATITUDE numeric           320             3        0.93
## 7 LONGITUDE numeric           320             3        0.93
df.info(airlines)
## $data.frame
##       name size
## 1 airlines 0 Mb
## 
## $dimensions
##   rows columns
## 1   14       2
## 
## $column.details
##      column  class unique.values missing.count missing.pct
## 1 IATA_CODE factor            14             0           0
## 2   AIRLINE factor            14             0           0
# Ensuring the same airlines are represented in flights and airlines


airlines
##    IATA_CODE                      AIRLINE
## 1         UA        United Air Lines Inc.
## 2         AA       American Airlines Inc.
## 3         US              US Airways Inc.
## 4         F9       Frontier Airlines Inc.
## 5         B6              JetBlue Airways
## 6         OO        Skywest Airlines Inc.
## 7         AS         Alaska Airlines Inc.
## 8         NK             Spirit Air Lines
## 9         WN       Southwest Airlines Co.
## 10        DL         Delta Air Lines Inc.
## 11        EV  Atlantic Southeast Airlines
## 12        HA       Hawaiian Airlines Inc.
## 13        MQ American Eagle Airlines Inc.
## 14        VX               Virgin America
unique(flights$AIRLINE)
##  [1] AS AA US DL NK UA HA B6 OO EV MQ F9 WN VX
## Levels: AA AS B6 DL EV F9 HA MQ NK OO UA US VX WN
sort(airlines$IATA_CODE) == sort(unique(flights$AIRLINE))
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
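
The element-wise comparison above works here because both factors have the same length and the same 14 levels. Comparing factors whose level sets differ raises an error in R, and vectors of different lengths would be recycled. A set-based check is somewhat more robust. The following is only a minimal sketch on made-up data; the names `airlines_toy` and `flights_toy` are hypothetical stand-ins for the real tables.

```r
# Toy stand-ins for the real tables (hypothetical data, for illustration only)
airlines_toy <- data.frame(IATA_CODE = c("UA", "AA", "DL"),
                           AIRLINE   = c("United Air Lines Inc.",
                                         "American Airlines Inc.",
                                         "Delta Air Lines Inc."),
                           stringsAsFactors = TRUE)
flights_toy  <- data.frame(AIRLINE = c("AA", "DL", "UA", "AA", "DL"),
                           stringsAsFactors = TRUE)

# Compare as character sets so that order, duplicates and factor levels do not matter
codes_lookup  <- as.character(airlines_toy$IATA_CODE)
codes_flights <- as.character(unique(flights_toy$AIRLINE))
same_airlines <- setequal(codes_lookup, codes_flights)

# Codes present in one table but not in the other (both empty when the sets match)
missing_in_lookup  <- setdiff(codes_flights, codes_lookup)
missing_in_flights <- setdiff(codes_lookup, codes_flights)

print(same_airlines)
```

When the two sets differ, `missing_in_lookup` and `missing_in_flights` show which codes are responsible, which a vector of TRUE/FALSE values does not.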

Descriptive Statistics Cont.

For the year 2015, the tables below show the number of flights by month, by day of the month, and by day of the week, followed by a cross-tabulation of diverted and cancelled flights. We then summarise the scheduled time, elapsed time, and air time of the flights together with the delay variables, including the delays due to security, weather, and the air system. For each of these variables we calculated the mean, median, and quartile values.

##       
##             1      2      3      4      5      6      7      8      9
##   2015 469968 429191 504312 485151 496993 503897 520718 510536 464946
##       
##            10     11     12
##   2015 486165 467972 479230
## 
##      1      2      3      4      5      6      7      8      9     10 
## 189477 195986 190007 190893 189766 191232 187598 193964 194224 189288 
##     11     12     13     14     15     16     17     18     19     20 
## 190756 190872 195089 188611 192950 195899 191319 191393 193284 195707 
##     21     22     23     24     25     26     27     28     29     30 
## 189413 192725 193560 185017 187317 187387 191920 191401 179441 178771 
##     31 
## 103812
## 
##      1      2      3      4      5      6      7 
## 865543 844600 855897 872521 862209 700545 817764
##    
##           0       1
##   0 5714008   89884
##   1   15187       0
##  SCHEDULED_TIME   ELAPSED_TIME       AIR_TIME      ARRIVAL_DELAY    
##  Min.   : 18.0   Min.   : 14      Min.   :  7.0    Min.   : -87.00  
##  1st Qu.: 85.0   1st Qu.: 82      1st Qu.: 60.0    1st Qu.: -13.00  
##  Median :123.0   Median :118      Median : 94.0    Median :  -5.00  
##  Mean   :141.7   Mean   :137      Mean   :113.5    Mean   :   4.41  
##  3rd Qu.:173.0   3rd Qu.:168      3rd Qu.:144.0    3rd Qu.:   8.00  
##  Max.   :718.0   Max.   :766      Max.   :690.0    Max.   :1971.00  
##  NA's   :6       NA's   :105071   NA's   :105071   NA's   :105071   
##  AIR_SYSTEM_DELAY  SECURITY_DELAY    AIRLINE_DELAY     LATE_AIRCRAFT_DELAY
##  Min.   :   0      Min.   :  0       Min.   :   0      Min.   :   0       
##  1st Qu.:   0      1st Qu.:  0       1st Qu.:   0      1st Qu.:   0       
##  Median :   2      Median :  0       Median :   2      Median :   3       
##  Mean   :  13      Mean   :  0       Mean   :  19      Mean   :  23       
##  3rd Qu.:  18      3rd Qu.:  0       3rd Qu.:  19      3rd Qu.:  29       
##  Max.   :1134      Max.   :573       Max.   :1971      Max.   :1331       
##  NA's   :4755640   NA's   :4755640   NA's   :4755640   NA's   :4755640    
##  WEATHER_DELAY     DEPARTURE_DELAY   ARRIVAL_DELAY.1  
##  Min.   :   0      Min.   : -82.00   Min.   : -87.00  
##  1st Qu.:   0      1st Qu.:  -5.00   1st Qu.: -13.00  
##  Median :   0      Median :  -2.00   Median :  -5.00  
##  Mean   :   3      Mean   :   9.37   Mean   :   4.41  
##  3rd Qu.:   0      3rd Qu.:   7.00   3rd Qu.:   8.00  
##  Max.   :1211      Max.   :1988.00   Max.   :1971.00  
##  NA's   :4755640   NA's   :86153     NA's   :105071
##   Min Q1 Median Q3  Max     Mean       SD       n  Missing
## 1 -82 -5     -2  7 1988 9.370158 37.08094 5819079 24721358
##   Min  Q1 Median Q3  Max     Mean      SD       n  Missing
## 1 -87 -13     -5  8 1971 4.407057 39.2713 5819079 24721358
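
In the last two summary rows, `n` equals the number of rows of `flights` (5819079) and `Missing` (24721358) is larger than `n`. That figure equals the sum of the `missing.count` column in the `df.info(flights)` output, so it appears to count missing values across the whole data frame and not in the summarised column. According to `df.info(flights)`, the per-column missing counts are 86153 for DEPARTURE_DELAY and 105071 for ARRIVAL_DELAY. The code that produced the output above is not shown, so the following is only a sketch on made-up data of how per-column figures could be computed; `delay_summary` and `flights_toy` are hypothetical names.

```r
# Hypothetical helper: summary statistics for one numeric column,
# counting missing values in that column only
delay_summary <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  data.frame(Min     = min(v, na.rm = TRUE),
             Q1      = q[1],
             Median  = q[2],
             Q3      = q[3],
             Max     = max(v, na.rm = TRUE),
             Mean    = mean(v, na.rm = TRUE),
             SD      = sd(v, na.rm = TRUE),
             n       = sum(!is.na(v)),   # non-missing observations
             Missing = sum(is.na(v)))    # missing observations in this column
}

# Toy data (made up), standing in for the flights table
flights_toy <- data.frame(
  MONTH           = c(1, 1, 2, 2, 3, 3),
  DIVERTED        = c(0, 0, 0, 1, 0, 0),
  CANCELLED       = c(0, 0, 1, 0, 0, 0),
  DEPARTURE_DELAY = c(-3, 10, NA, 5, 0, 42),
  ARRIVAL_DELAY   = c(-8, 4, NA, NA, -2, 35)
)

# Frequency tables of the kind shown above
print(table(flights_toy$MONTH))
print(table(flights_toy$DIVERTED, flights_toy$CANCELLED))

dep <- delay_summary(flights_toy$DEPARTURE_DELAY)
arr <- delay_summary(flights_toy$ARRIVAL_DELAY)
print(rbind(DEPARTURE_DELAY = dep, ARRIVAL_DELAY = arr))
```

Reporting `n` as the number of non-missing values keeps `n + Missing` equal to the number of rows, which makes the summary easier to check against the column details.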

##                         AIRLINE AirlineCode Mean.Arrival.Delay
## 2          Alaska Airlines Inc.          AS         -0.9765631
## 4          Delta Air Lines Inc.          DL          0.1867536
## 7        Hawaiian Airlines Inc.          HA          2.0230928
## 1        American Airlines Inc.          AA          3.4513721
## 12              US Airways Inc.          US          3.7062088
## 14       Southwest Airlines Co.          WN          4.3749637
## 13               Virgin America          VX          4.7377057
## 11        United Air Lines Inc.          UA          5.4315939
## 10        Skywest Airlines Inc.          OO          5.8456522
## 8  American Eagle Airlines Inc.          MQ          6.4578735
## 5   Atlantic Southeast Airlines          EV          6.5853787
## 3               JetBlue Airways          B6          6.6778608
## 6        Frontier Airlines Inc.          F9         12.5047064
## 9              Spirit Air Lines          NK         14.4717995

##                         AIRLINE AirlineCode Mean.Departure.Delay
## 7        Hawaiian Airlines Inc.          HA            0.4857132
## 2          Alaska Airlines Inc.          AS            1.7858007
## 12              US Airways Inc.          US            6.1411369
## 4          Delta Air Lines Inc.          DL            7.3692542
## 10        Skywest Airlines Inc.          OO            7.8011038
## 5   Atlantic Southeast Airlines          EV            8.7159345
## 1        American Airlines Inc.          AA            8.9008563
## 13               Virgin America          VX            9.0225951
## 8  American Eagle Airlines Inc.          MQ           10.1251882
## 14       Southwest Airlines Co.          WN           10.5819863
## 3               JetBlue Airways          B6           11.5143527
## 6        Frontier Airlines Inc.          F9           13.3508583
## 11        United Air Lines Inc.          UA           14.4354410
## 9              Spirit Air Lines          NK           15.9447659
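
The two tables above rank the airlines by mean arrival delay and by mean departure delay. The code behind them is not shown; one plausible way to build such a table with base R is to average the delay per airline code and then join the airline names. The sketch below uses made-up numbers, and `airlines_toy`, `flights_toy`, `mean_arr`, and `result` are hypothetical names.

```r
# Toy stand-ins for the real tables (made-up numbers)
airlines_toy <- data.frame(IATA_CODE = c("AA", "DL", "UA"),
                           AIRLINE   = c("American Airlines Inc.",
                                         "Delta Air Lines Inc.",
                                         "United Air Lines Inc."),
                           stringsAsFactors = FALSE)
flights_toy <- data.frame(AIRLINE       = c("AA", "AA", "DL", "DL", "UA", "UA"),
                          ARRIVAL_DELAY = c(10, NA, -4, 2, 6, 12),
                          stringsAsFactors = FALSE)

# Mean arrival delay per airline code; the formula interface of aggregate()
# drops rows with a missing delay by default
mean_arr <- aggregate(ARRIVAL_DELAY ~ AIRLINE, data = flights_toy, FUN = mean)
names(mean_arr) <- c("AirlineCode", "Mean.Arrival.Delay")

# Attach the full airline name and sort from smallest to largest mean delay
result <- merge(airlines_toy, mean_arr, by.x = "IATA_CODE", by.y = "AirlineCode")
result <- result[order(result$Mean.Arrival.Delay), ]
print(result)
```

The same pattern with `DEPARTURE_DELAY` in place of `ARRIVAL_DELAY` would give the departure ranking.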

## 
## AK AL AR AS AZ CA CO CT DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI MN 
## 19  5  4  1  4 22 10  1  1 17  7  1  5  5  6  7  4  4  4  7  5  1  2 15  8 
## MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX UT VA VI VT 
##  5  5  8  8  8  3  1  3  4  3 14  5  3  5  8  3  1  4  3  5 24  5  7  2  1 
## WA WI WV WY 
##  4  8  1  6

## Warning: Removed 191224 rows containing non-finite values (stat_density).

Hypothesis Testing

-The t-test works best for a sample that is unbiased and approximately normally distributed.

-The data for this project comes from the open data platform Kaggle, and the original source of the dataset is the U.S. Department of Transportation.

-It is assumed that the data collected is unbiased with respect to the departure and arrival delays of flights at the different airports in the USA.

-To check whether the two samples have homogeneous variance, we performed a Levene test prior to the t-test.

-The Levene test assesses the homogeneity of variance between the samples under observation (here the samples are the flight departure delays and the flight arrival delays).

-It works by computing the absolute deviation of each observation from the centre of its sample (here the median) and comparing the means of these absolute deviations between the samples. As this dataset has a huge number of entries, the Levene test gave a p-value of less than 2.2e-16.

-For this project we use a significance level of 0.05. The p-value from the Levene test is far smaller than 0.05, so the variances cannot be assumed to be equal.
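
The median-centred Levene test described above can be sketched with base R alone: take the absolute deviation of every observation from its group median and run a one-way ANOVA on those deviations. This is an illustration on simulated data, not the code used for the report; `levene_median`, `delay_toy`, and `group_toy` are hypothetical names. The layout (one long vector of delays plus a group label) is an assumption, although it is consistent with the residual degrees of freedom in the report's output, which match the number of non-missing departure and arrival delays combined, minus two.

```r
# Hypothetical helper: Levene's test with center = median,
# written with base R only
levene_median <- function(values, group) {
  keep   <- !is.na(values) & !is.na(group)
  values <- values[keep]
  group  <- factor(group[keep])
  # Absolute deviation of each observation from its own group's median
  abs_dev <- abs(values - ave(values, group, FUN = median))
  # One-way ANOVA on the absolute deviations
  anova(lm(abs_dev ~ group))
}

# Toy data (simulated): two groups with clearly different spread
set.seed(1)
delay_toy <- c(rnorm(200, mean = 5, sd = 10), rnorm(200, mean = 5, sd = 30))
group_toy <- rep(c("departure", "arrival"), each = 200)

res <- levene_median(delay_toy, group_toy)
print(res)

# p-value of the group effect; a small value speaks against equal variances
p_value <- res[["Pr(>F)"]][1]
```

With two groups the test has 1 and N - 2 degrees of freedom, where N is the number of non-missing observations across both groups.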

## Levene's Test for Homogeneity of Variance (center = median)
##             Df F value    Pr(>F)    
## group        1   44197 < 2.2e-16 ***
##       11446932                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Paired t-test
## 
## data:  flights$DEPARTURE_DELAY and flights$ARRIVAL_DELAY
## t = 906.89, df = 5714000, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4.877221 4.898348
## sample estimates:
## mean of the differences 
##                4.887785
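
A paired t-test compares the two delays recorded for the same flight, so it is equivalent to a one-sample t-test on the per-flight difference (departure delay minus arrival delay). Only flights with both values present contribute. Because the test works on differences, it does not itself rely on the two columns having equal variance. The sketch below uses made-up numbers; `flights_toy` is a hypothetical stand-in for the real table.

```r
# Toy data (made up): one row per flight, both delays recorded for the same flight
flights_toy <- data.frame(
  DEPARTURE_DELAY = c(12, -3, 25,  0, 7, 40, NA,  5),
  ARRIVAL_DELAY   = c( 8, -9, 20, -6, 1, 33, 10, NA)
)

# Paired t-test: H0 is that the mean of (departure - arrival) equals 0.
# Rows with a missing value in either column are dropped by t.test().
res <- t.test(flights_toy$DEPARTURE_DELAY, flights_toy$ARRIVAL_DELAY,
              paired = TRUE, alternative = "two.sided", conf.level = 0.95)
print(res)

# The same test expressed as a one-sample t-test on the per-flight differences
d    <- flights_toy$DEPARTURE_DELAY - flights_toy$ARRIVAL_DELAY
res1 <- t.test(d, mu = 0)
```

Both calls give the same t statistic, degrees of freedom, and confidence interval, since the paired test is defined through the differences.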

Hypothesis Testing Cont.

-From the results of the Levene test, we reject the hypothesis that the arrival and departure delays have equal variance. The null hypothesis that the mean arrival and departure delays of the flights over the span of a year are the same is assessed with the paired t-test.

-The p-value is used as an alternative to rejection points: it gives the smallest level of significance at which the null hypothesis can be rejected.

-The smaller the p-value, the stronger the evidence against the null hypothesis.

-In this case we use alpha = 0.05, so if the p-value is less than alpha we reject the null hypothesis, and if the p-value is greater than alpha we fail to reject the null hypothesis.

-After statistically analyzing our sample, the p-value we obtained is less than 2.2e-16, which is less than alpha = 0.05, so we reject the null hypothesis and now look at which delay is larger, arrival or departure.

-The boxplot, the barplot of the difference between the arrival and departure delays, and the density plots of the arrival and departure delays agree with the hypothesis test, whose result is statistically significant and leads us to reject the null hypothesis.

-The 95% CI (4.877221, 4.898348) for the mean difference does not contain 0, the value stated by the null hypothesis H0. The calculated means and the barplot of the difference clearly show that the departure delay of flights at airports in the USA is larger than the arrival delay.
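
As a check on the reasoning above, the reported confidence interval can be reconstructed from the numbers in the paired t-test output. The standard error follows from t = mean difference / standard error, and the interval is the mean difference plus or minus the critical t value times that standard error. This is a small worked sketch; the variable names are illustrative, and the t statistic and degrees of freedom are used as printed, so the result agrees with the report only up to rounding.

```r
# Values copied from the paired t-test output
mean_diff <- 4.887785   # mean of the differences
t_stat    <- 906.89     # t statistic as printed
df        <- 5714000    # degrees of freedom as printed (R may round this value)
alpha     <- 0.05

# Standard error implied by t = mean_diff / se
se <- mean_diff / t_stat

# Two-sided critical value and the 95% confidence interval
t_crit <- qt(1 - alpha / 2, df)
ci     <- c(lower = mean_diff - t_crit * se, upper = mean_diff + t_crit * se)
print(round(ci, 4))

# Decision rule at the 5% level: reject H0 when |t| exceeds the critical value,
# which is equivalent to 0 lying outside the confidence interval
reject_h0 <- abs(t_stat) > t_crit
```

The interval is very narrow because the standard error shrinks with the square root of the number of paired flights, which is in the millions here.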

Discussion

After completing the hypothesis testing, which included the Levene test and the t-test on the sample data under observation, we obtained enough statistical evidence to reject the null hypothesis that the arrival and departure delays of flights are the same over the course of one year.

-The bar charts, barplot and density comparison results clearly show that the departure delay of flights at USA airports is larger than the arrival delay.

References

Following references were used in the completion of this assignment.

  1. https://www.transportation.gov/

  2. https://www.kaggle.com/usdot

  3. MATH1324 Introduction to Statistics Lecture Modules 4-8