ALI HAIDER (s3622366) and Venkata Mallikarjun (s3642400)
October 28, 2018
Rpubs link: http://rpubs.com/ALI_HAIDER/434071
The dataset comes from Kaggle (open source); the link is: https://www.kaggle.com/usdot
In the contemporary world, air travel has become a regular feature of human life.
-People fly for personal or business purposes, and a number of domestic and international airlines have emerged in the recent past.
-It is important for a country's immigration and airport departments to know whether flights are delayed at arrival or at departure.
The world sees an ever-increasing number of flights every day, and keeping arrivals and departures on time is hard.
-The USA has thousands of flights departing and arriving every day, and some of them are delayed.
-This project investigates where the delays occur at the airport, i.e. whether the arrival delay or the departure delay is larger.
-The need of the hour is to apply statistical analysis and hypothesis testing to the dataset, pinpoint the problem area, and then work out ways to reduce the delays.
As the USA is one of the world's largest economies, there is a large influx of people traveling to the USA.
-With this in mind, we obtained data on airports in different states of the USA and on flights, with the arrival and departure delay recorded for each flight.
-Normally it is assumed that delays occur at arrival and not at departure.
-As airports follow different procedures in the arrival and departure lounges, statistical analysis of the data lets us reach a logical conclusion about where the issue lies and how it can be addressed.
airlines <- read.csv("C:/Users/ALI HAIDER/Desktop/stats/flight-delays/airlines.csv")
airports <- read.csv("C:/Users/ALI HAIDER/Desktop/stats/airports.csv")
flights <- read.csv("C:/Users/ALI HAIDER/Desktop/stats/flights.csv")
# Summary stats
# This section shows each dataset's columns with key summary values, along with the size of the data frame in Mb.
df.info <- function(x) {
  dat  <- as.character(substitute(x))           ## data frame name
  size <- format(object.size(x), units = "Mb")  ## size of data frame in Mb
  ## column information
  column.info <- data.frame(
    column        = names(sapply(x, class)),
    class         = sapply(x, class),
    unique.values = sapply(x, function(y) length(unique(y))),
    missing.count = colSums(is.na(x)),
    missing.pct   = round(colSums(is.na(x)) / nrow(x) * 100, 2))
  row.names(column.info) <- 1:nrow(column.info)
  list(data.frame     = data.frame(name = dat, size = size),
       dimensions     = data.frame(rows = nrow(x), columns = ncol(x)),
       column.details = column.info)
}
df.info(flights)
## $data.frame
## name size
## 1 flights 688.5 Mb
##
## $dimensions
## rows columns
## 1 5819079 31
##
## $column.details
## column class unique.values missing.count missing.pct
## 1 YEAR integer 1 0 0.00
## 2 MONTH integer 12 0 0.00
## 3 DAY integer 31 0 0.00
## 4 DAY_OF_WEEK integer 7 0 0.00
## 5 AIRLINE factor 14 0 0.00
## 6 FLIGHT_NUMBER integer 6952 0 0.00
## 7 TAIL_NUMBER factor 4898 0 0.00
## 8 ORIGIN_AIRPORT factor 628 0 0.00
## 9 DESTINATION_AIRPORT factor 629 0 0.00
## 10 SCHEDULED_DEPARTURE integer 1321 0 0.00
## 11 DEPARTURE_TIME integer 1441 86153 1.48
## 12 DEPARTURE_DELAY integer 1218 86153 1.48
## 13 TAXI_OUT integer 185 89047 1.53
## 14 WHEELS_OFF integer 1441 89047 1.53
## 15 SCHEDULED_TIME integer 551 6 0.00
## 16 ELAPSED_TIME integer 713 105071 1.81
## 17 AIR_TIME integer 676 105071 1.81
## 18 DISTANCE integer 1363 0 0.00
## 19 WHEELS_ON integer 1441 92513 1.59
## 20 TAXI_IN integer 186 92513 1.59
## 21 SCHEDULED_ARRIVAL integer 1435 0 0.00
## 22 ARRIVAL_TIME integer 1441 92513 1.59
## 23 ARRIVAL_DELAY integer 1241 105071 1.81
## 24 DIVERTED integer 2 0 0.00
## 25 CANCELLED integer 2 0 0.00
## 26 CANCELLATION_REASON factor 5 0 0.00
## 27 AIR_SYSTEM_DELAY integer 571 4755640 81.72
## 28 SECURITY_DELAY integer 155 4755640 81.72
## 29 AIRLINE_DELAY integer 1068 4755640 81.72
## 30 LATE_AIRCRAFT_DELAY integer 696 4755640 81.72
## 31 WEATHER_DELAY integer 633 4755640 81.72
df.info(airports)
## $data.frame
## name size
## 1 airports 0.1 Mb
##
## $dimensions
## rows columns
## 1 322 7
##
## $column.details
## column class unique.values missing.count missing.pct
## 1 IATA_CODE factor 322 0 0.00
## 2 AIRPORT factor 322 0 0.00
## 3 CITY factor 308 0 0.00
## 4 STATE factor 54 0 0.00
## 5 COUNTRY factor 1 0 0.00
## 6 LATITUDE numeric 320 3 0.93
## 7 LONGITUDE numeric 320 3 0.93
df.info(airlines)
## $data.frame
## name size
## 1 airlines 0 Mb
##
## $dimensions
## rows columns
## 1 14 2
##
## $column.details
## column class unique.values missing.count missing.pct
## 1 IATA_CODE factor 14 0 0
## 2 AIRLINE factor 14 0 0
# Ensuring the same airlines are represented in flights and airlines
airlines
## IATA_CODE AIRLINE
## 1 UA United Air Lines Inc.
## 2 AA American Airlines Inc.
## 3 US US Airways Inc.
## 4 F9 Frontier Airlines Inc.
## 5 B6 JetBlue Airways
## 6 OO Skywest Airlines Inc.
## 7 AS Alaska Airlines Inc.
## 8 NK Spirit Air Lines
## 9 WN Southwest Airlines Co.
## 10 DL Delta Air Lines Inc.
## 11 EV Atlantic Southeast Airlines
## 12 HA Hawaiian Airlines Inc.
## 13 MQ American Eagle Airlines Inc.
## 14 VX Virgin America
unique(flights$AIRLINE)
## [1] AS AA US DL NK UA HA B6 OO EV MQ F9 WN VX
## Levels: AA AS B6 DL EV F9 HA MQ NK OO UA US VX WN
sort(airlines$IATA_CODE) == sort(unique(flights$AIRLINE))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
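The element-wise comparison above relies on both vectors having the same length and, for factors, the same level set. As a minimal sketch of an order-independent alternative, the check can be written with set operations. The vectors `airline_codes` and `flight_codes` are hypothetical stand-ins for `airlines$IATA_CODE` and `unique(flights$AIRLINE)`, not objects from this report.

```r
# Stand-in data (assumption): in the report these would be
# airlines$IATA_CODE and flights$AIRLINE.
airline_codes <- c("UA", "AA", "US", "F9", "B6")
flight_codes  <- c("AA", "B6", "F9", "UA", "US", "AA")

# Compare as character so that differing factor levels cannot cause an error.
same_set <- setequal(as.character(airline_codes), as.character(flight_codes))

# Codes present in one source but not the other (both empty when the sets match).
missing_in_flights  <- setdiff(airline_codes, flight_codes)
missing_in_airlines <- setdiff(flight_codes, airline_codes)

print(same_set)
```

If the two sources ever disagree, the two `setdiff` results name the offending codes, which the element-wise comparison cannot do.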
For 2015, the output below gives flight counts by month, day of month and day of week, together with a cross-tabulation of diverted and cancelled flights. It then summarizes scheduled time, elapsed time, air time and the delay variables, including the delays attributed to security, weather and the air system. For each delay variable we calculated the mean, median and quartile values.
Data summarization of the departure and arrival delays of flights at airports in the USA.
A boxplot is plotted of the main delay metrics.
A barplot is plotted of the delay metrics over the 0-90 percentile range; the delays include late aircraft delay, weather delay and airline delay.
A barplot is plotted to show the right tail, the 90-100 percentile range, of the delays mentioned above.
A barplot shows the average delay time in minutes for each airline.
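The code behind these summaries is not shown in the report. The following is a minimal sketch of how such tabulations could be produced in base R, run on a small made-up data frame `toy` that only borrows the column names of `flights`; it is an illustration under that assumption, not the authors' code.

```r
# Toy stand-in for the flights table (assumption: same column names as the report).
toy <- data.frame(
  AIRLINE         = c("AA", "AA", "DL", "DL", "NK", "NK"),
  MONTH           = c(1, 1, 2, 2, 3, 3),
  DIVERTED        = c(0, 0, 0, 1, 0, 0),
  CANCELLED       = c(0, 1, 0, 0, 0, 0),
  DEPARTURE_DELAY = c(5, NA, -2, 10, 30, 12),
  ARRIVAL_DELAY   = c(3, NA, -6, NA, 25, 9)
)

# Flight counts per month and a diverted-by-cancelled cross-tabulation.
month_counts <- table(toy$MONTH)
div_by_canc  <- table(DIVERTED = toy$DIVERTED, CANCELLED = toy$CANCELLED)

# Mean arrival delay per airline; the formula interface drops rows with NA.
mean_arr <- aggregate(ARRIVAL_DELAY ~ AIRLINE, data = toy, FUN = mean)
mean_arr <- mean_arr[order(mean_arr$ARRIVAL_DELAY), ]

# Split a delay metric at its 90th percentile: body (0-90) versus right tail (90-100).
cutoff    <- quantile(toy$DEPARTURE_DELAY, probs = 0.90, na.rm = TRUE)
tail_rows <- toy[!is.na(toy$DEPARTURE_DELAY) & toy$DEPARTURE_DELAY > cutoff, ]

print(mean_arr)
```

Separating the right tail from the body keeps a handful of extreme delays from flattening the rest of a barplot.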
##
## 1 2 3 4 5 6 7 8 9
## 2015 469968 429191 504312 485151 496993 503897 520718 510536 464946
##
## 10 11 12
## 2015 486165 467972 479230
##
## 1 2 3 4 5 6 7 8 9 10
## 189477 195986 190007 190893 189766 191232 187598 193964 194224 189288
## 11 12 13 14 15 16 17 18 19 20
## 190756 190872 195089 188611 192950 195899 191319 191393 193284 195707
## 21 22 23 24 25 26 27 28 29 30
## 189413 192725 193560 185017 187317 187387 191920 191401 179441 178771
## 31
## 103812
##
## 1 2 3 4 5 6 7
## 865543 844600 855897 872521 862209 700545 817764
##
## 0 1
## 0 5714008 89884
## 1 15187 0
## SCHEDULED_TIME ELAPSED_TIME AIR_TIME ARRIVAL_DELAY
## Min. : 18.0 Min. : 14 Min. : 7.0 Min. : -87.00
## 1st Qu.: 85.0 1st Qu.: 82 1st Qu.: 60.0 1st Qu.: -13.00
## Median :123.0 Median :118 Median : 94.0 Median : -5.00
## Mean :141.7 Mean :137 Mean :113.5 Mean : 4.41
## 3rd Qu.:173.0 3rd Qu.:168 3rd Qu.:144.0 3rd Qu.: 8.00
## Max. :718.0 Max. :766 Max. :690.0 Max. :1971.00
## NA's :6 NA's :105071 NA's :105071 NA's :105071
## AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY LATE_AIRCRAFT_DELAY
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 2 Median : 0 Median : 2 Median : 3
## Mean : 13 Mean : 0 Mean : 19 Mean : 23
## 3rd Qu.: 18 3rd Qu.: 0 3rd Qu.: 19 3rd Qu.: 29
## Max. :1134 Max. :573 Max. :1971 Max. :1331
## NA's :4755640 NA's :4755640 NA's :4755640 NA's :4755640
## WEATHER_DELAY DEPARTURE_DELAY ARRIVAL_DELAY.1
## Min. : 0 Min. : -82.00 Min. : -87.00
## 1st Qu.: 0 1st Qu.: -5.00 1st Qu.: -13.00
## Median : 0 Median : -2.00 Median : -5.00
## Mean : 3 Mean : 9.37 Mean : 4.41
## 3rd Qu.: 0 3rd Qu.: 7.00 3rd Qu.: 8.00
## Max. :1211 Max. :1988.00 Max. :1971.00
## NA's :4755640 NA's :86153 NA's :105071
## Min Q1 Median Q3 Max Mean SD n Missing
## 1 -82 -5 -2 7 1988 9.370158 37.08094 5819079 24721358
## Min Q1 Median Q3 Max Mean SD n Missing
## 1 -87 -13 -5 8 1971 4.407057 39.2713 5819079 24721358
## AIRLINE AirlineCode Mean.Arrival.Delay
## 2 Alaska Airlines Inc. AS -0.9765631
## 4 Delta Air Lines Inc. DL 0.1867536
## 7 Hawaiian Airlines Inc. HA 2.0230928
## 1 American Airlines Inc. AA 3.4513721
## 12 US Airways Inc. US 3.7062088
## 14 Southwest Airlines Co. WN 4.3749637
## 13 Virgin America VX 4.7377057
## 11 United Air Lines Inc. UA 5.4315939
## 10 Skywest Airlines Inc. OO 5.8456522
## 8 American Eagle Airlines Inc. MQ 6.4578735
## 5 Atlantic Southeast Airlines EV 6.5853787
## 3 JetBlue Airways B6 6.6778608
## 6 Frontier Airlines Inc. F9 12.5047064
## 9 Spirit Air Lines NK 14.4717995
## AIRLINE AirlineCode Mean.Departure.Delay
## 7 Hawaiian Airlines Inc. HA 0.4857132
## 2 Alaska Airlines Inc. AS 1.7858007
## 12 US Airways Inc. US 6.1411369
## 4 Delta Air Lines Inc. DL 7.3692542
## 10 Skywest Airlines Inc. OO 7.8011038
## 5 Atlantic Southeast Airlines EV 8.7159345
## 1 American Airlines Inc. AA 8.9008563
## 13 Virgin America VX 9.0225951
## 8 American Eagle Airlines Inc. MQ 10.1251882
## 14 Southwest Airlines Co. WN 10.5819863
## 3 JetBlue Airways B6 11.5143527
## 6 Frontier Airlines Inc. F9 13.3508583
## 11 United Air Lines Inc. UA 14.4354410
## 9 Spirit Air Lines NK 15.9447659
##
## AK AL AR AS AZ CA CO CT DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI MN
## 19 5 4 1 4 22 10 1 1 17 7 1 5 5 6 7 4 4 4 7 5 1 2 15 8
## MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX UT VA VI VT
## 5 5 8 8 8 3 1 3 4 3 14 5 3 5 8 3 1 4 3 5 24 5 7 2 1
## WA WI WV WY
## 4 8 1 6
## Warning: Removed 191224 rows containing non-finite values (stat_density).
-The t-test works best for samples that are unbiased and approximately normally distributed.
-The data for this project come from Kaggle, and the original source of the dataset is the U.S. Department of Transportation.
-We assume that the data are unbiased with respect to the departure and arrival delays of flights at the different airports in the USA.
-To check whether the variances of the two samples are homogeneous, we performed Levene's test prior to the t-test.
-Levene's test assesses the homogeneity of variance between the samples under observation (here, the samples are flight departure delay and arrival delay).
-It works by computing the absolute deviation of each observation from its group's center (the median here) and comparing the means of those deviations across groups. With the huge number of entries in this dataset, Levene's test gave a p-value below 2.2e-16.
-We use a significance level of 0.05; the p-value from Levene's test is far below 0.05, so the variances cannot be assumed to be equal.
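The mechanics of Levene's test with `center = median` can be sketched in base R as a one-way ANOVA on absolute deviations from the group medians. The helper `levene_median` and the simulated delays are hypothetical and only illustrate the technique. The residual degrees of freedom in the output below (11446932) are consistent with the two delay columns having been stacked into one vector with a group label, which is the layout assumed here.

```r
# Hand-rolled Levene test with center = median (the Brown-Forsythe variant).
# `levene_median` is a hypothetical helper, not a function from any package.
levene_median <- function(values, group) {
  keep   <- !is.na(values)
  values <- values[keep]
  group  <- factor(group[keep])
  # Absolute deviation of each observation from its own group's median.
  abs_dev <- abs(values - ave(values, group, FUN = median))
  # One-way ANOVA on those deviations; its F test is the Levene statistic.
  fit <- anova(lm(abs_dev ~ group))
  c(F = fit[["F value"]][1], p = fit[["Pr(>F)"]][1])
}

set.seed(1)
# Simulated delays (assumption): the second group is far more spread out.
delay <- c(rnorm(200, mean = 5, sd = 5), rnorm(200, mean = 5, sd = 30))
kind  <- rep(c("departure", "arrival"), each = 200)

res <- levene_median(delay, kind)
print(res)
# A p-value below 0.05 means equal variances cannot be assumed.
```

With millions of rows, even a small difference in spread produces a very small p-value, so the size of the F statistic is best read alongside the standard deviations themselves.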
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 44197 < 2.2e-16 ***
## 11446932
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Paired t-test
##
## data: flights$DEPARTURE_DELAY and flights$ARRIVAL_DELAY
## t = 906.89, df = 5714000, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.877221 4.898348
## sample estimates:
## mean of the differences
## 4.887785
-Levene's test rejects the hypothesis of equal variances. The paired t-test then addresses the null hypothesis that the mean departure delay and the mean arrival delay of flights over the year are the same.
-The p-value is the smallest level of significance at which the null hypothesis can be rejected, and it is used as an alternative to rejection points (critical values).
-The smaller the p-value, the stronger the evidence against the null hypothesis.
-We use alpha = 0.05: if the p-value is less than alpha we reject the null hypothesis, and if the p-value is greater than alpha we fail to reject it.
-The p-value from the paired t-test is less than 2.2e-16, which is far below alpha = 0.05, so we reject the null hypothesis and next examine which delay is larger, arrival or departure.
-The boxplot, the barplot of the difference between arrival and departure delays, and the density plots of the arrival and departure delays are consistent with the hypothesis test, whose result is statistically significant.
-The 95% CI (4.877221, 4.898348) does not contain 0, the mean difference under the null hypothesis H0. The estimated mean difference of 4.887785 minutes and the barplot of the difference show that the departure delay of flights at US airports is larger than the arrival delay.
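The decision rule can be sketched with `t.test` on simulated paired data; the vectors `dep` and `arr` are made up for illustration and are not the flight data. The last lines are a consistency check on the reported figures: they recover the standard error from the 95% interval and from it a t statistic close to the reported 906.89.

```r
alpha <- 0.05

# Simulated paired delays (assumption): each position is one flight.
set.seed(42)
dep <- rnorm(500, mean = 9, sd = 35)
arr <- dep - 5 + rnorm(500, mean = 0, sd = 10)

# Paired t-test: equivalent to a one-sample t-test on dep - arr.
fit <- t.test(dep, arr, paired = TRUE, conf.level = 1 - alpha)
reject_h0        <- fit$p.value < alpha
ci_excludes_zero <- fit$conf.int[1] > 0 || fit$conf.int[2] < 0

# Consistency check on the reported output: standard error from the 95% CI,
# then the t statistic from the mean difference (the CI midpoint).
ci <- c(4.877221, 4.898348)
se <- diff(ci) / 2 / qnorm(0.975)  # normal quantile is adequate for df in the millions
t_recovered <- mean(ci) / se

print(c(reject_h0 = reject_h0, ci_excludes_zero = ci_excludes_zero))
print(round(t_recovered, 1))
```

For a two-sided test, rejecting at level alpha and the (1 - alpha) confidence interval excluding 0 are the same event, which is why the two flags always agree.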
After completing the hypothesis testing, which included Levene's test and the t-test on the sample data, we found enough statistical evidence to reject the null hypothesis that the arrival and departure delays of flights are the same over the course of one year.
-The barcharts, barplot and density comparison show that the departure delay of flights at US airports is larger than the arrival delay.
This statistical analysis is important because once the airport and immigration departments know that the delay lies on the departure side, they can speed up their security-related checks, add more workforce, or improve the machinery, which in turn speeds up the process.
These statistical results provide enough evidence that flights departing from US airports are delayed more than arriving flights, and the airport authorities and the department of immigration need to sit together to chalk out a plan to reduce the departure delays in particular and to improve the airport system in general.
The following references were used in the completion of this assignment.
MATH1324 Introduction to Statistics Lecture Modules 4-8