# This is a chunk where you can load the necessary data and packages required to reproduce the report
# You should also include your code required to prepare your data for analysis.
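# Install the maps package only if it is not already available
if (!requireNamespace("maps", quietly = TRUE)) install.packages("maps")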
library(ggplot2)
library(lattice)
library(magrittr)
library(dplyr)
library(RColorBrewer)
library(car)
library(granova)
library(readxl)
library(readr)
library("maps")
In the contemporary world, air travel has become a regular feature of life. People fly for personal and business reasons, and many domestic and international airlines have emerged in the recent past. It is therefore important for a country's immigration and airport departments to know whether flights are delayed at arrival or at departure. Once the source of the delays is identified, the relevant department can take action to reduce them.
The number of flights grows every day, and keeping arrivals and departures on time is difficult. The USA has thousands of flights departing and arriving every day, and some of them are delayed. This project investigates where the delays actually occur, i.e. whether the delay is larger at arrival or at departure. The aim is to apply statistical analysis and hypothesis testing to the dataset to pinpoint the problem area, and then to suggest ways to reduce the delays.
The USA is one of the world's largest economies, which is one reason so many people travel to it. For that reason I chose data on airports in different US states, with the arrival and departure delay recorded for each flight. It is commonly assumed that delays arise at arrival rather than at departure. Because airports follow different procedures for arrivals and departures, statistical analysis of this data can lead to a logical conclusion about where the problem lies and how it might be addressed.
The dataset used in this project covers US airlines and the flights operating at US airports. It is based on the Oracle Airline Data Model, follows the standards set by IATA, and is recorded in an Oracle database, which is cross-platform compatible. The data is collected and stored around key features such as passenger services, flight schedules, delays and ticket pricing. The data was obtained from Kaggle, and the brief description of the data collection indicates that stratified sampling was used. Because the dataset comprises airlines, airports and flights, the population was divided into groups based on airport location and on flight frequency throughout the year.
The dataset includes the flights, airports and airlines operating in the USA. The flights table is broken down by year, month, day and day of the week. The airports table gives the state and city of each airport. The airlines table gives each airline's name together with the IATA code issued to it. Summary statistics for each of the three tables are shown in tabular form, along with the size of each table in MB. The airline codes are then sorted and compared to make sure the same airlines appear in both the flights and airlines tables.
airlines <- read.csv("C:/Users/ALI HAIDER/Desktop/stats/flight-delays/airlines.csv")
airports <- read.csv("C:/Users/ALI HAIDER/Desktop/stats/airports.csv")
flights <- read.csv("C:/Users/ALI HAIDER/Desktop/stats/flights.csv")
#Summary stats
# This section shows the dataset with key subvariables along with size of file in MBs.
df.info <- function(x) {
  dat  <- as.character(substitute(x))          ## data frame name
  size <- format(object.size(x), units = "Mb") ## size of data frame in Mb
  ## column information
  column.info <- data.frame(
    column        = names(sapply(x, class)),
    class         = sapply(x, class),
    unique.values = sapply(x, function(y) length(unique(y))),
    missing.count = colSums(is.na(x)),
    missing.pct   = round(colSums(is.na(x)) / nrow(x) * 100, 2))
  row.names(column.info) <- 1:nrow(column.info)
  list(data.frame     = data.frame(name = dat, size = size),
       dimensions     = data.frame(rows = nrow(x), columns = ncol(x)),
       column.details = column.info)
}
df.info(flights)
$`data.frame`
$dimensions
$column.details
df.info(airports)
$`data.frame`
$dimensions
$column.details
df.info(airlines)
$`data.frame`
$dimensions
$column.details
# Ensuring the same airlines are represented in flights and airlines
airlines
unique(flights$AIRLINE)
[1] AS AA US DL NK UA HA B6 OO EV MQ F9 WN VX
Levels: AA AS B6 DL EV F9 HA MQ NK OO UA US VX WN
sort(airlines$IATA_CODE) == sort(unique(flights$AIRLINE))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
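The element-wise comparison above relies on both vectors having the same length and comparable types. A slightly more defensive sketch is to compare the two sets of codes directly. The `airlines_demo` and `flights_demo` tables below are small made-up stand-ins, not the real data:

```r
# Hypothetical stand-ins for the airlines and flights tables
airlines_demo <- data.frame(IATA_CODE = c("AA", "AS", "DL"),
                            stringsAsFactors = FALSE)
flights_demo  <- data.frame(AIRLINE = c("DL", "AA", "AA", "AS"),
                            stringsAsFactors = FALSE)

# Convert to character so factors and strings are compared by label
codes_ref  <- as.character(airlines_demo$IATA_CODE)
codes_used <- as.character(unique(flights_demo$AIRLINE))

# TRUE when both tables contain exactly the same set of codes
same_codes <- setequal(codes_ref, codes_used)

# Codes present in only one of the two tables (empty when consistent)
missing_in_airlines <- setdiff(codes_used, codes_ref)
missing_in_flights  <- setdiff(codes_ref, codes_used)

same_codes
```

Unlike `==` on sorted vectors, `setdiff` also reports which codes are missing when the two tables disagree.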
For 2015, the tables below show the number of flights by month, by day of the month and by day of the week, together with the counts of diverted and cancelled flights. I then summarise the scheduled time, elapsed time and the delay variables, including security, weather and air system delay, and calculate the minimum, quartiles, median, mean and standard deviation of the departure and arrival delays. A boxplot shows the main delay metrics. One barplot shows the 0-90th percentiles of the late aircraft, departure and arrival delays, and a second shows the right tail (90-100th percentiles) of the same delays. Further barplots show the mean arrival and departure delay in minutes for each airline.
# Brief summary showing four tables of number of flights in month, in a day,in a week and total diverted and cancelled flights
table(flights$YEAR, flights$MONTH)
1 2 3 4 5 6 7 8 9 10 11 12
2015 469968 429191 504312 485151 496993 503897 520718 510536 464946 486165 467972 479230
table(flights$DAY)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
189477 195986 190007 190893 189766 191232 187598 193964 194224 189288 190756 190872 195089 188611 192950
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
195899 191319 191393 193284 195707 189413 192725 193560 185017 187317 187387 191920 191401 179441 178771
31
103812
table(flights$DAY_OF_WEEK)
1 2 3 4 5 6 7
865543 844600 855897 872521 862209 700545 817764
table(flights$DIVERTED, flights$CANCELLED)
0 1
0 5714008 89884
1 15187 0
# Summary of flight time and Delays
keep <- c("SCHEDULED_TIME","ELAPSED_TIME","AIR_TIME","ARRIVAL_DELAY",
"AIR_SYSTEM_DELAY","SECURITY_DELAY","AIRLINE_DELAY",
"LATE_AIRCRAFT_DELAY","WEATHER_DELAY","DEPARTURE_DELAY")
summary(flights[ ,keep])
SCHEDULED_TIME ELAPSED_TIME AIR_TIME ARRIVAL_DELAY AIR_SYSTEM_DELAY SECURITY_DELAY
Min. : 18.0 Min. : 14 Min. : 7.0 Min. : -87.00 Min. : 0 Min. : 0
1st Qu.: 85.0 1st Qu.: 82 1st Qu.: 60.0 1st Qu.: -13.00 1st Qu.: 0 1st Qu.: 0
Median :123.0 Median :118 Median : 94.0 Median : -5.00 Median : 2 Median : 0
Mean :141.7 Mean :137 Mean :113.5 Mean : 4.41 Mean : 13 Mean : 0
3rd Qu.:173.0 3rd Qu.:168 3rd Qu.:144.0 3rd Qu.: 8.00 3rd Qu.: 18 3rd Qu.: 0
Max. :718.0 Max. :766 Max. :690.0 Max. :1971.00 Max. :1134 Max. :573
NA's :6 NA's :105071 NA's :105071 NA's :105071 NA's :4755640 NA's :4755640
AIRLINE_DELAY LATE_AIRCRAFT_DELAY WEATHER_DELAY DEPARTURE_DELAY
Min. : 0 Min. : 0 Min. : 0 Min. : -82.00
1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: -5.00
Median : 2 Median : 3 Median : 0 Median : -2.00
Mean : 19 Mean : 23 Mean : 3 Mean : 9.37
3rd Qu.: 19 3rd Qu.: 29 3rd Qu.: 0 3rd Qu.: 7.00
Max. :1971 Max. :1331 Max. :1211 Max. :1988.00
NA's :4755640 NA's :4755640 NA's :4755640 NA's :86153
# Data summarization of Flights Departure Delay
flights %>%
  summarise(
    Min = min(DEPARTURE_DELAY, na.rm = TRUE),
    Q1 = quantile(DEPARTURE_DELAY, probs = .25, na.rm = TRUE),
    Median = median(DEPARTURE_DELAY, na.rm = TRUE),
    Q3 = quantile(DEPARTURE_DELAY, probs = .75, na.rm = TRUE),
    Max = max(DEPARTURE_DELAY, na.rm = TRUE),
    Mean = mean(DEPARTURE_DELAY, na.rm = TRUE),
    SD = sd(DEPARTURE_DELAY, na.rm = TRUE),
    n = n(),
    # Count missing values in this column only, not in the whole data frame
    Missing = sum(is.na(DEPARTURE_DELAY))
  )
# Data summarization of Flights Arrival Delay
flights %>%
  summarise(
    Min = min(ARRIVAL_DELAY, na.rm = TRUE),
    Q1 = quantile(ARRIVAL_DELAY, probs = .25, na.rm = TRUE),
    Median = median(ARRIVAL_DELAY, na.rm = TRUE),
    Q3 = quantile(ARRIVAL_DELAY, probs = .75, na.rm = TRUE),
    Max = max(ARRIVAL_DELAY, na.rm = TRUE),
    Mean = mean(ARRIVAL_DELAY, na.rm = TRUE),
    SD = sd(ARRIVAL_DELAY, na.rm = TRUE),
    n = n(),
    # Count missing values in this column only, not in the whole data frame
    Missing = sum(is.na(ARRIVAL_DELAY))
  )
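Since the same set of statistics is computed for both delay columns, the summary can be written once and reused. The following is a minimal base-R sketch on made-up numbers. The `summarise_delay` helper and the `demo` data frame are hypothetical and only illustrate the idea:

```r
# Summarise one numeric delay column; missing values are counted separately
summarise_delay <- function(x) {
  c(Min     = min(x, na.rm = TRUE),
    Q1      = unname(quantile(x, 0.25, na.rm = TRUE)),
    Median  = median(x, na.rm = TRUE),
    Q3      = unname(quantile(x, 0.75, na.rm = TRUE)),
    Max     = max(x, na.rm = TRUE),
    Mean    = mean(x, na.rm = TRUE),
    SD      = sd(x, na.rm = TRUE),
    n       = length(x),
    Missing = sum(is.na(x)))
}

# Made-up delays in minutes (negative = early), with one missing value each
demo <- data.frame(DEPARTURE_DELAY = c(-5, -2, 0, 7, 30, NA),
                   ARRIVAL_DELAY   = c(-13, -5, -1, 8, 25, NA))

# One row per delay column, one column per statistic
delay_summary <- t(sapply(demo, summarise_delay))
delay_summary
```

Applied to the real data, the same call would be `t(sapply(flights[, c("DEPARTURE_DELAY", "ARRIVAL_DELAY")], summarise_delay))`.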
boxplot(flights[ ,keep[4:length(keep)]], col=4:length(keep), main="Boxplot of main delay metrics")
# create barplot of delay metrics, 0-90th percentile
p.90 <- function(x){
  p <- seq(0, 0.9, 0.1) ## we will look at 0 to 90th percentile of positive delays
  quantile(x[x > 0], probs = p, na.rm = TRUE)
}
barplot(p.90(flights$LATE_AIRCRAFT_DELAY),
main = "Flight Delay Distributions",
xlab = "Percentile range value distribution",
ylab = "Delay time in minutes",
xlim = c(0,length(p.90(flights$LATE_AIRCRAFT_DELAY))*3.5),
col = "red")
barplot(c(rep(0,length(p.90(flights$LATE_AIRCRAFT_DELAY))), p.90(flights$DEPARTURE_DELAY)),
col = "green",
add = TRUE)
barplot(c(rep(0,length(p.90(flights$LATE_AIRCRAFT_DELAY))*2), p.90(flights$ARRIVAL_DELAY)),
col = "blue",
add = TRUE)
legend(0.5, 95, c("Aircraft delay", "Departure Delay", "Arrival Delay"),
fill=c("red", "green", "blue"), cex=0.65)
# create barplot of delay metrics, right tail outliers (90-100th percentile)
right.tail <- function(x){
  p <- seq(0.9, 1, 0.01) ## 90th to 100th percentile of positive delays
  quantile(x[x > 0], probs = p, na.rm = TRUE)
}
barplot(right.tail(flights$LATE_AIRCRAFT_DELAY),
main = "Flight Delay Distribution",
xlab = "Percentile range value distribution (90-100)",
ylab = "Delay in minutes",
xlim = c(0,length(right.tail(flights$LATE_AIRCRAFT_DELAY))*3.5),
ylim = c(0, max(c(flights$LATE_AIRCRAFT_DELAY, flights$DEPARTURE_DELAY, flights$ARRIVAL_DELAY), na.rm=T)),
col = "red")
barplot(c(rep(0,length(right.tail(flights$LATE_AIRCRAFT_DELAY))), right.tail(flights$DEPARTURE_DELAY)),
col = "green",
add = TRUE)
barplot(c(rep(0,length(right.tail(flights$LATE_AIRCRAFT_DELAY))*2), right.tail(flights$ARRIVAL_DELAY)),
col = "blue",
add = TRUE)
legend(0.5, 1300, c("Aircraft delay", "Departure Delay", "Arrival Delay"),
fill=c("red", "green", "blue"), cex=0.65)
# Visualizing average arrival delay times by airline (mean over all flights, including early arrivals)
airline.avg.delay <- aggregate(flights$ARRIVAL_DELAY, by=list(flights$AIRLINE), mean, na.rm=TRUE)
names(airline.avg.delay) <- c("AirlineCode", "Mean.Arrival.Delay")
airline.avg.delay <- merge(airline.avg.delay, airlines, by.x="AirlineCode", by.y="IATA_CODE", all.x=TRUE)
airline.avg.delay <- airline.avg.delay[order(airline.avg.delay$Mean.Arrival.Delay), ]
airline.avg.delay <- airline.avg.delay[ ,c(3,1,2)]
airline.avg.delay
barplot(airline.avg.delay$Mean.Arrival.Delay,
names.arg=airline.avg.delay$AirlineCode,
col="red",
main="Average delay time in terms of arrival by airlines",
xlab="IATA Airline Unique code",
ylab="Mean Arrival Delay")
# Visualizing average departure delay times by airline (mean over all flights, including early departures)
airline.avg.delay <- aggregate(flights$DEPARTURE_DELAY, by=list(flights$AIRLINE), mean, na.rm=TRUE)
names(airline.avg.delay) <- c("AirlineCode", "Mean.Departure.Delay")
airline.avg.delay <- merge(airline.avg.delay, airlines, by.x="AirlineCode", by.y="IATA_CODE", all.x=TRUE)
airline.avg.delay <- airline.avg.delay[order(airline.avg.delay$Mean.Departure.Delay), ]
airline.avg.delay <- airline.avg.delay[ ,c(3,1,2)]
airline.avg.delay
barplot(airline.avg.delay$Mean.Departure.Delay,
names.arg=airline.avg.delay$AirlineCode,
col="red",
main="Average delay time in terms of departure by airlines",
xlab="IATA Airline Unique code",
ylab="Mean Departure Delay")
# Visualizing the location of airports represented
table(airports$STATE)
AK AL AR AS AZ CA CO CT DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV
19 5 4 1 4 22 10 1 1 17 7 1 5 5 6 7 4 4 4 7 5 1 2 15 8 5 5 8 8 8 3 1 3 4 3
NY OH OK OR PA PR RI SC SD TN TX UT VA VI VT WA WI WV WY
14 5 3 5 8 3 1 4 3 5 24 5 7 2 1 4 8 1 6
top12 <- as.data.frame(table(airports$STATE))
top12 <- top12[order(top12$Freq, decreasing=TRUE), ][1:12, ]
# With horiz = TRUE the counts run along the x-axis and the states along the y-axis
barplot(top12$Freq,
        names.arg = top12$Var1,
        col = "dark blue",
        main = "Number of Airports by State (top 12)",
        xlab = "Number of airports",
        ylab = "State",
        horiz=TRUE)
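#Density Comparison
# Label each stacked value with its source; each = nrow(flights) keeps the labels aligned with the values
flights.density <- data.frame(dens = c(flights$DEPARTURE_DELAY, flights$ARRIVAL_DELAY), lines = rep(c("Departure Delay","Arrival Delay"), each = nrow(flights)))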
#Plot
ggplot(flights.density,
aes(x = dens, fill = lines)
) + geom_density(alpha = 0.5) +
ggtitle("Density chart for Departure and Arrival delay of airline flights") +
labs(x="Delay time",y="Density")
# Flight Arrival and departure delay Difference
delay.diff <- flights$DEPARTURE_DELAY - flights$ARRIVAL_DELAY
# Keep differences within +/- 10 minutes and drop flights with a missing delay
delay.diff <- delay.diff[!is.na(delay.diff) & delay.diff < 10 & delay.diff > -10]
#barplot
barplot(
  delay.diff,
  horiz = TRUE,
  col=ifelse(delay.diff>0,"#da1000","#bbbbbb"),
  border = NA,
  main = "Delay difference between Departure and arrival of flights",
  xlab = "Time in minutes",
  xlim = c(-10,10)
)
legend(
"topleft",
legend = c("Departure Delay > Arrival Delay", "Departure Delay < Arrival Delay"),
fill = c("#da1000","#bbbbbb"),
bty = "n",
border = FALSE,
text.font=c(3,1)
)
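With several million flights, a plot with one bar per flight is hard to read. As a complementary sketch, the same comparison can be condensed into the share of flights in each category. The `dep` and `arr` vectors below are made-up numbers, not the real data:

```r
# Made-up delays in minutes; NA marks a flight without a recorded delay
dep <- c(5, -2, 10, 0, NA, 20)
arr <- c(1, -2, 15, -6, NA, 12)

# Positive difference: departure delay larger than arrival delay
delay_diff <- dep - arr
delay_diff <- delay_diff[!is.na(delay_diff)]

# Classify each flight by the sign of the difference
category <- factor(sign(delay_diff),
                   levels = c(-1, 0, 1),
                   labels = c("Departure < Arrival",
                              "Equal",
                              "Departure > Arrival"))

# Proportion of flights in each category
share <- prop.table(table(category))
round(share, 2)
```

The resulting proportions can be passed to `barplot(share)` to give three bars instead of one per flight.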
# map
map("world")
title("Airports")
points(airports$LONGITUDE, airports$LATITUDE, col="red", cex=0.75)
map("usa")
title("Airports")
points(airports$LONGITUDE, airports$LATITUDE, col="red", cex=0.75)
This section covers the hypothesis testing. It explains the tests used and the values obtained in order to assess the assumption stated in the introduction, namely that the arrival and departure delays of flights at US airports are the same. The t-test works best on a sample that is unbiased and approximately normally distributed. The data for this project comes from Kaggle, and its original source is the US Department of Transportation. It is assumed that the recorded departure and arrival delays at the different US airports are unbiased.
To check whether the variances of the two samples are homogeneous, Levene's test was performed before the t-test. Levene's test assesses the homogeneity of variance between the samples under observation (here, the departure and arrival delays). It works by computing the absolute deviation of each observation from the centre of its group (the median, in the version used here) and comparing the means of these deviations between the groups. The test returned a p-value below 2.2e-16. This project uses a significance level of 0.05. The p-value is far smaller than that, so the variances cannot be assumed to be equal.
#Levene Test
y <- c(flights$DEPARTURE_DELAY, flights$ARRIVAL_DELAY)
group <- as.factor(c(rep(1, length(flights$DEPARTURE_DELAY)), rep(2,length(flights$ARRIVAL_DELAY))))
leveneTest(y, group)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 44197 < 2.2e-16 ***
11446932
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
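To make the mechanics of the test concrete, here is a minimal base-R sketch of the median-centred Levene statistic on simulated data: it takes the absolute deviation of each observation from its group median and runs a one-way ANOVA on those deviations. The `levene_by_hand` helper is illustrative only and is not the `car` implementation:

```r
levene_by_hand <- function(y, group) {
  # Drop missing observations and make sure group is a factor
  ok    <- !is.na(y)
  y     <- y[ok]
  group <- factor(group[ok])
  # Absolute deviation of each observation from its group median
  z <- abs(y - ave(y, group, FUN = median))
  # One-way ANOVA on the deviations gives the F statistic and p-value
  fit <- anova(lm(z ~ group))
  c(F = fit[["F value"]][1], p = fit[["Pr(>F)"]][1])
}

# Simulated example: group B has three times the spread of group A
set.seed(1)
y     <- c(rnorm(200, sd = 1), rnorm(200, sd = 3))
group <- rep(c("A", "B"), each = 200)

res <- levene_by_hand(y, group)
res
```

A small p-value here, as in the flight data, indicates that the two groups do not share a common variance.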
#t-test
# Paired test on per-flight differences; var.equal has no effect when paired = TRUE, so it is omitted
t.test(flights$DEPARTURE_DELAY,flights$ARRIVAL_DELAY,paired=TRUE,alternative = "two.sided")
Paired t-test
data: flights$DEPARTURE_DELAY and flights$ARRIVAL_DELAY
t = 906.89, df = 5714000, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.877221 4.898348
sample estimates:
mean of the differences
4.887785
Each test has its own pair of hypotheses. For Levene's test, the null hypothesis is that the variances of the two samples are equal, and the alternative hypothesis is that they differ. For the t-test, the null hypothesis is that the mean difference between the departure and arrival delays is zero, and the alternative hypothesis is that it is not zero. The p-value is the smallest level of significance at which the null hypothesis can be rejected, so the smaller the p-value, the stronger the evidence against the null hypothesis. With alpha = 0.05, we reject the null hypothesis if the p-value is less than alpha and fail to reject it otherwise. The paired t-test gives a p-value below 2.2e-16, which is less than 0.05, so we reject the null hypothesis that the arrival and departure delays are the same on average over the year. The 95% CI for the mean difference (4.877221, 4.898348) does not contain zero, the value specified by H0. The boxplot, the barplot of the difference between departure and arrival delay, and the density plots are consistent with this result. The positive mean difference of about 4.89 minutes (departure minus arrival) shows that departure delays at US airports are larger than arrival delays.
\[H_0: \mu_1 = \mu_2 \]
\[H_A: \mu_1 \ne \mu_2\]
\[S = \sum^n_{i = 1}d^2_i\]
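For a paired comparison, the hypotheses above are equivalent to testing whether the mean of the per-flight differences is zero. The small sketch below computes the paired t statistic and its 95% confidence interval by hand and compares them with `t.test`. The `dep` and `arr` vectors are made-up numbers, not the real data:

```r
# Made-up departure and arrival delays (minutes) for eight flights
dep <- c(12, 5, -3, 20, 8, 0, 15, 7)
arr <- c(6, 1, -8, 14, 9, -5, 10, 2)

# Per-flight differences and their standard error
d  <- dep - arr
n  <- length(d)
se <- sd(d) / sqrt(n)

# Paired t statistic and 95% confidence interval for the mean difference
t_stat <- mean(d) / se
ci     <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * se

# Reference result from the built-in test
ref <- t.test(dep, arr, paired = TRUE)
c(t_by_hand = t_stat, t_from_t.test = unname(ref$statistic))
```

If the interval `ci` excludes zero, the null hypothesis of equal mean delays is rejected at the 5% level, which is the same reasoning applied to the flight data.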
After completing the hypothesis testing (Levene's test and the paired t-test) on the sample data, there is enough statistical evidence to reject the null hypothesis that the arrival and departure delays are the same over the course of one year. The barplots and the density comparison indicate that departure delays at US airports are larger than arrival delays. This analysis matters because once the airport and immigration departments know that the delay arises on the departure side, they can speed up security checks, add staff, or improve equipment, which in turn speeds up the process.
The following references were used to complete this assignment:
1) Introduction to Statistics MATH 1324, Modules 4-6
2) https://rpubs.com/marschmi/RMarkdown
3) https://bookdown.org/yihui/rmarkdown/slidy-presentation.html