The dataset contains all flights departing from the Houston airports IAH and HOU to various destinations.
Source: the Bureau of Transportation Statistics: http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0
Install package hflights
# install.packages('hflights')
Load libraries
library(hflights)
library(ggplot2)
library(MASS)
library(survival)
library(fitdistrplus)
flight_data <- hflights
nrow(flight_data)
## [1] 227496
ncol(flight_data)
## [1] 21
names(flight_data)
##  [1] "Year"              "Month"             "DayofMonth"
##  [4] "DayOfWeek"         "DepTime"           "ArrTime"
##  [7] "UniqueCarrier"     "FlightNum"         "TailNum"
## [10] "ActualElapsedTime" "AirTime"           "ArrDelay"
## [13] "DepDelay"          "Origin"            "Dest"
## [16] "Distance"          "TaxiIn"            "TaxiOut"
## [19] "Cancelled"         "CancellationCode"  "Diverted"
In this dataset we take arrival delay as the response variable and departure delay as the explanatory variable.
The rationale of this design is to find out whether there is a relationship between departure delay and arrival delay. The null hypothesis is that there is no relationship between the two variables; we test it at a significance level of 0.05 and reject it if p < 0.05.
Summary statistics
ArrDelay_y <- na.omit(flight_data$ArrDelay)
summary(ArrDelay_y)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##  -70.000   -8.000    0.000    7.094   11.000  978.000
DepDelay_X <- na.omit(flight_data$DepDelay)
summary(DepDelay_X)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##  -33.000   -3.000    0.000    9.445    9.000  981.000
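Because na.omit() is applied to each column separately above, ArrDelay_y and DepDelay_X need not have the same length or refer to the same flights. A minimal sketch for checking how many values are missing in each column and how many complete pairs remain; the helper count_missing and the toy data frame are illustrative assumptions, not part of the original analysis:

```r
# Hypothetical helper: number of NA values in each named column.
count_missing <- function(df, cols) {
  vapply(df[cols], function(v) sum(is.na(v)), integer(1))
}

# Toy data (made up for illustration, not taken from hflights).
toy <- data.frame(
  DepDelay = c(5, NA, -2, 10),
  ArrDelay = c(3, NA, NA, 12)
)

missing_counts <- count_missing(toy, c("DepDelay", "ArrDelay"))
print(missing_counts)

# Rows where both delays are present, i.e. usable as (x, y) pairs.
n_pairs <- sum(complete.cases(toy[c("DepDelay", "ArrDelay")]))
print(n_pairs)
```

On the real data the corresponding call would be count_missing(flight_data, c("DepDelay", "ArrDelay")).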
Graphical description
# Note: xlim starts at 0, so early (negative) delays are not shown.
hist(ArrDelay_y, main = "Arrival Delay", xlab = "Arrival Delay", ylab = "Frequency",
     xlim = c(0, 1000), breaks = 20, border = "blue")
hist(DepDelay_X, main = "Departure Delay", xlab = "Departure Delay", ylab = "Frequency",
     xlim = c(0, 1000), breaks = 20, border = "orange")
Clean all data
# Drop every row that has a missing value in any column.
get_na_omittedData <- na.omit(flight_data)
y <- get_na_omittedData$ArrDelay
x <- get_na_omittedData$DepDelay
ggplot(get_na_omittedData, aes(x, y)) +
  geom_point(shape = 1) +
  geom_jitter(aes(colour = y)) +
  labs(title = "Arrival Delay vs Departure Delay") +
  xlab("Departure Delay") +
  ylab("Arrival Delay") +
  geom_smooth(method = lm)
We run a t-test to look at the p-value and the estimated means of the two variables.
t.test(flight_data$DepDelay, flight_data$ArrDelay)
##
## Welch Two Sample t-test
##
## data: flight_data$DepDelay and flight_data$ArrDelay
## t = 26.436, df = 446450, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.176342 2.524892
## sample estimates:
## mean of x mean of y
## 9.444951 7.094334
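The p-value is far below 0.05, so we reject the null hypothesis that the mean departure delay equals the mean arrival delay: on average, departure delays are about 2.2 to 2.5 minutes longer than arrival delays. This test compares the two means only; it does not by itself show whether the two variables are related.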
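Since both delays are recorded for the same flight, a paired comparison is a natural alternative to the Welch test. A minimal sketch on simulated data; the simulated values and the names dep and arr are assumptions for illustration only, and on the real data the inputs would be the DepDelay and ArrDelay columns with incomplete rows removed:

```r
set.seed(1)
# Simulated delays in minutes (illustrative only, not hflights values).
dep <- rnorm(200, mean = 9, sd = 20)
arr <- dep - 2 + rnorm(200, mean = 0, sd = 8)

# Paired t-test: tests whether the mean of the per-flight differences is zero.
paired <- t.test(dep, arr, paired = TRUE)
print(paired$estimate)  # mean of the differences
print(paired$conf.int)  # 95% confidence interval for that mean

# Equivalent one-sample test on the differences.
one_sample <- t.test(dep - arr)
print(one_sample$statistic)
```

The paired form removes the flight-to-flight variation that the two delays share, which the unpaired Welch test treats as noise.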
Let's look at the correlation between the two variables.
# cor.test() drops incomplete pairs itself; `use` is an argument of cor(), not cor.test().
cor.test(get_na_omittedData$ArrDelay, get_na_omittedData$DepDelay)
##
## Pearson's product-moment correlation
##
## data: get_na_omittedData$ArrDelay and get_na_omittedData$DepDelay
## t = 1189.8, df = 223870, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9286503 0.9297816
## sample estimates:
## cor
## 0.9292181
A correlation of 0.93 and the very small p-value indicate a strong positive linear relationship between the two variables. The analysis suggests that departure delay is closely associated with arrival delay, although correlation alone does not establish that one causes the other. In the scatter plot, the tight clustering of the jittered points around the fitted line points to the same conclusion.
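The line drawn by geom_smooth(method = lm) corresponds to a simple linear regression of arrival delay on departure delay. With a single predictor, R-squared equals the squared Pearson correlation, so the reported correlation of 0.929 would correspond to roughly 0.86 of the variance explained. A minimal sketch on simulated data; the simulated values and the data frame sim are assumptions for illustration only:

```r
set.seed(42)
# Simulated delays in minutes (illustrative only, not hflights values).
dep <- rexp(500, rate = 1 / 15) - 5
arr <- -2 + 1.0 * dep + rnorm(500, sd = 6)
sim <- data.frame(DepDelay = dep, ArrDelay = arr)

# Simple linear regression of arrival delay on departure delay.
fit <- lm(ArrDelay ~ DepDelay, data = sim)
print(coef(fit))  # intercept and slope of the fitted line

# With a single predictor, R-squared is the squared Pearson correlation.
r <- cor(sim$DepDelay, sim$ArrDelay)
r_squared <- summary(fit)$r.squared
print(c(r = r, r_squared = r_squared))
```

On the real data the corresponding call would be lm(ArrDelay ~ DepDelay, data = get_na_omittedData).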