R Final Project

Data Description

The dataset contains all flights departing from the Houston airports IAH and HOU to various destinations.

Source: Administration at the Bureau of Transportation Statistics: http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0

Software Preparation

Install package hflights

# install.packages('hflights')

Load libraries

library(hflights)
library(ggplot2)
library(MASS)
library(survival)
library(fitdistrplus)

Load Data Into The Environment

flight_data<-hflights
nrow(flight_data)

## [1] 227496

ncol(flight_data)

## [1] 21

names(flight_data)

##  [1] "Year"              "Month"             "DayofMonth"       
##  [4] "DayOfWeek"         "DepTime"           "ArrTime"          
##  [7] "UniqueCarrier"     "FlightNum"         "TailNum"          
## [10] "ActualElapsedTime" "AirTime"           "ArrDelay"         
## [13] "DepDelay"          "Origin"            "Dest"             
## [16] "Distance"          "TaxiIn"            "TaxiOut"          
## [19] "Cancelled"         "CancellationCode"  "Diverted"

Response Variables

In this dataset we take the arrival delay (ArrDelay) as the response variable and the departure delay (DepDelay) as the explanatory variable.

Rationale

The rationale of this design is to find out whether there is a relationship between departure delay and arrival delay. The null hypothesis is that there is no relationship between the two variables; we reject it if the p-value is below 0.05.

Variables description

Summary statistics

ArrDelay_y<-na.omit(flight_data$ArrDelay)
summary(ArrDelay_y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -70.000  -8.000   0.000   7.094  11.000 978.000

DepDelay_X<-na.omit(flight_data$DepDelay)
summary(DepDelay_X)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -33.000  -3.000   0.000   9.445   9.000 981.000

Visual graphic description

# xlim starts below zero so that early (negative-delay) flights are not cut off
hist(ArrDelay_y, main = "Arrival Delay", xlab = "Arrival Delay", ylab = "Frequency",
     xlim = c(-100, 1000), breaks = 20, border = "blue")

hist(DepDelay_X, main = "Departure Delay", xlab = "Departure Delay", ylab = "Frequency",
     xlim = c(-100, 1000), breaks = 20, border = "orange")

Clean the data by dropping rows with missing values

get_na_omittedData=na.omit(flight_data)
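
Note that na.omit() on the whole data frame drops a row if any of its columns is missing, including columns that are not used in this analysis. As a minimal sketch of an alternative, the rows could be filtered on the two delay columns only. The helper name keep_complete_rows is hypothetical and not part of hflights or base R.

```r
# Hypothetical helper (not from any package): keep only the rows where
# every column named in `cols` has an observed (non-NA) value.
keep_complete_rows <- function(data, cols) {
  observed <- complete.cases(data[, cols, drop = FALSE])
  data[observed, , drop = FALSE]
}

# Possible usage with the data loaded above:
# delays <- keep_complete_rows(flight_data, c("DepDelay", "ArrDelay"))
```

Rows that are missing a value only in an unrelated column are kept by this approach, so the resulting row count can differ slightly from na.omit(flight_data).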

Scatter plot

# Map the columns directly inside aes() instead of going through global x and y vectors
ggplot(get_na_omittedData, aes(x = DepDelay, y = ArrDelay)) +
    geom_point(shape = 1) +
    geom_jitter(aes(colour = ArrDelay)) +
    labs(title = "Arrival Delay vs Departure Delay") +
    xlab("Departure Delay") +
    ylab("Arrival Delay") +
    geom_smooth(method = lm)
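
The straight line added by geom_smooth with method lm is an ordinary least-squares fit of arrival delay on departure delay. To read off its coefficients rather than only see the line, the same model could be fitted directly with lm(). This is a sketch only; the helper name fit_delay_line is hypothetical, and it assumes a data frame with ArrDelay and DepDelay columns as in the data above.

```r
# Hypothetical helper: fit ArrDelay ~ DepDelay by ordinary least squares
# and return the intercept, the slope, and the R-squared of the fit.
fit_delay_line <- function(data) {
  fit <- lm(ArrDelay ~ DepDelay, data = data)  # rows with NA are dropped by lm()
  c(intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared)
}

# Possible usage with the cleaned data:
# fit_delay_line(get_na_omittedData)
```

The slope estimates how many minutes of arrival delay go with each additional minute of departure delay, and for a single predictor the R-squared equals the squared Pearson correlation.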

Analysis

We run a t-test to look at the p-value and the estimated means of the two variables.

t.test(flight_data$DepDelay,flight_data$ArrDelay)

## 
##  Welch Two Sample t-test
## 
## data:  flight_data$DepDelay and flight_data$ArrDelay
## t = 26.436, df = 446450, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.176342 2.524892
## sample estimates:
## mean of x mean of y 
##  9.444951  7.094334

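From the test we can see that the p-value is less than 0.05, so we reject the null hypothesis that the mean departure delay equals the mean arrival delay. Note that this test only compares the two means; on its own it does not show that the two variables are dependent.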
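
The Welch test above treats the two delay columns as independent samples, but each flight contributes one departure delay and one arrival delay. A paired t-test on the per-flight differences may therefore be a better fit. The following is a sketch under that assumption; the helper name paired_delay_test is hypothetical.

```r
# Hypothetical helper: paired t-test on departure vs. arrival delay,
# using only flights where both delays were recorded.
paired_delay_test <- function(data) {
  both <- complete.cases(data$DepDelay, data$ArrDelay)
  t.test(data$DepDelay[both], data$ArrDelay[both], paired = TRUE)
}

# Possible usage with the data loaded above:
# paired_delay_test(flight_data)
```

Like the Welch test, this still addresses a difference in means and not the strength of the relationship between the two variables.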

Next, let's look at the correlation between the two variables.

# cor.test() has no `use` argument; it drops incomplete pairs on its own
cor.test(get_na_omittedData$ArrDelay, get_na_omittedData$DepDelay)

## 
##  Pearson's product-moment correlation
## 
## data:  get_na_omittedData$ArrDelay and get_na_omittedData$DepDelay
## t = 1189.8, df = 223870, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9286503 0.9297816
## sample estimates:
##       cor 
## 0.9292181
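
As a quick check of how the reported numbers fit together, the t statistic of the Pearson test can be recomputed from the correlation and the degrees of freedom with the standard formula t = r * sqrt(df / (1 - r^2)). The values below are copied from the output above; the variable names are only illustrative.

```r
r  <- 0.9292181  # sample correlation reported by cor.test
df <- 223870     # degrees of freedom reported by cor.test (n - 2)

# Test statistic for H0: true correlation is 0
t_stat <- r * sqrt(df / (1 - r^2))

# Share of the variance in arrival delay explained by a linear fit
r_squared <- r^2

print(round(t_stat, 1))
print(round(r_squared, 3))
```

Up to rounding of r, this should reproduce the t value of 1189.8 shown in the test output, and it gives an r squared of roughly 0.86.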

Conclusion

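The correlation coefficient of about 0.93 and the very small p-value show a strong positive linear relationship between the two variables, so we reject the null hypothesis of no relationship. The analysis indicates that departure delay is strongly associated with arrival delay, although correlation alone does not establish that one causes the other. The scatter plot points the same way: the jittered points are densely concentrated, which also suggests a strong relationship.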