In this final paper I analyzed a dataset containing all flights departing from Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in the year 2011.
# loading the dataset into the current session
library('hflights')
## Warning: package 'hflights' was built under R version 3.2.3
# finding out about the number of columns
ncol(hflights)
## [1] 21
# finding out about the number of rows
nrow(hflights)
## [1] 227496
I was inspired to use this dataset by the description of the final task of this class’s assignment (http://www.rpubs.com/YaRrr/Winter1516FinalPaper). I then found that the dataset already existed as an R package called ‘hflights’, so I installed it directly from within R. The data originally came from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics: http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0. There were 227496 rows and 21 columns in the dataset. The names of the columns were
names(hflights)
## [1] "Year" "Month" "DayofMonth"
## [4] "DayOfWeek" "DepTime" "ArrTime"
## [7] "UniqueCarrier" "FlightNum" "TailNum"
## [10] "ActualElapsedTime" "AirTime" "ArrDelay"
## [13] "DepDelay" "Origin" "Dest"
## [16] "Distance" "TaxiIn" "TaxiOut"
## [19] "Cancelled" "CancellationCode" "Diverted"
The columns mean (information retrieved from http://www.inside-r.org/node/224880 on 2016/02/03):
Question 1. What is the mean (including standard deviation) and median distance of the flights? What does the distribution of the variable ‘Distance’ look like?
Question 2. Comparing HOU and IAH, which airport had the longer departing flights on average?
Question 3. Was there a significant correlation between the distance a flight covered and its arrival delay?
Question 4. Was there a difference between arrival delays on weekends and arrival delays during the week?
Question 5. Compare the departure delays between the seasons of the year.
Question 6. Compare the air time as a function of the distance between the two carriers that carry out the most flights in this data set. What implications do the findings have?
First I calculate the mean, sd and median for the variable ‘Distance’.
#TASK 2.
mean(hflights$Distance)
## [1] 787.7832
sd(hflights$Distance)
## [1] 453.6806
median(hflights$Distance)
## [1] 809
The mean distance was 787.78 miles (SD = 453.68), the median distance was 809 miles.
Next I create a histogram of the variable ‘Distance’ to get a graphic overview of the distribution.
#TASK 7.
hist(hflights$Distance,
main = 'Distribution of flight distances',
xlab = 'Distance [miles]',
col = 'skyblue',
border = 'skyblue4')
#I add a line indicating the mean of the group.
abline(v=mean(hflights$Distance),
col = 'red',
lwd = 3)
# I add a line indicating the median of the group.
abline(v= median(hflights$Distance),
col = 'purple',
lty = 5,
lwd = 3)
legend('topright',
legend = c('mean', 'median'),
lty = c(1,5),
lwd = c(3,3),
col = c('red', 'purple'))
Most flights were shorter than 2000 miles, although a few were around 4000 miles. The mean is less than the median, which suggests a slightly left-skewed distribution.
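Comparing mean and median is only a rule of thumb for skewness, and it can disagree with the shape of the tails. As a cross-check, the moment-based sample skewness could be computed directly. The following is a minimal sketch: the helper `sample.skewness` is a hypothetical name (it is not part of base R), and the vectors are made-up toy data rather than the hflights values. Positive results indicate a longer right tail, negative results a longer left tail.

```r
# Hypothetical helper (not part of base R): moment-based sample skewness
sample.skewness <- function (x) {
  x <- x[!is.na(x)]                # drop missing values
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy vectors for illustration only
right.tailed <- c(1, 2, 2, 3, 3, 3, 20)
symmetric <- c(1, 2, 3, 4, 5)

sample.skewness(right.tailed)   # positive: long right tail
sample.skewness(symmetric)      # zero: symmetric around the mean
```

Applied to the data, the call would be `sample.skewness(hflights$Distance)`.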
I conduct a t-test to compare the distance of the outgoing flights of the two airports HOU and IAH.
#TASK 3.
t.test(Distance ~ Origin,
data = hflights)
##
## Welch Two Sample t-test
##
## data: Distance by Origin
## t = -113.32, df = 98156, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -234.8667 -226.8800
## sample estimates:
## mean in group HOU mean in group IAH
## 609.9853 840.8587
The mean distance of flights leaving from HOU was 609.99 miles. The mean distance of flights leaving from IAH was 840.86 miles. Flights departing from IAH were significantly longer than flights departing from HOU, t(98156) = -113.32, p < .001.
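To make clear what the Welch test reports, the t statistic and its degrees of freedom can be reproduced by hand. This is only a sketch on made-up toy vectors (`hou` and `iah` are hypothetical stand-ins, not the real distances); it compares the manual result with what `t.test` returns.

```r
# Hypothetical toy data standing in for flight distances from two airports
hou <- c(250, 300, 610, 700, 900)
iah <- c(400, 820, 850, 1100, 1400, 1600)

# Per-group variance of the mean
v.hou <- var(hou) / length(hou)
v.iah <- var(iah) / length(iah)

# Welch's t: difference in means divided by the unpooled standard error
t.manual <- (mean(hou) - mean(iah)) / sqrt(v.hou + v.iah)

# Welch-Satterthwaite approximation of the degrees of freedom
df.manual <- (v.hou + v.iah)^2 /
  (v.hou^2 / (length(hou) - 1) + v.iah^2 / (length(iah) - 1))

# Built-in test for comparison (Welch is the default in t.test)
t.builtin <- t.test(hou, iah)
c(t.manual, unname(t.builtin$statistic))
c(df.manual, unname(t.builtin$parameter))
```

The negative sign of t only reflects that the first group (HOU) has the smaller mean.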
First I get a graphic overview of the relationship between the distance a flight covered and its arrival delay.
plot (hflights$Distance, hflights$ArrDelay, xlab = 'Distance [miles]', ylab = 'Arrival Delay [min]', main = 'Relationship between distance and arrival delay', pch = 20, col = 'navyblue')
By eye, there might be a slightly negative correlation between the two variables.
Next, I check whether the relationship is significant by conducting a correlation test.
#TASK 4.
cor.test (hflights$Distance, hflights$ArrDelay)
##
## Pearson's product-moment correlation
##
## data: hflights$Distance and hflights$ArrDelay
## t = -2.0981, df = 223870, p-value = 0.0359
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.0085764454 -0.0002919103
## sample estimates:
## cor
## -0.004434254
There was a statistically significant but very weak negative relationship between the distance a flight covered and its arrival delay, r = -.004, t(223870) = -2.10, p = .036.
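With this many observations, even a tiny correlation can reach significance, so it helps to look at the effect size as well. The sketch below takes r and the degrees of freedom from the output above, recomputes the t statistic from them, and squares r to get the share of variance involved. The variable names are my own.

```r
# Values copied from the cor.test output above
r <- -0.004434254
df <- 223870

# t statistic implied by r and its degrees of freedom
t.from.r <- r * sqrt(df / (1 - r^2))
t.from.r

# Share of variance in arrival delay linearly associated with distance
r.squared <- r^2
r.squared
```

The recomputed t agrees with the reported -2.0981, and r squared is about 0.00002, so distance accounts for roughly 0.002 percent of the variance in arrival delay.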
First, I need to differentiate between days that are weekdays and days which are weekends. I therefore create a function called ‘is.weekend’ that indicates whether a flight took place on a weekend (1) or not (0).
#TASK 10.
is.weekend <- function (x) {
# missing or out-of-range day codes yield a real missing value
output <- NA
if (!is.na(x) && x >= 1 && x <= 5) {output <- 0}
if (!is.na(x) && (x == 6 || x == 7)) {output <- 1}
return (output)}
Next, I create a new, empty column called ‘weekend’ which will later indicate whether a flight took place on a weekend or not.
hflights$weekend <- NA
#new column serves as container for new values (NA rather than the string 'NA')
Now, I loop over the rows of the dataset and apply the function ‘is.weekend’ to each flight's day of the week, storing the result in the new column ‘weekend’.
#TASK 11.
for (i in 1:nrow(hflights)) {
x <- hflights$DayOfWeek [i]
weekend.i <- is.weekend(x)
hflights$weekend [i] <- weekend.i
}
#Checking whether everything went fine
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
## weekend
## 5424 1
## 5425 1
## 5426 0
## 5427 0
## 5428 0
## 5429 0
#Looks good.
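The loop above visits every row one at a time, which is slow for more than 200000 flights. As an alternative sketch, the same coding can be done in one vectorised step with `ifelse` and `%in%`. The vector `day.of.week` is a made-up stand-in for `hflights$DayOfWeek`, with a missing value and an out-of-range code added to show how they are handled.

```r
# Toy day-of-week codes (1 = Monday ... 7 = Sunday), illustration only
day.of.week <- c(6, 7, 1, 2, 3, 4, NA, 9)

# 0 for Monday to Friday, 1 for Saturday and Sunday, NA for anything else
weekend <- ifelse(day.of.week %in% 1:5, 0,
                  ifelse(day.of.week %in% 6:7, 1, NA))
weekend
```

On the real data this would be assigned to `hflights$weekend` from `hflights$DayOfWeek`.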
Now I compare the mean arrival delays on weekdays and weekends.
#TASK 9.
aggregate(hflights$ArrDelay ~ hflights$weekend,
FUN = mean,
na.rm = T)
## hflights$weekend hflights$ArrDelay
## 1 0 7.344843
## 2 1 6.393342
The mean arrival delay on weekdays was 7.34 minutes, the mean arrival delay on weekends was 6.39 minutes.
To see whether this difference was significant, I conduct a t-test.
#TASK 3.
t.test(ArrDelay ~ weekend,
data = hflights)
##
## Welch Two Sample t-test
##
## data: ArrDelay by weekend
## t = 6.6418, df = 109600, p-value = 3.112e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.670716 1.232285
## sample estimates:
## mean in group 0 mean in group 1
## 7.344843 6.393342
The arrival delays on weekdays were significantly longer than on weekends, t(109600) = 6.64, p < .001.
I create a new column called ‘Season’ into which I copy the values of the column ‘Month’.
hflights$Season <- hflights$Month
#Checking whether everything went fine
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
## weekend Season
## 5424 1 1
## 5425 1 1
## 5426 0 1
## 5427 0 1
## 5428 0 1
## 5429 0 1
#looks good
Now I recode the values of the column ‘Season’, so that they have the following meanings: 100 = spring (March, April, May); 200 = summer (June, July, August); 300 = fall (September, October, November); 400 = winter (December, January, February).
#TASK 1.
#recoding the spring months
hflights$Season [hflights$Season %in% c(3, 4, 5)] <- 100
#recoding the summer months
hflights$Season [hflights$Season %in% c(6, 7, 8)] <- 200
#recoding the fall months
hflights$Season [hflights$Season %in% c(9, 10, 11)] <- 300
#recoding the winter months
hflights$Season [hflights$Season %in% c(12, 1, 2)] <- 400
#checking whether everything went fine
table(hflights$Season)
##
## 100 200 300 400
## 57235 60324 54782 55155
#looks good
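Numeric codes such as 100 to 400 need a legend to be readable. As a sketch of another approach, a lookup vector indexed by month number could map each month straight to a labelled factor. The names `season.by.month` and `month` are hypothetical, and the month values are toy data.

```r
# Hypothetical lookup: position i holds the season label for month i
season.by.month <- c('Winter', 'Winter',
                     'Spring', 'Spring', 'Spring',
                     'Summer', 'Summer', 'Summer',
                     'Fall', 'Fall', 'Fall',
                     'Winter')

# Toy month numbers standing in for hflights$Month
month <- c(1, 3, 6, 9, 12)

# Index the lookup by month and fix the order of the factor levels
season <- factor(season.by.month[month],
                 levels = c('Spring', 'Summer', 'Fall', 'Winter'))
season
```

With a labelled factor, plots and tables show the season names without a separate `names` argument.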
Now I compare the distribution of the dependent variable ‘DepDelay’ for the four levels of the categorical variable ‘Season’ by means of a boxplot.
#TASK 8.
boxplot(formula = DepDelay ~ Season,
data = hflights,
main = 'Departure delay by season',
xlab = 'Season',
ylab = 'Departure delay [min]',
border = c('springgreen', 'yellow', 'orange', 'skyblue'),
names = c('Spring', 'Summer', 'Fall', 'Winter'))
In all four seasons the median departure delays were around 0 min. Also, all four distributions are right-skewed, with extreme outliers of up to almost 1000 minutes.
To get further information, I calculate the mean, standard deviation and median of the departure delay for each of the four seasons.
#TASK 9.
aggregated.mean.sd.median <- cbind(
mean = aggregate(formula = DepDelay ~ Season,
data = hflights,
FUN = mean,
na.rm = T),
sd = aggregate(formula = DepDelay ~ Season,
data = hflights,
FUN = sd,
na.rm = T),
median= aggregate(formula = DepDelay ~ Season,
data = hflights,
FUN = median,
na.rm = T)
)
aggregated.mean.sd.median
## mean.Season mean.DepDelay sd.Season sd.DepDelay median.Season
## 1 100 10.812630 100 31.11001 100
## 2 200 10.757087 200 28.89615 200
## 3 300 6.626174 300 26.10013 300
## 4 400 9.403900 400 28.58172 400
## median.DepDelay
## 1 1
## 2 1
## 3 -1
## 4 1
In all four seasons the mean departure delay was greater than the median departure delay. In spring, on average, the departure delays were greatest (M = 10.81, SD = 31.11). In fall, on average, the departure delays were smallest (M = 6.63, SD = 26.10).
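The table above repeats the ‘Season’ column three times because three separate `aggregate` results are bound together. One possible alternative, sketched here on a made-up data frame `toy`, is to let `FUN` return a named vector so that a single call yields all three statistics. The formula interface of `aggregate` drops rows with missing values by default, so no `na.rm` argument is needed in this sketch.

```r
# Toy data for illustration only (not the hflights values)
toy <- data.frame(Season = c(100, 100, 200, 200),
                  DepDelay = c(0, 10, -2, 4))

# FUN returns a named vector, so each statistic becomes one column
toy.summary <- aggregate(DepDelay ~ Season,
                         data = toy,
                         FUN = function (x) c(mean = mean(x),
                                              sd = sd(x),
                                              median = median(x)))
toy.summary
```

The statistics are stored as a matrix inside the ‘DepDelay’ column, so a single one can be read with, for example, `toy.summary$DepDelay[, 'mean']`.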
First I calculate the frequencies and relative frequencies of the carriers.
flights.per.carrier <- cbind (Frequency = table(hflights$UniqueCarrier), RelFreq = prop.table (table(hflights$UniqueCarrier)))
flights.per.carrier
## Frequency RelFreq
## AA 3244 0.0142595914
## AS 365 0.0016044238
## B6 695 0.0030549988
## CO 70032 0.3078383796
## DL 2641 0.0116089953
## EV 2204 0.0096880824
## F9 838 0.0036835812
## FL 2139 0.0094023631
## MQ 4648 0.0204311285
## OO 16061 0.0705990435
## UA 2072 0.0091078524
## US 4082 0.0179431726
## WN 45343 0.1993133945
## XE 73053 0.3211177339
## YV 79 0.0003472589
In this dataset, the carrier XE carried out the most flights (32.11%), followed by the carrier CO (30.78%).
Additional information: Online I found out that the abbreviation XE stands for ExpressJet Airlines, Inc. and that the abbreviation CO stands for Continental Airlines, Inc.
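Instead of reading the two largest carriers off the table by eye, the counts can be sorted so that the top entries come first. A minimal sketch with made-up carrier codes (the vector `carrier` is a toy stand-in for `hflights$UniqueCarrier`):

```r
# Toy carrier codes for illustration only
carrier <- c('XE', 'CO', 'XE', 'WN', 'XE', 'CO')

# Count flights per carrier and order from most to fewest
carrier.counts <- sort(table(carrier), decreasing = TRUE)
carrier.counts

# Names of the two carriers with the most flights
top.two <- names(carrier.counts)[1:2]
top.two
```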
I run a regression analysis of the dependent variable ‘AirTime’ as a function of the independent variable ‘Distance’ separately for ExpressJet Airlines, Inc. and Continental Airlines, Inc.
#TASK 5.
#Regression analysis for ExpressJet Airlines, Inc.
airtime.lm.XE <- lm(formula = AirTime ~ Distance, data = hflights, subset = UniqueCarrier == 'XE')
summary (airtime.lm.XE)
##
## Call:
## lm(formula = AirTime ~ Distance, data = hflights, subset = UniqueCarrier ==
## "XE")
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.961 -3.898 -0.387 3.451 113.102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.395e+01 6.080e-02 229.4 <2e-16 ***
## Distance 1.174e-01 9.306e-05 1261.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.991 on 71667 degrees of freedom
## (1384 observations deleted due to missingness)
## Multiple R-squared: 0.9569, Adjusted R-squared: 0.9569
## F-statistic: 1.592e+06 on 1 and 71667 DF, p-value: < 2.2e-16
#Regression analysis for Continental Airlines, Inc.
airtime.lm.CO <- lm(formula = AirTime ~ Distance, data = hflights, subset = UniqueCarrier == 'CO')
summary (airtime.lm.CO)
##
## Call:
## lm(formula = AirTime ~ Distance, data = hflights, subset = UniqueCarrier ==
## "CO")
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.777 -8.047 -0.283 6.744 82.730
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.618e+00 1.144e-01 84.11 <2e-16 ***
## Distance 1.238e-01 9.462e-05 1307.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.61 on 69371 degrees of freedom
## (659 observations deleted due to missingness)
## Multiple R-squared: 0.961, Adjusted R-squared: 0.961
## F-statistic: 1.711e+06 on 1 and 69371 DF, p-value: < 2.2e-16
For ExpressJet, Inc. the independent variable ‘Distance’ significantly predicted the dependent variable ‘AirTime’, b = 0.117, t(71667) = 1261.7, p < .001. Furthermore ‘Distance’ explained a significant amount of the variance in the dependent variable ‘AirTime’, R² = .96, F(1, 71667) = 1592000, p < .001.
Also for Continental Airlines, Inc. the independent variable ‘Distance’ significantly predicted the dependent variable ‘AirTime’, b = 0.124, t(69371) = 1307.94, p < .001. Furthermore ‘Distance’ explained a significant amount of the variance in the dependent variable ‘AirTime’, R² = .96, F(1, 69371) = 1711000, p < .001.
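The two separate fits give one slope per carrier but no direct test of whether the slopes differ. One way this could be checked is a single model with a distance-by-carrier interaction. The sketch below uses made-up toy data with known lines (intercepts 10 and 14, slopes 0.13 and 0.12), not the hflights values, and all names in it are hypothetical.

```r
# Toy data for illustration only: two carriers with different lines
distance <- rep(c(200, 400, 600, 800, 1000), times = 2)
carrier <- rep(c('CO', 'XE'), each = 5)
noise <- rep(c(0.5, -0.5, 0, 0.5, -0.5), times = 2)
airtime <- ifelse(carrier == 'CO',
                  10 + 0.13 * distance,
                  14 + 0.12 * distance) + noise
toy <- data.frame(distance, carrier, airtime)

# One model for both carriers; CO is the reference level
fit <- lm(airtime ~ distance * carrier, data = toy)
coef(fit)
```

Here `carrierXE` is the difference in intercepts and `distance:carrierXE` the difference in slopes (XE minus CO). On the real data, `summary` of such a model would report a test for each of these differences.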
Seen individually, these results are not surprising. However, I will now contrast the findings for the two carriers ExpressJet, Inc. and Continental Airlines, Inc.:
I create a scatterplot containing air time as a function of distance by carrier.
#TASK 6
XE <- subset(hflights, UniqueCarrier == 'XE')
CO <- subset(hflights, UniqueCarrier == 'CO')
plot (x = XE$Distance,
y = XE$AirTime,
xlab = 'Distance [miles]',
ylab = 'Air time [min]',
main = 'Air time as a function of distance by carrier',
pch=20,
col='lightskyblue'
)
points (x = CO$Distance,
y = CO$AirTime,
pch=20,
col='lightsalmon'
)
abline (airtime.lm.XE , col = 'skyblue')
abline (airtime.lm.CO, col = 'salmon')
legend ('topleft',
legend = c('ExpressJet, Inc.', 'Continental Airlines, Inc.'),
col = c('lightskyblue', 'lightsalmon'),
pch = 20)
The graph shows the positive relationship between distance and air time for the carriers ExpressJet, Inc. and Continental Airlines, Inc. For distances below ~700 miles, Continental Airlines, Inc. flights were on average faster than ExpressJet, Inc. flights. For distances above ~700 miles this relationship was reversed.
Implications of the findings: If somebody has to choose between the two carriers based on which one is faster, they should fly with Continental Airlines, Inc. for distances below ~700 miles and with ExpressJet, Inc. for distances above ~700 miles.
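The ~700 mile threshold can be checked against the fitted lines: it is the distance at which both regression equations predict the same air time. The sketch below uses the rounded coefficients from the two summaries above; the variable names and the helper `diff.at` are my own.

```r
# Coefficients copied from the two regression summaries above
intercept.XE <- 13.95
slope.XE <- 0.1174
intercept.CO <- 9.618
slope.CO <- 0.1238

# Distance at which both fitted lines predict the same air time
crossover <- (intercept.XE - intercept.CO) / (slope.CO - slope.XE)
crossover

# Predicted air time difference in minutes (XE minus CO) at distance d
diff.at <- function (d) {
  (intercept.XE + slope.XE * d) - (intercept.CO + slope.CO * d)
}
diff.at(300)    # positive: the CO line predicts the shorter air time
diff.at(1500)   # negative: the XE line predicts the shorter air time
```

With these rounded coefficients the lines cross at roughly 677 miles, consistent with the ~700 miles read off the plot. The predicted differences are only a few minutes in either direction.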
In this final paper 227496 recorded flights leaving from Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in the year 2011 were analyzed. The mean distance of these flights was 787.78 miles (SD = 453.68), the median distance was 809 miles. Most flights were shorter than 2000 miles, although there were a few which were about 4000 miles. Comparing IAH and HOU, on average there were longer flights leaving from IAH (M = 840.86 miles) than from HOU (M = 609.99 miles).
There was a negative relationship between the distance a flight covered and its arrival delay, meaning that the longer a flight was, the shorter its arrival delay tended to be. On weekdays (Mon-Fri) the arrival delays were longer than on weekends. The median departure delay was around 0 minutes in every season, although there were extreme outliers of up to almost 1000 minutes.
Not surprisingly, the distance of a flight predicted its air time very well. However, this relationship differed between carriers: For distances below ~700 miles Continental Airlines, Inc. flights were on average faster than ExpressJet, Inc. flights, whereas this relationship was reversed for distances above ~700 miles. So, if time is an important aspect in choosing a flight, one should consider different airlines for different flight distances.
Final remarks: All statistical tests were significant. However, one should be aware that this is a large dataset, so it is easy to obtain statistically significant results even for very small effects.