In this final paper I analyzed a dataset containing all flights departing from Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in the year 2011.

# loading the dataset into the current session
library('hflights')
## Warning: package 'hflights' was built under R version 3.2.3
# finding out about the number of columns 
ncol(hflights)
## [1] 21
# finding out about the number of rows
nrow(hflights)
## [1] 227496

Dataset description

I was inspired to use this dataset by the description of the final assignment for this class (http://www.rpubs.com/YaRrr/Winter1516FinalPaper). I then found that the dataset already exists as an R package called ‘hflights’, so I installed it directly from within R. The data originally came from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics: http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0. There were 227496 rows and 21 columns in the dataset. The names of the columns were

names(hflights)
##  [1] "Year"              "Month"             "DayofMonth"       
##  [4] "DayOfWeek"         "DepTime"           "ArrTime"          
##  [7] "UniqueCarrier"     "FlightNum"         "TailNum"          
## [10] "ActualElapsedTime" "AirTime"           "ArrDelay"         
## [13] "DepDelay"          "Origin"            "Dest"             
## [16] "Distance"          "TaxiIn"            "TaxiOut"          
## [19] "Cancelled"         "CancellationCode"  "Diverted"

The columns mean (information retrieved from http://www.inside-r.org/node/224880 on 2016/02/03):

Questions

Question 1. What is the mean (including standard deviation) and median distance of the flights? What does the distribution of the variable ‘Distance’ look like?

Question 2. Comparing HOU and IAH, which airport has, on average, the longer departing flights?

Question 3. Was there a significant correlation between the distance a flight covered and its arrival delay?

Question 4. Was there a difference between arrival delays on weekends and arrival delays during the week?

Question 5. Compare the departure delays between the seasons of the year.

Question 6. For the two carriers that operate the most flights in this data set, compare air time as a function of distance. What implications do the findings have?

Analyses

Question 1. What is the mean (including standard deviation) and median distance of the flights? What does the distribution of the variable ‘Distance’ look like?

First I calculate the mean, sd and median for the variable ‘Distance’.

#TASK 2.
mean(hflights$Distance)
## [1] 787.7832
sd(hflights$Distance)
## [1] 453.6806
median(hflights$Distance)
## [1] 809

The mean distance was 787.78 miles (SD = 453.68); the median distance was 809 miles.

Next I create a histogram of the variable ‘Distance’ to get a graphic overview of the distribution.

#TASK 7.
hist(hflights$Distance,
     main = 'Distribution of flight distances',
     xlab = 'Distance [miles]',
    col = 'skyblue',
     border = 'skyblue4')
#I add a line indicating the mean of the group.
abline(v=mean(hflights$Distance), 
       col = 'red',
       lwd = 3)
# I add a line indicating the median of the group.
abline(v= median(hflights$Distance), 
       col = 'purple',
       lty = 5,
       lwd = 3)
legend('topright',
       legend = c('mean', 'median'), 
       lty = c(1,5),
       lwd = c(3,3),
       col = c('red', 'purple'))

Most flights were shorter than 2000 miles, although a few were around 4000 miles. The mean is slightly below the median, which by the usual rule of thumb suggests a slight left skew, even though the few very long flights form a tail on the right.
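
Comparing the mean with the median is only a rule of thumb for skewness. A more direct check is a moment-based skewness coefficient. The following is a minimal sketch, assuming a hypothetical helper `sample.skewness` (not part of the original analysis) and a small made-up vector instead of the flight data:

```r
# Hypothetical helper: moment-based sample skewness
# (negative = longer left tail, positive = longer right tail).
sample.skewness <- function (x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3/2)
}

# Made-up example values, not taken from hflights:
# most values are small, one is very large (right tail).
toy.distance <- c(200, 300, 300, 400, 500, 600, 3900)
toy.skew <- sample.skewness(toy.distance)
toy.skew
```

Applied to the real data, the call would be `sample.skewness(hflights$Distance)`; the sign of the result indicates the direction of the skew independently of the mean-median comparison.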

Question 2. Comparing HOU and IAH, which airport has, on average, the longer departing flights?

I conduct a t-test to compare the distance of the outgoing flights of the two airports HOU and IAH.

#TASK 3.
t.test(Distance ~ Origin,
       data = hflights)
## 
##  Welch Two Sample t-test
## 
## data:  Distance by Origin
## t = -113.32, df = 98156, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -234.8667 -226.8800
## sample estimates:
## mean in group HOU mean in group IAH 
##          609.9853          840.8587

The mean distance of flights leaving from HOU was 609.99 miles. The mean distance of flights leaving from IAH was 840.86 miles. The distances of the outgoing flights from IAH were significantly longer than the distances of the outgoing flights from HOU, t(98156) = -113.32, p < .001.

Question 3. Was there a significant correlation between the distance a flight covered and its arrival delay?

First I get a graphic overview of the relationship between the distance a flight covered and its arrival delay.

plot (hflights$Distance, hflights$ArrDelay, xlab = 'Distance [miles]', ylab = 'Arrival Delay [min]', main = 'Relationship between distance and arrival delay', pch = 20, col = 'navyblue')

By eye, there might be a slightly negative correlation between the two variables.

Next, I check whether the relationship is significant by conducting a correlation test.

#TASK 4.
cor.test (hflights$Distance, hflights$ArrDelay)
## 
##  Pearson's product-moment correlation
## 
## data:  hflights$Distance and hflights$ArrDelay
## t = -2.0981, df = 223870, p-value = 0.0359
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.0085764454 -0.0002919103
## sample estimates:
##          cor 
## -0.004434254

There was a statistically significant but negligibly small negative relationship between the distance a flight covered and its arrival delay, r = -.004, t(223870) = -2.10, p = .036.
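
To see how such a tiny correlation can still be significant, the test statistic can be recomputed by hand from the reported r and degrees of freedom. This is a minimal sketch using only the numbers printed by `cor.test`; the variable names are chosen here for illustration:

```r
# Values copied from the cor.test output
r  <- -0.004434254
df <- 223870

# t statistic for a Pearson correlation: t = r * sqrt(df) / sqrt(1 - r^2)
t.stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p.value <- 2 * pt(-abs(t.stat), df = df)

# Share of variance in arrival delay accounted for by distance
r.squared <- r^2

t.stat
p.value
r.squared
```

Because the t statistic grows with the square root of the degrees of freedom, a correlation this close to zero still crosses the .05 threshold, while `r.squared` shows that distance accounts for only a tiny fraction of the variance in arrival delay.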

Question 4. Was there a difference between arrival delays on weekends and arrival delays during the week?

First, I need to distinguish weekdays from weekend days. I therefore create a function called ‘is.weekend’ that indicates whether a flight took place on a weekend (1) or not (0).

#TASK 10.
# DayOfWeek is coded 1 (Monday) to 7 (Sunday)
is.weekend <- function (x) {
  output <- NA  # missing or out-of-range day codes stay NA
  if (!is.na(x) && x >= 1 && x <= 5) {output <- 0}
  if (!is.na(x) && (x == 6 || x == 7)) {output <- 1}
  return (output)}

Next, I create a new, empty column called ‘weekend’ which will later indicate whether a flight took place on a weekend or not.

hflights$weekend <- NA 
#new column serves as container for new values (a real missing value, not the string 'NA')

Now, I loop over the newly created column ‘weekend’ and apply the function ‘is.weekend’ to indicate whether a flight took place on a weekend or not.

#TASK 11.
for (i in 1:nrow(hflights)) {
  x <- hflights$DayOfWeek [i]
  weekend.i <- is.weekend(x)
  hflights$weekend [i] <- weekend.i 
}

#Checking whether everything went fine

head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0
##      weekend
## 5424       1
## 5425       1
## 5426       0
## 5427       0
## 5428       0
## 5429       0
#Looks good.
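
The row-by-row loop works but is slow on more than 200,000 rows. A vectorized alternative is sketched below on a small made-up vector; it assumes the same day coding as `is.weekend` (1 to 5 = weekday, 6 and 7 = weekend), and the names `toy.day` and `toy.weekend` are illustrative only:

```r
# Made-up day codes (1 = Monday ... 7 = Sunday), plus one missing value
toy.day <- c(6, 7, 1, 2, 5, NA)

# 1 for Saturday/Sunday, 0 for Monday to Friday, NA otherwise
toy.weekend <- ifelse(toy.day %in% 6:7, 1,
                      ifelse(toy.day %in% 1:5, 0, NA))
toy.weekend
```

On the real data the same expression with `hflights$DayOfWeek` in place of `toy.day` would fill the whole column in one step.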

Now I compare the mean arrival delays on weekdays and weekends.

#TASK 9.
aggregate(hflights$ArrDelay ~ hflights$weekend,
          FUN = mean, 
          na.rm = T)
##   hflights$weekend hflights$ArrDelay
## 1                0          7.344843
## 2                1          6.393342

The mean arrival delay on weekdays was 7.34 minutes; the mean arrival delay on weekends was 6.39 minutes.

To see whether this difference was significant, I conduct a t-test.

#TASK 3.
t.test(ArrDelay ~ weekend,
       data = hflights)
## 
##  Welch Two Sample t-test
## 
## data:  ArrDelay by weekend
## t = 6.6418, df = 109600, p-value = 3.112e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.670716 1.232285
## sample estimates:
## mean in group 0 mean in group 1 
##        7.344843        6.393342

The arrival delays on weekdays were significantly longer than on weekends, t(109600) = 6.64, p < .001.

Question 5. Compare the departure delays between the seasons of the year.

I create a new column called ‘Season’ into which I copy the values of the column ‘Month’.

hflights$Season <- hflights$Month

#Checking whether everything went fine
head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0
##      weekend Season
## 5424       1      1
## 5425       1      1
## 5426       0      1
## 5427       0      1
## 5428       0      1
## 5429       0      1
#looks good

Now I recode the values of the column ‘Season’ so that they have the following meanings:

100 = spring (March, April, May)
200 = summer (June, July, August)
300 = fall (September, October, November)
400 = winter (December, January, February)

#TASK 1.
#recoding the spring months
hflights$Season [hflights$Season == 3] <- 100
hflights$Season [hflights$Season == 4] <- 100
hflights$Season [hflights$Season == 5] <- 100

#recoding the summer months
hflights$Season [hflights$Season == 6] <- 200
hflights$Season [hflights$Season == 7] <- 200
hflights$Season [hflights$Season == 8] <- 200

#recoding the fall months 
hflights$Season [hflights$Season == 9] <- 300
hflights$Season [hflights$Season == 10] <- 300
hflights$Season [hflights$Season == 11] <- 300

#recoding the winter months
hflights$Season [hflights$Season == 12] <- 400
hflights$Season [hflights$Season == 1] <- 400
hflights$Season [hflights$Season == 2] <- 400

#checking whether everything went fine
table(hflights$Season)
## 
##   100   200   300   400 
## 57235 60324 54782 55155
#looks good
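
The twelve assignments can also be written as a single lookup. The sketch below uses a made-up month vector and an illustrative lookup vector `season.by.month` whose position i holds the season code for month i, with the same coding of 100 to 400:

```r
# Season code for each month number 1..12
# (100 = spring, 200 = summer, 300 = fall, 400 = winter)
season.by.month <- c(400, 400, 100, 100, 100, 200,
                     200, 200, 300, 300, 300, 400)

# Made-up month values, not taken from hflights
toy.month <- c(1, 3, 6, 9, 12)

# Indexing the lookup vector by month number recodes all values at once
toy.season <- season.by.month[toy.month]
toy.season
```

Because the lookup reads from the original month numbers and writes to a separate vector, it does not depend on the season codes being distinct from the month numbers 1 to 12.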

Now I compare the distribution of the dependent variable ‘DepDelay’ for the four levels of the categorical variable ‘Season’ by means of a boxplot.

#TASK 8.
boxplot(formula = DepDelay ~ Season,
           data = hflights,
        main = 'Departure delay by season',
        xlab = 'Season',
        ylab = 'Departure delay [min]',
        border = c('springgreen', 'yellow', 'orange', 'skyblue'),
        names = c('Spring', 'Summer', 'Fall', 'Winter'))

In all four seasons the median departure delays were around 0 min. In addition, all four distributions are right-skewed, with extreme outliers of almost 1000 minutes.

To get further information, I calculate the mean, standard deviation and median of the departure delay for each of the four seasons.

#TASK 9.
# The formula is passed unnamed so the calls do not depend on the name of that argument.
aggregated.mean.sd.median <- cbind(
mean = aggregate(DepDelay ~ Season,
           data = hflights,
           FUN = mean, 
           na.rm = TRUE),
sd = aggregate(DepDelay ~ Season,
           data = hflights,
           FUN = sd, 
           na.rm = TRUE),
median = aggregate(DepDelay ~ Season,
           data = hflights,
           FUN = median, 
           na.rm = TRUE)
)

aggregated.mean.sd.median 
##   mean.Season mean.DepDelay sd.Season sd.DepDelay median.Season
## 1         100     10.812630       100    31.11001           100
## 2         200     10.757087       200    28.89615           200
## 3         300      6.626174       300    26.10013           300
## 4         400      9.403900       400    28.58172           400
##   median.DepDelay
## 1               1
## 2               1
## 3              -1
## 4               1

In all four seasons the mean departure delay was greater than the median departure delay. On average, the departure delays were greatest in spring (M = 10.81, SD = 31.11) and smallest in fall (M = 6.63, SD = 26.10).
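
The three statistics can also be obtained with a single `aggregate()` call whose function returns a named vector, which avoids the repeated ‘Season’ columns. This is a minimal sketch on made-up data; the data frame `toy` and its values are illustrative and not taken from hflights:

```r
# Made-up data, not taken from hflights
toy <- data.frame(Season   = c(100, 100, 200, 200),
                  DepDelay = c(0, 10, -2, 4))

# One aggregate() call; FUN returns a named vector of three statistics
toy.summary <- aggregate(DepDelay ~ Season,
                         data = toy,
                         FUN = function (x) c(mean = mean(x),
                                              sd = sd(x),
                                              median = median(x)))
toy.summary
```

The result has one row per season; its `DepDelay` column is a matrix with the columns `mean`, `sd` and `median`.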

Question 6. For the two carriers that operate the most flights in this data set, compare air time as a function of distance. What implications do the findings have?

First I calculate the frequencies and relative frequencies of the carriers.

flights.per.carrier <- cbind (Frequency  = table(hflights$UniqueCarrier), RelFreq = prop.table (table(hflights$UniqueCarrier)))

flights.per.carrier
##    Frequency      RelFreq
## AA      3244 0.0142595914
## AS       365 0.0016044238
## B6       695 0.0030549988
## CO     70032 0.3078383796
## DL      2641 0.0116089953
## EV      2204 0.0096880824
## F9       838 0.0036835812
## FL      2139 0.0094023631
## MQ      4648 0.0204311285
## OO     16061 0.0705990435
## UA      2072 0.0091078524
## US      4082 0.0179431726
## WN     45343 0.1993133945
## XE     73053 0.3211177339
## YV        79 0.0003472589

In this dataset, the carrier XE carries out the most flights (32.11%) followed by the carrier CO (30.78%).

Additional information: I found online that the abbreviation XE stands for ExpressJet Airlines, Inc. and that the abbreviation CO stands for Continental Airlines, Inc.

I run a regression analysis of the dependent variable ‘AirTime’ as a function of the independent variable ‘Distance’ separately for ExpressJet Airlines, Inc. and Continental Airlines, Inc.

#TASK 5. 
#Regression analysis for ExpressJet Airlines, Inc.
airtime.lm.XE <- lm(formula = AirTime ~ Distance, data = hflights, subset = UniqueCarrier == 'XE')
summary (airtime.lm.XE)
## 
## Call:
## lm(formula = AirTime ~ Distance, data = hflights, subset = UniqueCarrier == 
##     "XE")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.961  -3.898  -0.387   3.451 113.102 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.395e+01  6.080e-02   229.4   <2e-16 ***
## Distance    1.174e-01  9.306e-05  1261.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.991 on 71667 degrees of freedom
##   (1384 observations deleted due to missingness)
## Multiple R-squared:  0.9569, Adjusted R-squared:  0.9569 
## F-statistic: 1.592e+06 on 1 and 71667 DF,  p-value: < 2.2e-16
#Regression analysis for Continental Airlines, Inc.
airtime.lm.CO <- lm(formula = AirTime ~ Distance, data = hflights, subset = UniqueCarrier == 'CO')
summary (airtime.lm.CO)
## 
## Call:
## lm(formula = AirTime ~ Distance, data = hflights, subset = UniqueCarrier == 
##     "CO")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.777  -8.047  -0.283   6.744  82.730 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.618e+00  1.144e-01   84.11   <2e-16 ***
## Distance    1.238e-01  9.462e-05 1307.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.61 on 69371 degrees of freedom
##   (659 observations deleted due to missingness)
## Multiple R-squared:  0.961,  Adjusted R-squared:  0.961 
## F-statistic: 1.711e+06 on 1 and 69371 DF,  p-value: < 2.2e-16

For ExpressJet, Inc., the independent variable ‘Distance’ significantly predicted the dependent variable ‘AirTime’, b = .12, t(71667) = 1261.7, p < .001. Furthermore, ‘Distance’ explained a significant amount of the variance in the dependent variable ‘AirTime’, R² = .96, F(1, 71667) = 1592000, p < .001.

For Continental Airlines, Inc., too, the independent variable ‘Distance’ significantly predicted the dependent variable ‘AirTime’, b = .12, t(69371) = 1307.94, p < .001. Furthermore, ‘Distance’ explained a significant amount of the variance in the dependent variable ‘AirTime’, R² = .96, F(1, 69371) = 1711000, p < .001.

Seen individually, these results are not surprising. However, I will now contrast the findings for the two carriers ExpressJet, Inc. and Continental Airlines, Inc.:

I create a scatterplot containing air time as a function of distance by carrier.

#TASK 6
XE <- subset(hflights, UniqueCarrier == 'XE')
CO <- subset(hflights, UniqueCarrier == 'CO')

plot (x = XE$Distance,
      y = XE$AirTime,
      xlab = 'Distance [miles]',
      ylab = 'Air time [min]',
      main = 'Air time as a function of distance by carrier',
      pch=20,
      col='lightskyblue'
      )
points (x = CO$Distance,
      y = CO$AirTime,
     pch=20,
      col='lightsalmon'
      )

abline (airtime.lm.XE , col = 'skyblue')
abline (airtime.lm.CO, col = 'salmon')

legend ('topleft', 
        legend = c('ExpressJet, Inc.', 'Continental Airlines, Inc.'),
        col = c('lightskyblue', 'lightsalmon'),
        pch = 20)

The graph shows the positive relationship between distance and air time for the carriers ExpressJet, Inc. and Continental Airlines, Inc. For distances below ~700 miles, Continental Airlines, Inc. flights are on average faster than ExpressJet, Inc. flights. For distances above ~700 miles this relationship is reversed.

Implications of the findings: If somebody is to choose a flight from one of the two carriers based on which one is faster, they should fly with Continental Airlines, Inc. for distances below ~700 miles and with ExpressJet, Inc. for distances above ~700 miles.
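
The crossover distance can also be computed from the two fitted lines instead of being read off the plot. The sketch below uses the rounded coefficients printed in the two regression summaries; the variable names and the helper `predict.airtime` are illustrative and not part of the original analysis:

```r
# Coefficients as printed in the two regression summaries (rounded there)
xe.intercept <- 13.95;  xe.slope <- 0.1174   # ExpressJet (XE)
co.intercept <- 9.618;  co.slope <- 0.1238   # Continental (CO)

# Distance at which both fitted lines predict the same air time
crossover <- (xe.intercept - co.intercept) / (co.slope - xe.slope)

# Hypothetical helper: predicted air time [min] for a given distance [miles]
predict.airtime <- function (intercept, slope, distance) {
  intercept + slope * distance
}

# Predictions on either side of the crossover
xe.500  <- predict.airtime(xe.intercept, xe.slope, 500)
co.500  <- predict.airtime(co.intercept, co.slope, 500)
xe.1000 <- predict.airtime(xe.intercept, xe.slope, 1000)
co.1000 <- predict.airtime(co.intercept, co.slope, 1000)

crossover
```

With these rounded coefficients the two lines cross at roughly 677 miles, in line with the ~700 miles seen in the plot: at 500 miles the Continental line predicts the shorter air time, at 1000 miles the ExpressJet line does.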

Conclusions

In this final paper 227496 recorded flights leaving from Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in the year 2011 were analyzed. The mean distance of these flights was 787.78 miles (SD = 453.68), the median distance was 809 miles. Most flights were shorter than 2000 miles, although there were a few which were about 4000 miles. Comparing IAH and HOU, on average there were longer flights leaving from IAH (M = 840.86 miles) than from HOU (M = 609.99 miles).

There was a negative, though very weak (r = -.004), relationship between the distance a flight covered and its arrival delay, meaning that longer flights tended to have slightly shorter arrival delays. On weekdays (Mon-Fri) the arrival delays were longer than on weekends. The median departure delays were around 0 minutes all year, regardless of the season, although there were extreme outliers of almost 1000 minutes.

Not surprisingly, the distance of a flight predicted its air time very well. However, this relationship differed between carriers: for distances below ~700 miles, Continental Airlines, Inc. flights were on average faster than ExpressJet, Inc. flights, whereas the reverse held for distances above ~700 miles. So, if time is an important factor in choosing a flight, one should consider different airlines for different flight distances.

Final remarks: All statistical tests yielded significant results. However, one should be aware that this is a large data set, so it is easy to obtain statistically significant results even for very small effects.