Load Libraries:

# load Libraries
library(hflights)
library(ggplot2)

Part 1 - Introduction

In this project we analyze a dataset containing all flights departing from the Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in 2011. The dataset is already available as the R package hflights, so we use it directly. The data originally came from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics: http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0

First, we examine whether departure and arrival delays differ between weekdays and weekends.

Second, we examine whether departure and arrival delays differ by season.

Part 2 - Data

The dataset has 227496 rows and 21 columns. The dimensions and the column names are:

## [1] 227496     21
##  [1] "Year"              "Month"             "DayofMonth"       
##  [4] "DayOfWeek"         "DepTime"           "ArrTime"          
##  [7] "UniqueCarrier"     "FlightNum"         "TailNum"          
## [10] "ActualElapsedTime" "AirTime"           "ArrDelay"         
## [13] "DepDelay"          "Origin"            "Dest"             
## [16] "Distance"          "TaxiIn"            "TaxiOut"          
## [19] "Cancelled"         "CancellationCode"  "Diverted"

The data were obtained through the hflights package in R. The variable definitions, taken from the package documentation, are:

Year, Month, DayofMonth: date of departure

DayOfWeek: day of week of departure (useful for removing weekend effects)

DepTime, ArrTime: departure and arrival times (in local time, hhmm)

UniqueCarrier: unique abbreviation for a carrier

FlightNum: flight number

TailNum: airplane tail number

ActualElapsedTime: elapsed time of flight, in minutes

AirTime: flight time, in minutes

ArrDelay, DepDelay: arrival and departure delays, in minutes

Origin, Dest: origin and destination airport codes

Distance: distance of flight, in miles

TaxiIn, TaxiOut: taxi in and out times in minutes

Cancelled: cancelled indicator: 1 = Yes, 0 = No

CancellationCode: reason for cancellation: A = carrier, B = weather, C = national air system, D = security

Diverted: diverted indicator: 1 = Yes, 0 = No

Part 3 - Exploratory data analysis

We first examine the relationship between departure and arrival delays and the day of the week.

To do so, we need to distinguish weekdays from weekend days. We therefore create a function called ‘isweekend’ that indicates whether a flight took place on a weekend (1) or not (0).

Next, we create a new, empty column called ‘weekend’, which will later indicate whether a flight took place on a weekend or not.

Now, we loop over the rows of the dataset, apply the function ‘isweekend’ to the ‘DayOfWeek’ value of each flight, and store the result in the column ‘weekend’.

##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0
##      weekend
## 5424       1
## 5425       1
## 5426       0
## 5427       0
## 5428       0
## 5429       0
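The row-by-row loop described above can be slow on more than 200000 rows. As a minimal sketch of an alternative, assuming a small made-up vector in place of hflights$DayOfWeek (the names day_of_week and weekend are illustrative only), the same flag can be computed in one vectorized step. In the output above, 1 January 2011 (a Saturday) has DayOfWeek 6, so 6 and 7 are taken to be Saturday and Sunday.

```r
# Hypothetical stand-in for hflights$DayOfWeek (1 = Monday, ..., 7 = Sunday)
day_of_week <- c(6, 7, 1, 2, 3, 4, 5)

# 1 for Saturday/Sunday, 0 otherwise; %in% works on the whole vector at once
weekend <- as.integer(day_of_week %in% c(6, 7))

print(weekend)
```

Note that in this sketch a missing day of week would be flagged as 0 rather than NA, because NA %in% c(6, 7) is FALSE.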

Now we compare the mean arrival/departure delays on weekdays and weekends.

##   hflights$weekend hflights$ArrDelay
## 1                0          7.344843
## 2                1          6.393342
##   hflights$weekend hflights$DepDelay
## 1                0          9.657671
## 2                1          8.849475
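The group means above come from aggregate() with a formula. The sketch below shows the same idea on a made-up data frame (toy and its values are hypothetical, not taken from hflights). With the formula interface, rows with a missing delay are dropped by default (na.action = na.omit), so flights without a recorded delay do not distort the means.

```r
# Hypothetical toy data: one weekday flight has no recorded arrival delay
toy <- data.frame(
  weekend  = c(0, 0, 0, 1, 1),
  ArrDelay = c(10, NA, 4, 3, 5)
)

# Mean arrival delay per group; the row with NA is omitted before averaging
arr_by_weekend <- aggregate(ArrDelay ~ weekend, data = toy, FUN = mean)

print(arr_by_weekend)
```

Passing data = toy also gives the result plain column names (weekend, ArrDelay) instead of names prefixed with the data frame.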

Next, we examine the relationship between departure and arrival delays and the season.

We create a new column called ‘Season’ into which we copy the values of the column ‘Month’.

##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0
##      weekend Season
## 5424       1      1
## 5425       1      1
## 5426       0      1
## 5427       0      1
## 5428       0      1
## 5429       0      1

Now we recode the month numbers as season names.

#Assigning the spring months
hflights$Season [hflights$Season == 3] <- 'Spring'
hflights$Season [hflights$Season == 4] <- 'Spring'
hflights$Season [hflights$Season == 5] <- 'Spring'

#Assigning the summer months
hflights$Season [hflights$Season == 6] <- 'Summer'
hflights$Season [hflights$Season == 7] <- 'Summer'
hflights$Season [hflights$Season == 8] <- 'Summer'

#Assigning the fall months 
hflights$Season [hflights$Season == 9] <- 'Fall'
hflights$Season [hflights$Season == 10] <- 'Fall'
hflights$Season [hflights$Season == 11] <- 'Fall'

#Assigning the winter months
hflights$Season [hflights$Season == 12] <- 'Winter'
hflights$Season [hflights$Season == 1] <- 'Winter'
hflights$Season [hflights$Season == 2] <- 'Winter'

head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0
##      weekend Season
## 5424       1 Winter
## 5425       1 Winter
## 5426       0 Winter
## 5427       0 Winter
## 5428       0 Winter
## 5429       0 Winter
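The twelve assignments above map each month number to a season name. As a sketch of a more compact alternative, assuming a made-up vector of month numbers (month and season_by_month are illustrative names), a lookup vector indexed by month does the same mapping in one step.

```r
# Hypothetical month numbers standing in for hflights$Month
month <- c(1, 2, 3, 6, 9, 11, 12)

# Position i holds the season of month i (January = 1, ..., December = 12)
season_by_month <- c('Winter', 'Winter',
                     'Spring', 'Spring', 'Spring',
                     'Summer', 'Summer', 'Summer',
                     'Fall', 'Fall', 'Fall',
                     'Winter')

# Indexing the lookup vector by month number returns the season names
season <- season_by_month[month]

print(season)
```

Because the original ‘Month’ values are never overwritten, this approach also avoids comparing a column that is partly numbers and partly text.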

Now we compare the distribution of the dependent variable ‘DepDelay’ for the four levels of the categorical variable ‘Season’ by means of a boxplot.

In all four seasons the median departure delay was close to 0 minutes. All four distributions are right-skewed, with extreme outliers of up to almost 1000 minutes.

To get further information, we calculate the mean, standard deviation and median of the delays for each of the four seasons.

##   mean.Season mean.DepDelay sd.Season sd.DepDelay median.Season
## 1        Fall      6.626174      Fall    26.10013          Fall
## 2      Spring     10.812630    Spring    31.11001        Spring
## 3      Summer     10.757087    Summer    28.89615        Summer
## 4      Winter      9.403900    Winter    28.58172        Winter
##   median.DepDelay
## 1              -1
## 2               1
## 3               1
## 4               1
##   mean.Season mean.ArrDelay sd.Season sd.ArrDelay median.Season
## 1        Fall      3.729093      Fall    28.30416          Fall
## 2      Spring     10.676809    Spring    33.29790        Spring
## 3      Summer      8.315693    Summer    30.38325        Summer
## 4      Winter      5.381636    Winter    30.08715        Winter
##   median.ArrDelay
## 1              -2
## 2               2
## 3               1
## 4              -1
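The tables above combine three separate aggregations. The sketch below shows, on made-up delays (toy, summarise_delay and stats are hypothetical names), how the three statistics can be computed per season in a single pass. It also illustrates the right skew noted earlier: one large delay pulls the mean well above the median.

```r
# Hypothetical toy data: two seasons, each with one large delay
toy <- data.frame(
  Season   = c('Fall', 'Fall', 'Fall', 'Spring', 'Spring', 'Spring'),
  DepDelay = c(-1, 0, 10, 1, 2, 30)
)

# Mean, standard deviation and median of one group, ignoring missing values
summarise_delay <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE))
}

# Split the delays by season, summarise each group, one row per season
stats <- t(sapply(split(toy$DepDelay, toy$Season), summarise_delay))

print(stats)
```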

Part 4 - Inference

The mean arrival delay on weekdays was 7.34 minutes, the mean arrival delay on weekends was 6.39 minutes. The mean departure delay on weekdays was 9.66 minutes, the mean departure delay on weekends was 8.85 minutes.

Let's conduct a t-test to see whether these differences are statistically significant:

## 
##  Welch Two Sample t-test
## 
## data:  ArrDelay by weekend
## t = 6.6418, df = 109600, p-value = 3.112e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.670716 1.232285
## sample estimates:
## mean in group 0 mean in group 1 
##        7.344843        6.393342
## 
##  Welch Two Sample t-test
## 
## data:  DepDelay by weekend
## t = 6.0203, df = 109790, p-value = 1.746e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.5450792 1.0713123
## sample estimates:
## mean in group 0 mean in group 1 
##        9.657671        8.849475

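The arrival delays on weekdays were significantly longer than on weekends, t(109600) = 6.64, p < .001. The departure delays on weekdays were significantly longer than on weekends, t(109790) = 6.02, p < .001. In both cases the difference is small in absolute terms: the 95 percent confidence intervals run from 0.67 to 1.23 minutes for arrival delays and from 0.55 to 1.07 minutes for departure delays.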
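To make the reported statistics easier to interpret, the sketch below recomputes a Welch t statistic and its degrees of freedom by hand and compares them with t.test(). The two delay vectors are made up for illustration and are not taken from hflights. The degrees of freedom follow the Welch-Satterthwaite approximation, which depends on the sample variances; this is why the two tests above report slightly different degrees of freedom.

```r
# Hypothetical delays in minutes for two groups
weekday_delay <- c(12, 5, 9, 20, 3, 7)
weekend_delay <- c(4, 6, 2, 8, 5)

n1 <- length(weekday_delay)
n2 <- length(weekend_delay)

# Squared standard error of each group mean
v1 <- var(weekday_delay) / n1
v2 <- var(weekend_delay) / n2

# Welch t statistic: difference in means divided by its standard error
t_manual <- (mean(weekday_delay) - mean(weekend_delay)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation of the degrees of freedom
df_manual <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# t.test() uses the Welch test by default (var.equal = FALSE)
welch <- t.test(weekday_delay, weekend_delay)

print(c(t = t_manual, df = df_manual))
print(welch)
```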

For the relationship with season:

In all four seasons the mean departure delay was greater than the median departure delay. In spring, on average, the departure delays were greatest (M = 10.81, SD = 31.11). In fall, on average, the departure delays were smallest (M = 6.63, SD = 26.10).

In all four seasons the mean arrival delay was also greater than the median arrival delay. In spring, on average, the arrival delays were greatest (M = 10.68, SD = 33.30). In fall, on average, the arrival delays were smallest (M = 3.73, SD = 28.30).

One possible explanation is that more people travel in spring, but the data analyzed here do not test this.

Part 5 - Conclusion

For flights departing from the Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in 2011, fall appears to be the best season to travel, with the lowest average departure and arrival delays. In addition, travelling on weekends is, on average, slightly better than travelling on weekdays.

References

http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0

Appendix

# load Libraries
library(hflights)
library(ggplot2)
dim(hflights)
names(hflights)
# Returns 1 for Saturday/Sunday (DayOfWeek 6 or 7), 0 for Monday to Friday
# (DayOfWeek 1 to 5), and NA for any value outside 1 to 7
isweekend <- function (x) {
  if (x >= 1 && x <= 5) {
    output <- 0
  } else if (x == 6 || x == 7) {
    output <- 1
  } else {
    output <- NA
  }
  return (output)
}
hflights$weekend <- NA
#new column serves as container for new values
for (i in 1:nrow(hflights)) {
  x <- hflights$DayOfWeek [i]
  hflights$weekend [i] <- isweekend(x)
}

#Checking whether everything went fine

head(hflights)
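#Mean arrival delay on weekdays (0) and weekends (1)
aggregate(hflights$ArrDelay ~ hflights$weekend,
          FUN = mean,
          na.rm = TRUE)

#Mean departure delay on weekdays (0) and weekends (1)
aggregate(hflights$DepDelay ~ hflights$weekend,
          FUN = mean,
          na.rm = TRUE)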

hflights$Season <- hflights$Month

#Checking whether everything went fine
head(hflights)
#Assigning the spring months
hflights$Season [hflights$Season == 3] <- 'Spring'
hflights$Season [hflights$Season == 4] <- 'Spring'
hflights$Season [hflights$Season == 5] <- 'Spring'

#Assigning the summer months
hflights$Season [hflights$Season == 6] <- 'Summer'
hflights$Season [hflights$Season == 7] <- 'Summer'
hflights$Season [hflights$Season == 8] <- 'Summer'

#Assigning the fall months 
hflights$Season [hflights$Season == 9] <- 'Fall'
hflights$Season [hflights$Season == 10] <- 'Fall'
hflights$Season [hflights$Season == 11] <- 'Fall'

#Assigning the winter months
hflights$Season [hflights$Season == 12] <- 'Winter'
hflights$Season [hflights$Season == 1] <- 'Winter'
hflights$Season [hflights$Season == 2] <- 'Winter'

head(hflights)
# Season is stored as text, so the boxes are drawn in alphabetical order
# (Fall, Spring, Summer, Winter). The axis labels are taken from the data,
# and the border colours are listed in that same order.
boxplot(DepDelay ~ Season,
        data = hflights,
        main = 'Departure delay by season',
        xlab = 'Season',
        ylab = 'Departure delay [min]',
        border = c('orange', 'blue', 'green', 'grey'))
boxplot(ArrDelay ~ Season,
        data = hflights,
        main = 'Arrival delay by season',
        xlab = 'Season',
        ylab = 'Arrival delay [min]',
        border = c('orange', 'blue', 'green', 'grey'))
# Mean, standard deviation and median of the departure delay per season.
# The formula is passed unnamed because the name of the first argument of
# aggregate()'s formula method differs between R versions.
aggregated.mean.sd.median <- cbind(
  mean = aggregate(DepDelay ~ Season,
                   data = hflights,
                   FUN = mean,
                   na.rm = TRUE),
  sd = aggregate(DepDelay ~ Season,
                 data = hflights,
                 FUN = sd,
                 na.rm = TRUE),
  median = aggregate(DepDelay ~ Season,
                     data = hflights,
                     FUN = median,
                     na.rm = TRUE)
)

aggregated.mean.sd.median

# Same summary for the arrival delay
aggregated.mean.sd.median <- cbind(
  mean = aggregate(ArrDelay ~ Season,
                   data = hflights,
                   FUN = mean,
                   na.rm = TRUE),
  sd = aggregate(ArrDelay ~ Season,
                 data = hflights,
                 FUN = sd,
                 na.rm = TRUE),
  median = aggregate(ArrDelay ~ Season,
                     data = hflights,
                     FUN = median,
                     na.rm = TRUE)
)

aggregated.mean.sd.median
t.test(ArrDelay ~ weekend,
       data = hflights)
t.test(DepDelay ~ weekend,
       data = hflights)