Load Libraries:
# load Libraries
library(hflights)
library(ggplot2)
In this project we analyze a dataset containing all flights departing from the Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in 2011. The dataset is already available as an R package (hflights), so we use it directly. The data originally came from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics: http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0
We will examine whether departure and arrival delays are related to the day of the week (weekday versus weekend).
We will also examine whether departure and arrival delays are related to the season.
The dataset has 227496 rows and 21 columns. The column names are:
## [1] 227496 21
## [1] "Year" "Month" "DayofMonth"
## [4] "DayOfWeek" "DepTime" "ArrTime"
## [7] "UniqueCarrier" "FlightNum" "TailNum"
## [10] "ActualElapsedTime" "AirTime" "ArrDelay"
## [13] "DepDelay" "Origin" "Dest"
## [16] "Distance" "TaxiIn" "TaxiOut"
## [19] "Cancelled" "CancellationCode" "Diverted"
The data are obtained from the hflights package in R. The package documentation defines the variables as follows:
Year, Month, DayofMonth: date of departure
DayOfWeek: day of week of departure (useful for removing weekend effects)
DepTime, ArrTime: departure and arrival times (in local time, hhmm)
UniqueCarrier: unique abbreviation for a carrier
FlightNum: flight number
TailNum: airplane tail number
ActualElapsedTime: elapsed time of flight, in minutes
AirTime: flight time, in minutes
ArrDelay, DepDelay: arrival and departure delays, in minutes
Origin, Dest: origin and destination airport codes
Distance: distance of flight, in miles
TaxiIn, TaxiOut: taxi in and out times in minutes
Cancelled: cancelled indicator: 1 = Yes, 0 = No
CancellationCode: reason for cancellation: A = carrier, B = weather, C = national air system, D = security
Diverted: diverted indicator: 1 = Yes, 0 = No
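Because DepTime and ArrTime are stored as hhmm numbers rather than as times, they usually need decoding before any arithmetic. The following is a minimal sketch on toy values (the vector dep_time and the derived names are illustrative assumptions, not part of the dataset or of the original analysis):

```r
# Toy values in the hhmm format described above (not taken from the dataset).
dep_time <- c(1400, 1352, 905, 5)

# Integer division gives the hour, the remainder gives the minute.
dep_hour   <- dep_time %/% 100
dep_minute <- dep_time %% 100

# Minutes since midnight are convenient for arithmetic on clock times.
minutes_since_midnight <- dep_hour * 60 + dep_minute

data.frame(dep_time, dep_hour, dep_minute, minutes_since_midnight)
```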
First, we need to differentiate between weekdays and weekends. We therefore create a function called "isweekend" that indicates whether a flight took place on a weekend (1) or not (0).
Next, we create a new, empty column called "weekend", which will later indicate whether a flight took place on a weekend or not.
Now, we loop over the rows of the newly created column "weekend" and apply the function "isweekend" to each flight's DayOfWeek value.
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
## weekend
## 5424 1
## 5425 1
## 5426 0
## 5427 0
## 5428 0
## 5429 0
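The same flag can also be computed without a row-by-row loop. The sketch below is an alternative to the function-and-loop approach described above, not the code used for this report. It runs on a toy vector and assumes the coding seen in the output, where 1 January 2011 (a Saturday) has DayOfWeek 6, so 6 and 7 mark the weekend:

```r
# Toy day-of-week codes (1 = Monday, ..., 6 = Saturday, 7 = Sunday),
# plus one missing value to show how NA is treated.
day_of_week <- c(6, 7, 1, 2, 3, 4, 5, NA)

# %in% compares the whole vector at once: TRUE for 6 or 7, FALSE otherwise.
weekend <- as.integer(day_of_week %in% c(6, 7))

# %in% returns FALSE for NA, so missing days are restored explicitly.
weekend[is.na(day_of_week)] <- NA

weekend
```

On the full dataset the equivalent call would use hflights$DayOfWeek in place of the toy vector, which avoids looping over all 227496 rows.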
Now we compare the mean arrival/departure delays on weekdays and weekends.
## hflights$weekend hflights$ArrDelay
## 1 0 7.344843
## 2 1 6.393342
## hflights$weekend hflights$DepDelay
## 1 0 9.657671
## 2 1 8.849475
We create a new column called "Season" into which we copy the values of the column "Month".
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
## weekend Season
## 5424 1 1
## 5425 1 1
## 5426 0 1
## 5427 0 1
## 5428 0 1
## 5429 0 1
Now we recode the month numbers into season names.
#Assigning the spring months
hflights$Season [hflights$Season == 3] <- 'Spring'
hflights$Season [hflights$Season == 4] <- 'Spring'
hflights$Season [hflights$Season == 5] <- 'Spring'
#Assigning the summer months
hflights$Season [hflights$Season == 6] <- 'Summer'
hflights$Season [hflights$Season == 7] <- 'Summer'
hflights$Season [hflights$Season == 8] <- 'Summer'
#Assigning the fall months
hflights$Season [hflights$Season == 9] <- 'Fall'
hflights$Season [hflights$Season == 10] <- 'Fall'
hflights$Season [hflights$Season == 11] <- 'Fall'
#Assigning the winter months
hflights$Season [hflights$Season == 12] <- 'Winter'
hflights$Season [hflights$Season == 1] <- 'Winter'
hflights$Season [hflights$Season == 2] <- 'Winter'
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
## weekend Season
## 5424 1 Winter
## 5425 1 Winter
## 5426 0 Winter
## 5427 0 Winter
## 5428 0 Winter
## 5429 0 Winter
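The twelve assignments above can be condensed into a single lookup. This is a sketch of an alternative, shown on toy month values. The names season_by_month, month and season are illustrative assumptions, while the month-to-season mapping is the one used in this report:

```r
# Lookup vector indexed by month number (1 = January, ..., 12 = December),
# using the same month-to-season assignment as the code above.
season_by_month <- c("Winter", "Winter",
                     "Spring", "Spring", "Spring",
                     "Summer", "Summer", "Summer",
                     "Fall", "Fall", "Fall",
                     "Winter")

# Toy month values (not taken from the dataset).
month <- c(1, 3, 7, 10, 12)

# Indexing the lookup vector by month maps every value in one step;
# the factor levels fix the display order used in tables and plots.
season <- factor(season_by_month[month],
                 levels = c("Spring", "Summer", "Fall", "Winter"))
season
```

Storing the result as a factor with explicit levels also controls the order in which groups appear, which otherwise defaults to alphabetical for character data.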
Now we compare the distribution of the dependent variable "DepDelay" across the four levels of the categorical variable "Season" by means of a boxplot.
In all four seasons the median departure delay was around 0 minutes. All four distributions are also right-skewed, with extreme outliers of up to almost 1000 minutes.
To get further information, we calculate the mean, standard deviation and median of the delays for each of the four seasons.
## mean.Season mean.DepDelay sd.Season sd.DepDelay median.Season
## 1 Fall 6.626174 Fall 26.10013 Fall
## 2 Spring 10.812630 Spring 31.11001 Spring
## 3 Summer 10.757087 Summer 28.89615 Summer
## 4 Winter 9.403900 Winter 28.58172 Winter
## median.DepDelay
## 1 -1
## 2 1
## 3 1
## 4 1
## mean.Season mean.ArrDelay sd.Season sd.ArrDelay median.Season
## 1 Fall 3.729093 Fall 28.30416 Fall
## 2 Spring 10.676809 Spring 33.29790 Spring
## 3 Summer 8.315693 Summer 30.38325 Summer
## 4 Winter 5.381636 Winter 30.08715 Winter
## median.ArrDelay
## 1 -2
## 2 2
## 3 1
## 4 -1
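The tables above repeat the Season column because three separate aggregate() results are bound together. As a sketch of a more compact alternative, a single call can return all three statistics per group. The data frame toy and the helper delay_summary are hypothetical, and the values are invented for illustration rather than taken from hflights:

```r
# Toy data (values invented for illustration, not taken from hflights).
toy <- data.frame(
  Season   = c("Fall", "Fall", "Fall", "Spring", "Spring", "Spring"),
  DepDelay = c(-2, 0, 8, 1, 3, 20)
)

# One function returns all three statistics for a group.
delay_summary <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x))
}

# The formula is passed positionally; aggregate() applies delay_summary
# to DepDelay within each Season and stores the result as a matrix column.
stats <- aggregate(DepDelay ~ Season, data = toy, FUN = delay_summary)
stats
```

In the toy data, as in the real tables, a single large delay pulls each group's mean above its median, which is the signature of a right-skewed distribution.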
The mean arrival delay was 7.34 minutes on weekdays and 6.39 minutes on weekends. The mean departure delay was 9.66 minutes on weekdays and 8.85 minutes on weekends.
Let's conduct a t-test to see whether these differences are statistically significant:
##
## Welch Two Sample t-test
##
## data: ArrDelay by weekend
## t = 6.6418, df = 109600, p-value = 3.112e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.670716 1.232285
## sample estimates:
## mean in group 0 mean in group 1
## 7.344843 6.393342
##
## Welch Two Sample t-test
##
## data: DepDelay by weekend
## t = 6.0203, df = 109790, p-value = 1.746e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.5450792 1.0713123
## sample estimates:
## mean in group 0 mean in group 1
## 9.657671 8.849475
The arrival delays on weekdays were significantly longer than on weekends, t(109600) = 6.64, p < .001. The departure delays on weekdays were also significantly longer than on weekends, t(109790) = 6.02, p < .001.
For the relationship with season:
In all four seasons the mean departure delay was greater than the median departure delay. In spring, on average, the departure delays were greatest (M = 10.81, SD = 31.11). In fall, on average, the departure delays were smallest (M = 6.63, SD = 26.10).
In all four seasons the mean arrival delay was greater than the median arrival delay. In spring, on average, the arrival delays were greatest (M = 10.68, SD = 33.30). In fall, on average, the arrival delays were smallest (M = 3.73, SD = 28.30).
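To make clear what the Welch test reports, and why the two tests have slightly different degrees of freedom, the statistic can be reproduced by hand. The sketch below uses two small toy samples (values invented for illustration, not taken from hflights), and the variable names are assumptions:

```r
# Toy delays in minutes (invented for illustration, not taken from hflights).
weekday <- c(12, -3, 5, 40, 0, 7, 15, -1)
weekend <- c(4, -5, 2, 10, 0, -2)

# Welch t statistic: difference in means over its standard error,
# with each group's variance scaled by its own sample size.
v1 <- var(weekday) / length(weekday)
v2 <- var(weekend) / length(weekend)
t_stat <- (mean(weekday) - mean(weekend)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (v1 + v2)^2 /
  (v1^2 / (length(weekday) - 1) + v2^2 / (length(weekend) - 1))

# The built-in test should agree with the manual calculation.
res <- t.test(weekday, weekend)  # var.equal = FALSE is the default
c(t_manual = t_stat, t_builtin = unname(res$statistic),
  df_manual = df_welch, df_builtin = unname(res$parameter))
```

Because the degrees of freedom depend on the group variances, the arrival and departure tests do not share the same value even though they use the same flights.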
One possible explanation is heavier travel volume in spring, although this analysis does not test that.
Fall appears to be the best season to travel, with the smallest average departure and arrival delays. In addition, for travellers departing from the Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in 2011, average delays were shorter on weekends than on weekdays.
# load Libraries
library(hflights)
library(ggplot2)
dim(hflights)
names(hflights)
isweekend <- function (x) {
  # DayOfWeek coding: 1-5 = weekday, 6-7 = weekend; any other value stays NA
  output <- NA
  if (x >= 1 && x <= 5) {output <- 0}
  if (x == 6 || x == 7) {output <- 1}
  return (output)}
#new column serves as container for new values
hflights$weekend <- NA
for (i in 1:nrow(hflights)) {
  x <- hflights$DayOfWeek [i]
  weekend.i <- isweekend(x)
  hflights$weekend [i] <- weekend.i
}
#Checking whether everything went fine
head(hflights)
#Mean arrival and departure delay by weekend indicator
aggregate(hflights$ArrDelay ~ hflights$weekend,
  FUN = mean,
  na.rm = TRUE)
aggregate(hflights$DepDelay ~ hflights$weekend,
  FUN = mean,
  na.rm = TRUE)
hflights$Season <- hflights$Month
#Checking whether everything went fine
head(hflights)
#Assigning the spring months
hflights$Season [hflights$Season == 3] <- 'Spring'
hflights$Season [hflights$Season == 4] <- 'Spring'
hflights$Season [hflights$Season == 5] <- 'Spring'
#Assigning the summer months
hflights$Season [hflights$Season == 6] <- 'Summer'
hflights$Season [hflights$Season == 7] <- 'Summer'
hflights$Season [hflights$Season == 8] <- 'Summer'
#Assigning the fall months
hflights$Season [hflights$Season == 9] <- 'Fall'
hflights$Season [hflights$Season == 10] <- 'Fall'
hflights$Season [hflights$Season == 11] <- 'Fall'
#Assigning the winter months
hflights$Season [hflights$Season == 12] <- 'Winter'
hflights$Season [hflights$Season == 1] <- 'Winter'
hflights$Season [hflights$Season == 2] <- 'Winter'
head(hflights)
#Season is a character column, so boxplot() would order the groups
#alphabetically (Fall, Spring, Summer, Winter). A factor with explicit
#levels keeps the labels and border colours matched to the right boxes.
season.ordered <- factor(hflights$Season,
  levels = c('Spring', 'Summer', 'Fall', 'Winter'))
boxplot(hflights$DepDelay ~ season.ordered,
  main = 'Departure delay by season',
  xlab = 'Season',
  ylab = 'Departure delay [min]',
  border = c('blue', 'green', 'orange', 'grey'))
boxplot(hflights$ArrDelay ~ season.ordered,
  main = 'Arrival delay by season',
  xlab = 'Season',
  ylab = 'Arrival delay [min]',
  border = c('blue', 'green', 'orange', 'grey'))
#The formula is passed positionally so the calls do not depend on
#the name of the first argument of aggregate()
#Departure delay: mean, standard deviation and median by season
aggregated.mean.sd.median <- cbind(
  mean = aggregate(DepDelay ~ Season,
    data = hflights,
    FUN = mean,
    na.rm = TRUE),
  sd = aggregate(DepDelay ~ Season,
    data = hflights,
    FUN = sd,
    na.rm = TRUE),
  median = aggregate(DepDelay ~ Season,
    data = hflights,
    FUN = median,
    na.rm = TRUE)
)
aggregated.mean.sd.median
#Arrival delay: mean, standard deviation and median by season
aggregated.mean.sd.median <- cbind(
  mean = aggregate(ArrDelay ~ Season,
    data = hflights,
    FUN = mean,
    na.rm = TRUE),
  sd = aggregate(ArrDelay ~ Season,
    data = hflights,
    FUN = sd,
    na.rm = TRUE),
  median = aggregate(ArrDelay ~ Season,
    data = hflights,
    FUN = median,
    na.rm = TRUE)
)
aggregated.mean.sd.median
t.test(ArrDelay ~ weekend,
data = hflights)
t.test(DepDelay ~ weekend,
data = hflights)