This document pertains to the final Project for the R_Bridge Session. We will be exploring the hflights data from the hflights package. All graphs will be done in ggplot2 so this package is also required.
#install.packages("hflights")
require(hflights)
## Loading required package: hflights
require(ggplot2)
## Loading required package: ggplot2
The data structure can be found at: https://cran.r-project.org/web/packages/hflights/hflights.pdf
For this analysis, we will consider only a subset of the data:
Only the top 4 carriers will be considered. Using a quick histogram based on “UniqueCarrier” (see graph below), it is clear that there are only 4 significant carriers: “XE”, “CO”, “WN”, and, to a lesser degree, “OO”.
Furthermore, information that identifies a particular flight or plane is not relevant to our analysis. Hence, we will remove the following columns: “FlightNum” and “TailNum”.
Again, with a quick look at the data pertaining to cancellation, it is clear that very few flights are cancelled for any reason. We will therefore filter out flights with a cancelled indicator = 1. Once these flights are removed, the “Cancelled” and “CancellationCode” columns carry no further information, so they are dropped as well.
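The two checks described above (flight volume per carrier and the share of cancelled flights) can also be done numerically. The following is only a sketch: `toy` is an invented data frame standing in for hflights so that the code runs without the package, and only the column names UniqueCarrier and Cancelled come from the data set.

```r
# Toy stand-in for hflights; all values are invented for illustration
toy <- data.frame(
  UniqueCarrier = rep(c("XE", "CO", "WN", "OO", "AA"), times = c(5, 4, 3, 2, 1)),
  Cancelled     = c(1, rep(0, 14)),
  stringsAsFactors = FALSE
)

# Number of flights per carrier, largest first
carrier_counts <- sort(table(toy$UniqueCarrier), decreasing = TRUE)
print(carrier_counts)

# Names of the four carriers with the most flights
top4 <- names(carrier_counts)[1:4]

# Share of flights flagged as cancelled (Cancelled is coded 0/1)
cancel_rate <- mean(toy$Cancelled == 1)
print(cancel_rate)
```

Replacing `toy` with hflights should give the counts behind the carrier selection and the cancellation rate for the full data.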
We will focus our analysis on delays and whether they relate to the time of year. To this effect, we will create an additional column based on departure date that indicates the season, and columns based on departure and arrival time that indicate the time of day/night.
Note: We will perform all transformations and filtering on hflights_t, a copy of the original data frame hflights.
hflights_t <- hflights
hflights_t <- hflights_t[hflights_t$UniqueCarrier %in% c("XE", "CO", "WN", "OO"), ]
hflights_t <- hflights_t[hflights_t$Cancelled == 0, ]
drop_columns <- c("FlightNum", "TailNum", "Cancelled", "CancellationCode")
hflights_t <- hflights_t[, !(names(hflights_t) %in% drop_columns)]
TimeOfDay groups: EarlyMorning, Morning, Afternoon, LateAfternoon, Evening, Night
from 0401 to 0800 -> EarlyMorning
from 0801 to 1200 -> Morning
from 1201 to 1600 -> Afternoon
from 1601 to 2000 -> LateAfternoon
from 2001 to 2400 -> Evening
from 0001 to 0400 -> Night
We will map both DepTime and ArrTime. To do so, we write a function and then apply it to the DepTime and ArrTime columns.
# Function to map a DepTime or ArrTime value (hhmm) to a time-of-day group
timeofday <- function(hhmm){
  if (is.na(hhmm)){
    timeofday_f <- "Unknown"
  } else if (hhmm >= 1 && hhmm <= 400){
    timeofday_f <- "Night"
  } else if (hhmm <= 800){
    timeofday_f <- "EarlyMorning"
  } else if (hhmm <= 1200){
    timeofday_f <- "Morning"
  } else if (hhmm <= 1600){
    timeofday_f <- "Afternoon"
  } else if (hhmm <= 2000){
    timeofday_f <- "LateAfternoon"
  } else if (hhmm <= 2400){
    timeofday_f <- "Evening"
  } else {
    timeofday_f <- "Unknown"
  }
  return(as.factor(timeofday_f))
}
hflights_t$DepTimeGroup <- mapply(timeofday, hflights_t$DepTime)
hflights_t$ArrTimeGroup <- mapply(timeofday, hflights_t$ArrTime)
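The element-by-element mapping above can also be written in vectorised form with `cut()`. This is a sketch of an alternative, not the method used in this analysis: the helper name `time_of_day_cut` is hypothetical, and unlike `timeofday` it sends a value of 0 to "Unknown" rather than "EarlyMorning".

```r
# Hypothetical vectorised helper; the name time_of_day_cut is an assumption
time_of_day_cut <- function(hhmm) {
  labels <- c("Night", "EarlyMorning", "Morning",
              "Afternoon", "LateAfternoon", "Evening")
  # Right-closed intervals: (0,400], (400,800], ..., (2000,2400]
  grp <- cut(hhmm,
             breaks = c(0, 400, 800, 1200, 1600, 2000, 2400),
             labels = labels)
  grp <- as.character(grp)
  # Missing or out-of-range times become "Unknown"
  grp[is.na(grp)] <- "Unknown"
  factor(grp, levels = c(labels, "Unknown"))
}

# Example call on a few hhmm values, including a missing one
print(time_of_day_cut(c(30, 745, 1201, 2359, NA)))
```

Because `cut()` works on the whole vector at once, it could be applied directly to a column such as hflights_t$DepTime without `mapply`.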
# Map departure month/day to season
monthdaytoseason <- function(month, day){
  if (month == 1 || month == 2){
    season_f <- "Winter"
  } else if (month == 3 && day <= 19){
    season_f <- "Winter"
  } else if (month == 3 && day >= 20){
    season_f <- "Spring"
  } else if (month == 4 || month == 5){
    season_f <- "Spring"
  } else if (month == 6 && day <= 19){
    season_f <- "Spring"
  } else if (month == 6 && day >= 20){
    season_f <- "Summer"
  } else if (month == 7 || month == 8){
    season_f <- "Summer"
  } else if (month == 9 && day <= 22){
    season_f <- "Summer"
  } else if (month == 9 && day >= 23){
    season_f <- "Fall"
  } else if (month == 10 || month == 11){
    season_f <- "Fall"
  } else if (month == 12 && day <= 20){
    season_f <- "Fall"
  } else if (month == 12 && day >= 21){
    season_f <- "Winter"
  } else {
    season_f <- "Unknown"
  }
  return(as.factor(season_f))
}
hflights_t$Season <- mapply(monthdaytoseason, hflights_t$Month, hflights_t$DayofMonth)
hflights_t$DepPos <- hflights_t$DepDelay >= 0
hflights_t$ArrPos <- hflights_t$ArrDelay >= 0
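The month/day-to-season mapping used above can be condensed by encoding each date as a single number (month * 100 + day) and looking up the season start dates with `findInterval()`. This is only a sketch of an equivalent approach: the helper name `season_from_date` is hypothetical, and the cut-off dates are the ones used in `monthdaytoseason`.

```r
# Hypothetical vectorised helper; the name season_from_date is an assumption
season_from_date <- function(month, day) {
  # Encode the date as one number, e.g. 20 March -> 320
  md <- month * 100 + day
  # Season start dates: 20 Mar, 20 Jun, 23 Sep, 21 Dec
  idx <- findInterval(md, c(320, 620, 923, 1221))  # gives 0..4
  seasons <- c("Winter", "Spring", "Summer", "Fall", "Winter")
  factor(seasons[idx + 1], levels = c("Winter", "Spring", "Summer", "Fall"))
}

# Example call: one date on each side of the spring and fall boundaries
print(season_from_date(c(3, 3, 9, 9), c(19, 20, 22, 23)))
```

Since the helper is vectorised, it could be called as `season_from_date(hflights_t$Month, hflights_t$DayofMonth)` in place of the `mapply` call.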
summary(hflights)
## Year Month DayofMonth DayOfWeek
## Min. :2011 Min. : 1.000 Min. : 1.00 Min. :1.000
## 1st Qu.:2011 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2.000
## Median :2011 Median : 7.000 Median :16.00 Median :4.000
## Mean :2011 Mean : 6.514 Mean :15.74 Mean :3.948
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:6.000
## Max. :2011 Max. :12.000 Max. :31.00 Max. :7.000
##
## DepTime ArrTime UniqueCarrier FlightNum
## Min. : 1 Min. : 1 Length:227496 Min. : 1
## 1st Qu.:1021 1st Qu.:1215 Class :character 1st Qu.: 855
## Median :1416 Median :1617 Mode :character Median :1696
## Mean :1396 Mean :1578 Mean :1962
## 3rd Qu.:1801 3rd Qu.:1953 3rd Qu.:2755
## Max. :2400 Max. :2400 Max. :7290
## NA's :2905 NA's :3066
## TailNum ActualElapsedTime AirTime ArrDelay
## Length:227496 Min. : 34.0 Min. : 11.0 Min. :-70.000
## Class :character 1st Qu.: 77.0 1st Qu.: 58.0 1st Qu.: -8.000
## Mode :character Median :128.0 Median :107.0 Median : 0.000
## Mean :129.3 Mean :108.1 Mean : 7.094
## 3rd Qu.:165.0 3rd Qu.:141.0 3rd Qu.: 11.000
## Max. :575.0 Max. :549.0 Max. :978.000
## NA's :3622 NA's :3622 NA's :3622
## DepDelay Origin Dest Distance
## Min. :-33.000 Length:227496 Length:227496 Min. : 79.0
## 1st Qu.: -3.000 Class :character Class :character 1st Qu.: 376.0
## Median : 0.000 Mode :character Mode :character Median : 809.0
## Mean : 9.445 Mean : 787.8
## 3rd Qu.: 9.000 3rd Qu.:1042.0
## Max. :981.000 Max. :3904.0
## NA's :2905
## TaxiIn TaxiOut Cancelled CancellationCode
## Min. : 1.000 Min. : 1.00 Min. :0.00000 Length:227496
## 1st Qu.: 4.000 1st Qu.: 10.00 1st Qu.:0.00000 Class :character
## Median : 5.000 Median : 14.00 Median :0.00000 Mode :character
## Mean : 6.099 Mean : 15.09 Mean :0.01307
## 3rd Qu.: 7.000 3rd Qu.: 18.00 3rd Qu.:0.00000
## Max. :165.000 Max. :163.00 Max. :1.00000
## NA's :3066 NA's :2947
## Diverted
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.000000
## Mean :0.002853
## 3rd Qu.:0.000000
## Max. :1.000000
##
summary(hflights_t)
## Year Month DayofMonth DayOfWeek
## Min. :2011 Min. : 1.000 Min. : 1.00 Min. :1.000
## 1st Qu.:2011 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2.000
## Median :2011 Median : 7.000 Median :16.00 Median :4.000
## Mean :2011 Mean : 6.516 Mean :15.77 Mean :3.946
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:6.000
## Max. :2011 Max. :12.000 Max. :31.00 Max. :7.000
##
## DepTime ArrTime UniqueCarrier ActualElapsedTime
## Min. : 1 Min. : 1 Length:201955 Min. : 34.0
## 1st Qu.:1032 1st Qu.:1230 Class :character 1st Qu.: 76.0
## Median :1423 Median :1624 Mode :character Median :127.0
## Mean :1416 Mean :1592 Mean :128.7
## 3rd Qu.:1810 3rd Qu.:1958 3rd Qu.:166.0
## Max. :2400 Max. :2400 Max. :575.0
## NA's :83 NA's :596
## AirTime ArrDelay DepDelay Origin
## Min. : 11.0 Min. :-70.000 Min. :-33.000 Length:201955
## 1st Qu.: 57.0 1st Qu.: -7.000 1st Qu.: -2.000 Class :character
## Median :106.0 Median : 0.000 Median : 1.000 Mode :character
## Mean :107.8 Mean : 7.374 Mean : 9.605
## 3rd Qu.:142.0 3rd Qu.: 11.000 3rd Qu.: 10.000
## Max. :549.0 Max. :957.000 Max. :981.000
## NA's :596 NA's :596
## Dest Distance TaxiIn TaxiOut
## Length:201955 Min. : 79.0 Min. : 1.000 Min. : 1.00
## Class :character 1st Qu.: 376.0 1st Qu.: 4.000 1st Qu.: 10.00
## Mode :character Median : 802.0 Median : 5.000 Median : 14.00
## Mean : 787.6 Mean : 5.763 Mean : 15.16
## 3rd Qu.:1076.0 3rd Qu.: 7.000 3rd Qu.: 18.00
## Max. :3904.0 Max. :140.000 Max. :163.00
## NA's :83
## Diverted DepTimeGroup ArrTimeGroup
## Min. :0.000000 Morning :50580 Afternoon :48942
## 1st Qu.:0.000000 LateAfternoon:53270 LateAfternoon:57409
## Median :0.000000 Afternoon :56067 Evening :49818
## Mean :0.002951 EarlyMorning :18932 Morning :40933
## 3rd Qu.:0.000000 Evening :22968 Night : 2753
## Max. :1.000000 Night : 138 EarlyMorning : 2017
## Unknown : 83
## Season DepPos ArrPos
## Winter:48435 Mode :logical Mode :logical
## Spring:51148 FALSE:82981 FALSE:96353
## Summer:54698 TRUE :118974 TRUE :105006
## Fall :47674 NA's :0 NA's :596
##
##
##
We will analyse the data with respect to departure delay and seasons.
g3 <- ggplot(hflights_t, aes(x=Season, y=DepDelay, fill=DepPos)) + geom_bar(stat="identity", position = "identity") + scale_fill_manual(values=c("#CCEEFF", "#FFDDDD"), guide = "none")
g3
For each season, it is clear that there are long delays. However, since this graph maps the values and not the counts, it is difficult to identify where the data is clustered.
The next graph is a violin plot, with one violin per season. We specify scale = "count" so that the area of each violin reflects the number of observations. We opted for a violin plot after trying a boxplot and realizing that the presence of outliers makes the graphing difficult. It also made clear that the first graph is not a good representation of the data.
g4 <- ggplot(hflights_t, aes(x=Season, y=DepDelay)) + geom_violin(scale = "count")
g4
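The outlier problem mentioned above can be made concrete with per-season quantiles: a single very long delay pushes the maximum far beyond the quartiles, which is what compresses the box in a boxplot. The sketch below uses an invented data frame `toy`; only the column names Season and DepDelay come from the analysis.

```r
# Toy data; values are invented, with one extreme delay in Winter
toy <- data.frame(
  Season   = rep(c("Winter", "Spring"), each = 5),
  DepDelay = c(-3, 0, 2, 5, 300, -5, -1, 0, 4, 8)
)

# Quartiles and maximum of DepDelay for each season (one column per season)
delay_quantiles <- sapply(split(toy$DepDelay, toy$Season),
                          quantile, probs = c(0.25, 0.5, 0.75, 1), na.rm = TRUE)
print(delay_quantiles)
```

Running the same call on hflights_t would show how far the maximum delays sit from the bulk of the distribution in each season.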
From this graph, it appears that there is no significant difference in departure delay (positive: actual delay; negative: leaving early) across seasons. The next graph is a histogram of the departure delay faceted by season.
g5 <- ggplot(hflights_t, aes(x=DepDelay)) + geom_histogram(binwidth = 10, fill = "pink", colour = "black") + facet_grid(Season ~.)
g5
Here again, the graph shows that seasons do not play a very significant role in the distribution of the data. If anything, spring appears to have a slightly higher number of delays. The binwidth for this graph is 10; since the delays are in minutes, this seemed a reasonable width to use. To continue our exploration of the data, we will superimpose the carriers to determine whether they are a factor in the departure delays.
g6 <- ggplot(hflights_t, aes(x=DepDelay, fill=as.factor(UniqueCarrier))) + geom_histogram(binwidth = 10, position = "identity", alpha = 0.2) + facet_grid(Season ~.)
g6
In this histogram, the carriers are represented with different colours, and the data is overlaid rather than stacked.
Finally, we will look at the histogram without the season breakdown but keeping the carriers.
g7 <- ggplot(hflights_t, aes(x=DepDelay, fill=as.factor(UniqueCarrier))) + geom_histogram(binwidth = 10, position = "identity", alpha = 0.2)
g7
g8 <- ggplot(hflights_t, aes(x=DepDelay)) + geom_histogram(binwidth = 10, fill = "pink", colour = "black") + facet_grid(UniqueCarrier ~ .)
g8
From this graph, it appears that the carrier has an impact on the departure delay. However, we would need to study the data based on frequency (i.e. number of delays / total flights), because the carriers differ in volume: “XE” has the most flights, followed by “CO”, “WN”, and then “OO”. This will be followed up in a different analysis.
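The frequency measure mentioned above (number of delays / total flights) could be computed per carrier along the following lines. This is a sketch under stated assumptions: `toy` is an invented data frame, and a flight is counted as delayed when DepDelay > 0, which is stricter than the DepPos flag (DepDelay >= 0) defined earlier.

```r
# Toy data; values are invented for illustration
toy <- data.frame(
  UniqueCarrier = c("XE", "XE", "XE", "XE", "CO", "CO", "WN", "WN", "OO", "OO"),
  DepDelay      = c(12, -2, 0, 30, 5, -4, NA, 7, -1, -3)
)

# Assumption: a flight is "delayed" when it departs later than scheduled
delayed <- toy$DepDelay > 0

# Share of delayed flights per carrier, ignoring missing delays
delay_freq <- tapply(delayed, toy$UniqueCarrier, mean, na.rm = TRUE)
print(delay_freq)
```

Because the result is a proportion, carriers with very different numbers of flights can be compared directly.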
Thank you.