This document pertains to the final Project for the R_Bridge Session. We will be exploring the hflights data from the hflights package. All graphs will be done in ggplot2 so this package is also required.

#install.packages("hflights")
require(hflights)
## Loading required package: hflights
require(ggplot2)
## Loading required package: ggplot2

Data Structure:

The data structure can be found at: https://cran.r-project.org/web/packages/hflights/hflights.pdf

For this analysis, we will consider only a subset of the data:

Only the top 4 carriers will be considered. Using a quick histogram based on “UniqueCarrier” (see graph below), it is clear that there are only 4 significant carriers: “XE”, “CO”, “WN”, and to a lesser degree “OO”.
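The carrier ranking can also be checked numerically with a frequency table. The sketch below is only an illustration: the helper name top_carriers and the toy vector are assumptions, and the toy counts are not the real hflights counts.

```r
# Hypothetical helper: return the n most frequent values of a vector.
top_carriers <- function(carrier, n = 4) {
  counts <- sort(table(carrier), decreasing = TRUE)  # most frequent first
  head(names(counts), n)
}

# Toy stand-in for hflights$UniqueCarrier (made-up counts, no ties).
toy_carrier <- rep(c("XE", "CO", "WN", "OO", "AA"), times = c(5, 4, 3, 2, 1))
top4 <- top_carriers(toy_carrier)
print(top4)

# On the real data this would be called as: top_carriers(hflights$UniqueCarrier)
```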

Furthermore, information that identifies a particular flight or plane is not relevant to our analysis. Hence, we will remove the following columns: “FlightNum” and “TailNum”.

A quick look at the cancellation data shows that very few flights are cancelled for any reason (about 1.3%, per the mean of Cancelled in the summary statistics below). We will therefore filter out flights with a cancelled indicator = 1.

We will focus our analysis on delays and whether they relate to the time of year. To this effect, we will create an additional column based on departure date that indicates the season, and one based on time of departure that indicates the time of day/night.

Note: We will perform all transformations and filtering on hflights_t, a copy of the original dataframe hflights.

Data Transformation:

  1. Filter only data belonging to UniqueCarrier in (“XE”, “CO”, “WN”, “OO”)
hflights_t <- hflights

hflights_t <- hflights_t[hflights_t$UniqueCarrier %in% c("XE", "CO", "WN", "OO"), ]
  1. Filter out any rows pertaining to a Cancellation; keep Cancelled = 0
hflights_t <- hflights_t[hflights_t$Cancelled == 0, ]
  1. Drop unwanted columns: We will drop the following columns from our hflights_t dataframe: FlightNum, TailNum, Cancelled, CancellationCode
drop_columns <- c("FlightNum", "TailNum", "Cancelled", "CancellationCode")
hflights_t <- hflights_t[, !(names(hflights_t) %in% drop_columns)]
  1. Add columns: one for “Season” and one for “TimeOfDay”

Season: Winter, Spring, Summer, Fall
  from 01/01/2011 to 03/19/2011 -> Winter
  from 03/20/2011 to 06/19/2011 -> Spring
  from 06/20/2011 to 09/22/2011 -> Summer
  from 09/23/2011 to 12/20/2011 -> Fall
  from 12/21/2011 to 12/31/2011 -> Winter

TimeOfDay: Night, EarlyMorning, Morning, Afternoon, LateAfternoon, Evening
  from 0001 to 0400 -> Night
  from 0401 to 0800 -> EarlyMorning
  from 0801 to 1200 -> Morning
  from 1201 to 1600 -> Afternoon
  from 1601 to 2000 -> LateAfternoon
  from 2001 to 2400 -> Evening
  missing or out-of-range times -> Unknown

We will map both DepTime and ArrTime to a time-of-day group. To do so, we write one function and apply it to each of the two columns.

# function to map DepTime or ArrTime to group
timeofday <- function(hhmm){
  if (is.na(hhmm)){
    timeofday_f <- "Unknown"
  } else if(hhmm >= 1 && hhmm <= 400){
    timeofday_f <- "Night"
  } else if (hhmm <= 800){
    timeofday_f <- "EarlyMorning"
  } else if (hhmm <= 1200){
    timeofday_f <- "Morning"
  } else if (hhmm <= 1600){
    timeofday_f <- "Afternoon"
  } else if (hhmm <= 2000){
    timeofday_f <- "LateAfternoon"
  } else if (hhmm <= 2400){
    timeofday_f <- "Evening"
  } else{
    timeofday_f <- "Unknown"
  }
  return(as.factor(timeofday_f))
}
hflights_t$DepTimeGroup <- mapply(timeofday, hflights_t$DepTime)
hflights_t$ArrTimeGroup <- mapply(timeofday, hflights_t$ArrTime)
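
The same mapping can be done without an element-by-element call. The following is a minimal sketch using cut(), assuming the interval boundaries listed above; the name time_of_day_cut is hypothetical. It differs from timeofday() in two small ways: a value of 0 maps to "Unknown" rather than "EarlyMorning", and the factor levels are in chronological order rather than order of first appearance.

```r
# Hypothetical vectorized alternative to timeofday().
time_of_day_cut <- function(hhmm) {
  labels <- c("Night", "EarlyMorning", "Morning",
              "Afternoon", "LateAfternoon", "Evening")
  # right = TRUE gives intervals (0,400], (400,800], ..., (2000,2400]
  grp <- as.character(cut(hhmm,
                          breaks = c(0, 400, 800, 1200, 1600, 2000, 2400),
                          labels = labels, right = TRUE))
  grp[is.na(grp)] <- "Unknown"  # missing or out-of-range times
  factor(grp, levels = c(labels, "Unknown"))
}

# Toy input covering each boundary, a missing value and an invalid time.
toy_times <- c(1, 400, 401, 800, 1200, 1201, 2000, 2001, 2400, NA, 2500)
toy_groups <- time_of_day_cut(toy_times)
print(table(toy_groups))
```

Applied to the report's data, this would read hflights_t$DepTimeGroup <- time_of_day_cut(hflights_t$DepTime).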

# Map departure Month/Day to Season

monthdaytoseason <- function(month, day){
  if (month == 1 || month == 2){
    season_f <-"Winter"
  } else if (month == 3 && day <= 19){
    season_f <- "Winter"
  } else if (month == 3 && day >= 20){
    season_f <- "Spring"
  } else if (month == 4 || month == 5){
    season_f <- "Spring"
  } else if (month == 6 && day <= 19){
    season_f <- "Spring"
  } else if (month == 6 && day >= 20){
    season_f <- "Summer"
  } else if (month == 7 || month == 8){
    season_f <- "Summer"
  } else if (month == 9 && day <= 22){
    season_f <- "Summer"
  } else if (month == 9 && day >= 23){
    season_f <- "Fall"
  } else if (month == 10 || month == 11){
    season_f <- "Fall"
  } else if (month == 12 && day <= 20){
    season_f <- "Fall"
  } else if (month == 12 && day >= 21){
    season_f <- "Winter"
  } else{
    season_f <- "Unknown"
  }
  return(as.factor(season_f))
}
hflights_t$Season <- mapply(monthdaytoseason, hflights_t$Month, hflights_t$DayofMonth)
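
The season mapping can also be written in vectorized form by encoding month and day as a single number (month * 100 + day) and locating it among the season boundaries. This is a sketch under the date ranges listed above; the name season_from_md is hypothetical, and unlike monthdaytoseason() it does not flag invalid months or days as "Unknown".

```r
# Hypothetical vectorized alternative to monthdaytoseason().
season_from_md <- function(month, day) {
  md <- month * 100 + day  # e.g. March 20 -> 320
  # Boundaries: 03/20 (Spring), 06/20 (Summer), 09/23 (Fall), 12/21 (Winter)
  idx <- findInterval(md, c(320, 620, 923, 1221))  # 0..4
  labels <- c("Winter", "Spring", "Summer", "Fall", "Winter")
  factor(labels[idx + 1], levels = c("Winter", "Spring", "Summer", "Fall"))
}

# Toy input: the day before and the day of each season change.
toy_month <- c(1, 3, 3, 6, 6, 9, 9, 12, 12)
toy_day   <- c(1, 19, 20, 19, 20, 22, 23, 20, 21)
toy_season <- season_from_md(toy_month, toy_day)
print(toy_season)
```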
  1. Add DepPos and ArrPos columns. These columns are set to TRUE if the corresponding delay is >= 0 and to FALSE otherwise.
hflights_t$DepPos <- hflights_t$DepDelay >= 0
hflights_t$ArrPos <- hflights_t$ArrDelay >= 0

Summary Statistics:

  1. Statistics on original data
summary(hflights)
##       Year          Month          DayofMonth      DayOfWeek    
##  Min.   :2011   Min.   : 1.000   Min.   : 1.00   Min.   :1.000  
##  1st Qu.:2011   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000  
##  Median :2011   Median : 7.000   Median :16.00   Median :4.000  
##  Mean   :2011   Mean   : 6.514   Mean   :15.74   Mean   :3.948  
##  3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:6.000  
##  Max.   :2011   Max.   :12.000   Max.   :31.00   Max.   :7.000  
##                                                                 
##     DepTime        ArrTime     UniqueCarrier        FlightNum   
##  Min.   :   1   Min.   :   1   Length:227496      Min.   :   1  
##  1st Qu.:1021   1st Qu.:1215   Class :character   1st Qu.: 855  
##  Median :1416   Median :1617   Mode  :character   Median :1696  
##  Mean   :1396   Mean   :1578                      Mean   :1962  
##  3rd Qu.:1801   3rd Qu.:1953                      3rd Qu.:2755  
##  Max.   :2400   Max.   :2400                      Max.   :7290  
##  NA's   :2905   NA's   :3066                                    
##    TailNum          ActualElapsedTime    AirTime         ArrDelay      
##  Length:227496      Min.   : 34.0     Min.   : 11.0   Min.   :-70.000  
##  Class :character   1st Qu.: 77.0     1st Qu.: 58.0   1st Qu.: -8.000  
##  Mode  :character   Median :128.0     Median :107.0   Median :  0.000  
##                     Mean   :129.3     Mean   :108.1   Mean   :  7.094  
##                     3rd Qu.:165.0     3rd Qu.:141.0   3rd Qu.: 11.000  
##                     Max.   :575.0     Max.   :549.0   Max.   :978.000  
##                     NA's   :3622      NA's   :3622    NA's   :3622     
##     DepDelay          Origin              Dest              Distance     
##  Min.   :-33.000   Length:227496      Length:227496      Min.   :  79.0  
##  1st Qu.: -3.000   Class :character   Class :character   1st Qu.: 376.0  
##  Median :  0.000   Mode  :character   Mode  :character   Median : 809.0  
##  Mean   :  9.445                                         Mean   : 787.8  
##  3rd Qu.:  9.000                                         3rd Qu.:1042.0  
##  Max.   :981.000                                         Max.   :3904.0  
##  NA's   :2905                                                            
##      TaxiIn           TaxiOut         Cancelled       CancellationCode  
##  Min.   :  1.000   Min.   :  1.00   Min.   :0.00000   Length:227496     
##  1st Qu.:  4.000   1st Qu.: 10.00   1st Qu.:0.00000   Class :character  
##  Median :  5.000   Median : 14.00   Median :0.00000   Mode  :character  
##  Mean   :  6.099   Mean   : 15.09   Mean   :0.01307                     
##  3rd Qu.:  7.000   3rd Qu.: 18.00   3rd Qu.:0.00000                     
##  Max.   :165.000   Max.   :163.00   Max.   :1.00000                     
##  NA's   :3066      NA's   :2947                                         
##     Diverted       
##  Min.   :0.000000  
##  1st Qu.:0.000000  
##  Median :0.000000  
##  Mean   :0.002853  
##  3rd Qu.:0.000000  
##  Max.   :1.000000  
## 
  1. Statistics on transformed data
summary(hflights_t)
##       Year          Month          DayofMonth      DayOfWeek    
##  Min.   :2011   Min.   : 1.000   Min.   : 1.00   Min.   :1.000  
##  1st Qu.:2011   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000  
##  Median :2011   Median : 7.000   Median :16.00   Median :4.000  
##  Mean   :2011   Mean   : 6.516   Mean   :15.77   Mean   :3.946  
##  3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:6.000  
##  Max.   :2011   Max.   :12.000   Max.   :31.00   Max.   :7.000  
##                                                                 
##     DepTime        ArrTime     UniqueCarrier      ActualElapsedTime
##  Min.   :   1   Min.   :   1   Length:201955      Min.   : 34.0    
##  1st Qu.:1032   1st Qu.:1230   Class :character   1st Qu.: 76.0    
##  Median :1423   Median :1624   Mode  :character   Median :127.0    
##  Mean   :1416   Mean   :1592                      Mean   :128.7    
##  3rd Qu.:1810   3rd Qu.:1958                      3rd Qu.:166.0    
##  Max.   :2400   Max.   :2400                      Max.   :575.0    
##                 NA's   :83                        NA's   :596      
##     AirTime         ArrDelay          DepDelay          Origin         
##  Min.   : 11.0   Min.   :-70.000   Min.   :-33.000   Length:201955     
##  1st Qu.: 57.0   1st Qu.: -7.000   1st Qu.: -2.000   Class :character  
##  Median :106.0   Median :  0.000   Median :  1.000   Mode  :character  
##  Mean   :107.8   Mean   :  7.374   Mean   :  9.605                     
##  3rd Qu.:142.0   3rd Qu.: 11.000   3rd Qu.: 10.000                     
##  Max.   :549.0   Max.   :957.000   Max.   :981.000                     
##  NA's   :596     NA's   :596                                           
##      Dest              Distance          TaxiIn           TaxiOut      
##  Length:201955      Min.   :  79.0   Min.   :  1.000   Min.   :  1.00  
##  Class :character   1st Qu.: 376.0   1st Qu.:  4.000   1st Qu.: 10.00  
##  Mode  :character   Median : 802.0   Median :  5.000   Median : 14.00  
##                     Mean   : 787.6   Mean   :  5.763   Mean   : 15.16  
##                     3rd Qu.:1076.0   3rd Qu.:  7.000   3rd Qu.: 18.00  
##                     Max.   :3904.0   Max.   :140.000   Max.   :163.00  
##                                      NA's   :83                        
##     Diverted               DepTimeGroup          ArrTimeGroup  
##  Min.   :0.000000   Morning      :50580   Afternoon    :48942  
##  1st Qu.:0.000000   LateAfternoon:53270   LateAfternoon:57409  
##  Median :0.000000   Afternoon    :56067   Evening      :49818  
##  Mean   :0.002951   EarlyMorning :18932   Morning      :40933  
##  3rd Qu.:0.000000   Evening      :22968   Night        : 2753  
##  Max.   :1.000000   Night        :  138   EarlyMorning : 2017  
##                                           Unknown      :   83  
##     Season        DepPos          ArrPos       
##  Winter:48435   Mode :logical   Mode :logical  
##  Spring:51148   FALSE:82981     FALSE:96353    
##  Summer:54698   TRUE :118974    TRUE :105006   
##  Fall  :47674   NA's :0         NA's :596      
##                                                
##                                                
## 
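
The overall summary does not break the delays down by season. A per-group table of quartiles is one way to do that numerically; the sketch below is illustrative only, with a hypothetical helper name and toy data that are not taken from hflights.

```r
# Hypothetical helper: quantiles of a delay column within each group.
delay_quantiles_by_group <- function(delay, group,
                                     probs = c(0.25, 0.5, 0.75)) {
  # One column per group, one row per requested quantile
  sapply(split(delay, group), quantile, probs = probs, na.rm = TRUE)
}

# Toy data (made up) standing in for DepDelay and Season.
toy_delay  <- c(-5, 0, 5, 10, 20, 30)
toy_season <- c("Winter", "Winter", "Winter", "Summer", "Summer", "Summer")
q <- delay_quantiles_by_group(toy_delay, toy_season)
print(q)

# On the report's data: delay_quantiles_by_group(hflights_t$DepDelay, hflights_t$Season)
```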

Graphic Analysis

We will analyse the data with respect to departure delay and seasons.

# guide = "none" replaces guide = FALSE, which is deprecated in recent ggplot2 versions
g3 <- ggplot(hflights_t, aes(x = Season, y = DepDelay, fill = DepPos)) +
  geom_bar(stat = "identity", position = "identity") +
  scale_fill_manual(values = c("#CCEEFF", "#FFDDDD"), guide = "none")
g3

For each season, it is clear that there are long delays. However, since this graph maps the values and not the counts, it is difficult to identify where the data is clustered.

The next graph is a violin plot, with one violin per season. We specify scale = "count" so that the area of each violin represents the number of occurrences. We opted for a violin plot after trying a boxplot and realizing that the presence of outliers made that graph difficult to read. It also made clear that the first graph is not a good representation of the data.

g4 <- ggplot(hflights_t, aes(x=Season, y=DepDelay)) + geom_violin(scale = "count") 
g4

From this graph, it appears that there is no significant difference in the departure delay (positive for a true delay, negative for leaving early) based on the season. The next graph is a histogram of the departure delay faceted by season.

g5 <- ggplot(hflights_t, aes(x=DepDelay)) + geom_histogram(binwidth = 10, fill = "pink", colour = "black") + facet_grid(Season ~.)
g5

Here again, the graph shows that seasons do not play a very significant role in the distribution of the data. If anything, it appears that spring has a slightly higher number of delays. The binwidth for this graph is 10; since the delays are in minutes, this seemed a reasonable width to use. To continue our exploration of the data, we will superimpose the carriers to determine whether they are a factor in the departure delays.

g6 <- ggplot(hflights_t, aes(x=DepDelay, fill=as.factor(UniqueCarrier))) + geom_histogram(binwidth = 10, position = "identity", alpha = 0.2) + facet_grid(Season ~.)
g6

In this histogram, the carriers are represented with different colours, and the data is overlaid rather than stacked.

Finally, we will look at the histogram without the season breakdown but keeping the carriers.

g7 <- ggplot(hflights_t, aes(x=DepDelay, fill=as.factor(UniqueCarrier))) + geom_histogram(binwidth = 10, position = "identity", alpha = 0.2)
g7

g8 <- ggplot(hflights_t, aes(x=DepDelay)) + geom_histogram(binwidth = 10, fill = "pink", colour = "black") + facet_grid(UniqueCarrier ~ .)
g8

From this graph, it appears that the carrier has an impact on the departure delay. However, we would need to study the data based on frequency (i.e. # of delays/total flights), since the carriers differ in size: “XE” has the most flights, followed by “CO”, “WN”, and then “OO”. This will be followed up in a separate analysis.
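
As a starting point for that follow-up, the frequency (# of delays / total flights) per carrier could be computed as below. This is a sketch on made-up toy data, not a result; the helper name is hypothetical, and it counts a delay of 0 as delayed because it reuses the DepPos definition (DepDelay >= 0) from the transformation step.

```r
# Hypothetical helper: share of flights per carrier with DepPos == TRUE.
delay_rate_by_carrier <- function(df) {
  # The mean of a logical vector is the proportion of TRUE values
  tapply(df$DepPos, df$UniqueCarrier, mean, na.rm = TRUE)
}

# Toy data frame (made-up values) with the same column names as hflights_t.
toy <- data.frame(
  UniqueCarrier = c("XE", "XE", "CO", "CO", "CO", "WN"),
  DepDelay      = c(5, -2, 0, 10, -1, 3)
)
toy$DepPos <- toy$DepDelay >= 0
rates <- delay_rate_by_carrier(toy)
print(rates)

# On the report's data: delay_rate_by_carrier(hflights_t)
```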

Thank you.