Analysis of hflights data set

This dataset contains all flights departing from Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby) in 2011. The data comes from the Research and Innovative Technology Administration at the Bureau of Transportation Statistics.

hflights: Flights that departed Houston in 2011

A data only package containing commercial domestic flights that departed Houston (IAH and HOU) in 2011.

Version: 0.1
Depends: R (≥ 2.10)
Published: 2013-12-07
Author: Hadley Wickham
Maintainer: Hadley Wickham


The questions I hope to answer are: “What factors are associated with cancelled flights? Are there any unusual circumstances associated with cancelled flights?”

I begin by loading the necessary packages and transforming the data.

library(hflights)
library(plyr)
library(lubridate)
library(ggplot2)

#set the data frame
df <- data.frame(hflights)

#subset of observations where the flight was cancelled
cancelled_df <- subset(df, Cancelled == 1)

#create date column by combining the month, day, and year field 
cancelled_df$Date <- lubridate::mdy(sprintf('%s %s %s', cancelled_df$Month, cancelled_df$DayofMonth, cancelled_df$Year))

#drop columns not needed
drop_col <- names(cancelled_df) %in% c("Year","Month", "DayofMonth",      
                                       "DepTime","ArrTime","ActualElapsedTime","AirTime",
                                      "ArrDelay","DepDelay","TaxiIn","TaxiOut","Diverted",
                                      "FlightNum","TailNum","UniqueCarrier",
                                      "Dest","Distance","DayOfWeek")
new_df <- cancelled_df[!drop_col]

#Map CancellationCode to factor
CCode <- c("A","B","C","D")
CName <- c("carrier","weather","national air system","security")
new_df$CancellationCode <- as.factor(mapvalues(new_df$CancellationCode, CCode,CName))

#verification of the data transformation
head(new_df)
##       Origin Cancelled CancellationCode       Date
## 33074    IAH         1          carrier 2011-01-24
## 35264    IAH         1          weather 2011-01-09
## 63546    HOU         1          weather 2011-01-11
## 67826    HOU         1          carrier 2011-01-19
## 72078    HOU         1          weather 2011-01-27
## 74874    IAH         1          weather 2011-01-31

#Sum cancellations by date of the year
sumdf <- data.frame(new_df$Date, new_df$Cancelled)
colnames(sumdf) <- c("Date","Cancelled")

x <- ddply(sumdf,~Date,summarise,"Count" = sum(Cancelled))

So let's look at the summary data.

summary(x)
##       Date                         Count        
##  Min.   :2011-01-01 00:00:00   Min.   :  1.000  
##  1st Qu.:2011-04-02 00:00:00   1st Qu.:  2.000  
##  Median :2011-06-21 00:00:00   Median :  4.000  
##  Mean   :2011-06-26 06:48:49   Mean   :  9.379  
##  3rd Qu.:2011-09-18 00:00:00   3rd Qu.:  7.000  
##  Max.   :2011-12-31 00:00:00   Max.   :388.000

Whoa, where is this max coming from? This looks like a huge outlier and could be an interesting situation, so let's graph the time series and see what it shows.
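
The plot itself is not reproduced in this text. Here is a minimal sketch of how the daily counts could be graphed with ggplot2. The helper name `plot_daily_cancellations` is hypothetical, and the demo rows reuse only the three February counts reported further down in this analysis, so this is an illustration rather than the original figure.

```r
library(ggplot2)

# Hypothetical helper: draws daily cancellation counts as a line chart.
# `daily` is assumed to have a `Date` column and a numeric `Count` column,
# like the data frame `x` built above.
plot_daily_cancellations <- function(daily) {
  daily$Date <- as.Date(daily$Date)  # accepts Date or POSIXct input
  ggplot(daily, aes(x = Date, y = Count)) +
    geom_line() +
    labs(title = "Cancelled flights per day, Houston 2011",
         x = "Date", y = "Cancelled flights")
}

# Demo rows: only the three dates with more than 100 cancellations.
demo <- data.frame(
  Date  = as.Date(c("2011-02-01", "2011-02-03", "2011-02-04")),
  Count = c(152, 256, 388)
)
p <- plot_daily_cancellations(demo)
```

With the real data the call would be `plot_daily_cancellations(x)`, followed by `print(p)` or simply evaluating the plot object at the console.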

It looks like there is a huge spike in cancelled flights between January and April, so let's get the dates with the highest number of cancelled flights.

Most_Cancelled <- subset(x, x$Count > 100) 
Most_Cancelled
##          Date Count
## 27 2011-02-01   152
## 29 2011-02-03   256
## 30 2011-02-04   388

OK, so we have the dates with the most cancelled flights. I am going to take the observations from Feb 01 to Feb 04.

Now let's look at this timeframe and the reasons associated with the cancellations.

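#keep only cancellations from Feb 01 through Feb 04 (dates must be quoted, and & is needed so both bounds apply)
k <- subset(new_df, as.Date(Date) >= as.Date("2011-02-01") & as.Date(Date) <= as.Date("2011-02-04"))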
k$CancellationCode <- ordered(k$CancellationCode, levels = CName)
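
A quick way to check which reason dominates in a date window is to tabulate the cancellation codes. The sketch below uses base R only. The helper name `count_reasons` is hypothetical, and the `toy` rows are made up purely for illustration; they are not taken from hflights.

```r
# Hypothetical helper: counts cancellations by reason within a date window.
# `df` is assumed to have `Date` and `CancellationCode` columns, like `new_df`.
count_reasons <- function(df, from, to) {
  d <- as.Date(df$Date)  # accepts Date or POSIXct input
  in_window <- d >= as.Date(from) & d <= as.Date(to)
  # Tabulate the reasons and list the most frequent one first
  sort(table(df$CancellationCode[in_window]), decreasing = TRUE)
}

# Made-up rows for illustration only.
toy <- data.frame(
  Date = as.Date(c("2011-02-01", "2011-02-03", "2011-02-04", "2011-03-15")),
  CancellationCode = c("weather", "weather", "carrier", "carrier")
)

reasons <- count_reasons(toy, "2011-02-01", "2011-02-04")
print(reasons)
```

On the analysis data the equivalent call would be `count_reasons(new_df, "2011-02-01", "2011-02-04")`.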

So it looks like weather caused this spike in cancellations. Knowing that the spike is weather-related, a quick Google search for extreme weather conditions turns up a record-breaking blizzard during that period: http://www.srh.noaa.gov/tsa/?n=weather_event_2011feb1. That page describes conditions in Oklahoma, which borders Texas, so it is plausible that the same winter storm had a large impact on air travel in the Houston region.