Introduction

This is the first of a series of posts dealing with a complete example of data analysis, starting from the data gathering and finishing with the publication or dissemination of the results. It is the first post of the planned series dealing with the entire Data Science workflow, see The Prediction of Flight Delays within the Data Science pipeline and is seen as the first version of steps 1 to 10 in that list.

We use a few data sets taken from the Airline on-time performance contest http://stat-computing.org/dataexpo/2009/. Note that results are present on the website, but as far as I know, none have been posted which use R, so this is the reason for using it here. In passing we might touch on specific libraries and also on different ways of doing things. In fact we start with a few command line instructions under Linux and may use different tools to show some graphical results.

Because: “the data consists of flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed”, I arbitrarily selected two data sets, the 2007 and the 2008 to perform similar analysis on these. You may ask why not just one. I just to verify that the outcomes of the to are more or less the same. The 2007, downloaded first, is made of over 7 million observations and 29 variables. 2008 as we shall see, is bigger. To spot differences about the recorded data in the course of the years, I will have a look at older data, the oldest in fact, from 1987. We note that this dataset is smaller so (as you may guess) data increased in time. We will try to see whether this is just due to increased traffic or whether more information (features) was recorded, or perhaps both.

Data download and extraction (command line)

From the supplemental data sources page, I also downloaded airport.csv (information on the airports in the dataset, including their coordinates) and about the carriers (the airlines). From the airport.csv and carriers.csv, we see that there were 3376 active airports and 1491 carriers operating between those as of 2008 (which is the most recent dataset). If you want to make things more complex, weather and individual planes datasets are available from the same page [we might look at this later in this series]

To work comfortably in R Studio, I wish to reduce the amount of data by making a few cuts. First, I gather the dataset field names and save them in a text file (one line header file, later imported in R studio) with a quick command line (CL) one-liner, by invoking awk http://linux.about.com/od/Bash_Scripting_Solutions/a/How-To-Write-Ask-Commands-And-Scripts.htm

awk -F“,” ‘NR==1{print $0}’ OFS=“,” 2007.csv>2007hdr.csv

where: - F helps delimiting with comma (-F“,” as we are dealing with a .csv file); - limit to the first row (NR==1); - ask the output (file containing the header, 2007hdr.csv) to be comma delimited (OFS=“,”)

I now use the fields list to subset the original data (for the meaning of the fields, one may consult the TranStats at the US Department of Transportation, which reports on flights performances (http://www.transtats.bts.gov/TableInfo.asp?Table_ID=236). I decided to filter the downloaded dataset for Destination = San Francisco, like this:

awk -F“,” ‘$18 == “SFO” {print}’ 2007.csv > 2007SFO.csv

where: - $18 is the ‘dest’(=destination) column, the 18th of the file, requiring ‘SFO’ (SFO, San Francisco, by peeking into airport.csv). This reduces the file size from the initial 7 millions+, to 138 422 observations of all the flights to SF during 2007 (or less than 2% of the 2007 traffic).

I repeat these two operations for the 2008 data , including the header (because we don’t know whether the data structure field names etc, is identical for the two files, 20007 and 2008).

For 2008 though, we choose a different destination, say JFK in New York. For the header: awk -F“,” ‘NR==1{print $0}’ OFS=“,” 2008.csv>2008hdr.csv For the data: awk -F“,” ‘$18 == “JFK” {print}’ 2008.csv > 2008SFO.csv

We note that we could have done these two operations by means of a complete awk script, but we prefer a simple one-liner for each and resort to R later to unify the two files.

The following analysis is done primarily for SFO 2007. Later in the series we deal with the other datasets.

Preliminaries: the 2007 data

I load the header file produced with awk and append the data to it. Then I look at its format:

#---------------------------------------- 0. preliminary
# Files in the right environment (session to surce file)
if(!file.exists("./DATA")){dir.create("./DATA")}  
setwd("./DATA")
dateDownloaded<-date()

#-----------------------------------------1. read data
# File is still big and so we use the fast read table from data.table package
suppressMessages(library(data.table))
DT2007 <-fread("2007SFO.csv", data.table=TRUE)  # read as a data table, not dataframe

# read header file produced above to provide col names
Hdr2007<-read.csv("2007hdr.csv", header = TRUE, sep =  ",")

# Assign column names from the header file produced above
setnames(DT2007, names(Hdr2007))


# have a look at the vars
sapply(DT2007, class) 
str(DT2007)
dim(DT2007)
head(DT2007, n=5)
tail(DT2007, n=5)

Looking at the data, we note that: * times (DepTime, CRSDepTime, etc) are expressed as integers, for example “1545” means “15:45”. We may need to grep-transform in time format later; * similarly, Elapsed Times (the time computed from gate departure time to gate arrival time) are expressed as integers; * Variables, like UniqueCarrier, FlightNumber, TailNumber, Origin, etc) are coded as integer or character. We may need to change these to factors (categorical) when needed. * Delays are indicated as integers (in minutes; an explication guide can be found at http://aspmhelp.faa.gov/index.php/Types_of_Delay or http://www.transtats.bts.gov/Fields.asp?Table_ID=236).

Finding missing values (NA’s) and their possible imputation

A bit redundantly, I show that we could just count these:

# -------------------------------------- 2. Exploration (I): NA's: find & show these
 
# count 
colSums(is.na(DT2007)) # 

Or use an in-line function:

# ...or look at % of missing data with a function
missVal <- function(x){sum(is.na(x))/length(x)*100}
apply(DT2007,2,missVal)  # ...along columns 

Or provide a visual map of their location, by aggregating (in order to attribute colors to missing=g and non-missing values)

aggr_plot <- aggr(DT2007, col=c('darkred','wheat'), numbers=TRUE, sortVars=TRUE, 
                  labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram Missing DT2007 data","DT2007 missing Pattern"))

[I previously used the Amelia package which offers similar functionality, but I prefer VIM for the plotting purpose, especially when the dataset is large]

We note that the column CRSElapsedTime has 4 values missing. A few others show between 2753 and 2915 missing values (Arrival or Departures and corresponding delays) each amounting roughly to 2% of this 2007 data, therefore an imputation would be feasible without incurring in too severe errors [For example, we could use mean, median (for numerical) or mode (for factors, when present); or a decision tree as prediction model for missing values; or use the Mice package (interesting package from CRAN repository, which also offers feedback about the plausibility of the imputation and graphical visualization). But we are dealing with scheduled departure times and probably an assignment based on flight number should suffice to assign the departure time univocally. By imputing, we would assume that these flight did actually take off, which is uncertain, and similarly we would ‘invent’ delays that might not have happened.

In summary, as the NA’s amount to a small fraction of the data and are localized in the same ‘data region’, as the map shows, I decide to drop them altogether (note in the prediction part of this post, we’ll choose a different view).

# dropping records whose variables are NA's (re-copying original datatable to dataframe
SFO2007<-as.data.frame(DT2007)
SFO2007<-SFO2007[!is.na(SFO2007$CRSElapsedTime),]
SFO2007<-SFO2007[!is.na(SFO2007$DepTime),]
SFO2007<-SFO2007[!is.na(SFO2007$ArrTime),]
SFO2007<-SFO2007[!is.na(SFO2007$ArrDelay),]
SFO2007<-SFO2007[!is.na(SFO2007$DepDelay),]

# we just check the removal, by mapping again:
aggr_plot <- aggr(SFO2007, col=c('darkred','wheat'), numbers=TRUE, sortVars=TRUE, 
         labels=names(data), cex.axis=.7, gap=3, 
         ylab=c("Histogram Missing SFO2007 data","SFO2007 missing Pattern"))

As expected, there is no histogram in the left part of the plot, an so no missing patterns in the right hand side.

Get acquainted with the data

A systematic approach could be to cross-plot and examine correlations among the features. There are a number of ways for doing this (see for example: [http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software]).

Here, we just show how to visualize a correlation matrix in a scatter plot, a heat map and a correlogram, by selecting some of the columns. The choice of the method is left to you, depending on what you see and on the clarity of reading (which may depend on the dataset). Note that the values of the correlation matrix need to be numeric. So we ignore the first column (all the same, as year is fixed), all the ones containing time strings, and the delays, except ArrDelay, as those might be seen as sub-categories of “delay”, a thing we may wish to check separately):

suppressMessages(library(PerformanceAnalytics))
suppressMessages(library(corrplot))

mycorr <- cor(SFO2007[, c(2,3,4,12,13,14,15,16,19)])

The scatter plot shows the distribution of each variable in histogram form on the main diagonal, the bi-variate scatter plots with a fitted line on the bottom corner; and the correlation coefficient and the “star” level on the top corner (the stars bearing a significance according to the mapping: p-values(0, 0.001, 0.01, 0.05, 0.1, 1) <=> symbols(‘’, ’’, ’’, ‘“.”’, " “):

# the scatter plot of the correlation matrix
chart.Correlation(mycorr, histogram=TRUE, pch=19)

Or using a heat-map, with the color palette, default blue for positive correlation and red for anti-correlation:

col<- colorRampPalette(c("blue", "white", "red"))(20)
# the heatmap
heatmap(x = mycorr, col = col, symm = TRUE)

Or, perhaps with more readable correlogram (just plotting the upper triangle), where blue circles indicate positive correlations, red ones negative correlation, and where we ask (i) reordering, using hierarchical clustering (parameter “hclust”), (ii) ask to include the correlation coefficient (addCoef.col = “black”) and to hide the correlation on the main diagonal:

# the correlogram
corrplot(mycorr, type="upper", order="hclust", addCoef.col = "magenta", tl.col="black", tl.srt=45, diag=FALSE )

[a similar result can be obtained by using the ggcorr function in the GGally package]

The previous results show some obvious correlations: ArrDelay upon DepDelay; and strong correlations involving Distance and Actual - CRSElapsed - and Air times.

Extreme values

As an example, here we seek extreme values of Arrival Delay (ArrDelay)

#---------------------------------------- 3. exploration: extreme values

# Large scale differences among features - delays?
max(SFO2007$ArrDelay); min(SFO2007$ArrDelay);

We see that values span from a dreadful 1166 minutes (over 19 hours!) to -80; so negative values are possible, indicating arrivals ahead of the scheduled time. How common are these values? We may enlist all the flights landed to SFO more than two hours behind schedule, or ahead of the scheduled time, respectively:

TwoPlusDelay<-SFO2007[SFO2007$ArrDelay>120,] # 3589 flights landed with > 2 hours delay
AheadOfTime<-SFO2007[SFO2007$ArrDelay<0,]  # 64375  ahead of time arrivals

But, sticking to Arrdelays, the best is to plot the overall distribution of arrival delays per day of week (for example). First, we assign to the numerical DayOfWeek their names, for example by creating a day of week data-frame, DF, and join it with the selected elements of the SFO207 which show some delay (ArrDelay>0)

suppressMessages(library(dplyr))
# enumerate the days the US way to make a number to name dataframe
DOW <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
sDOW <- substr(DOW, 1, 3)  # short version
DayOfWeek<-c(1,2,3,4,5,6,7) # with corresponding number
DF <-as.data.frame(cbind(DOW, sDOW, DayOfWeek)) # and bind these in a dataframe

# From SFO2007 dataset, take only occurrences showing a delay
HaveDelay<-filter(SFO2007, ArrDelay>0) 
HaveDelay$DayOfWeek<-as.factor(HaveDelay$DayOfWeek)
# to get day names, join the to dataframes by hte common variable (rows of HaveDelay found in DF)
HaveDelay<-right_join(HaveDelay, DF, by="DayOfWeek")

# similarly to have text legend in subsequent plot
#SFO2007$DayOfWeek<-as.factor(SFO2007$DayOfWeek)
#SFO2007<-right_join(SFO2007, DF, by="DayOfWeek")

NDelays <-count(HaveDelay, DayOfWeek) # count occurrences of each day of week

Then load the relevant graphics libraries and produce two plots, one showing the delayed occurrences per day of the week, the other the overall delays distribution *their duration) per day of the week:

suppressMessages(library(ggplot2))
suppressMessages(library(grid))
suppressMessages(library(gridExtra))
suppressMessages(library(cowplot))
suppressMessages(library(ggthemes))
suppressMessages(library(plotly))
 
p1<-ggplot(NDelays, aes(x = NDelays$DayOfWeek, y = sort(NDelays$n, decreasing = TRUE), fill=DF$sDOW)) + geom_bar(stat="identity", width = 0.6) +
  theme_gray() +
  scale_x_discrete(labels=DF$sDOW) +
  coord_cartesian(ylim=c(0, 15000), xlim=c(0.3,7.5)) + 
  theme(panel.grid.major = element_line(size = 0.5)) +
  theme(panel.grid.minor = element_line(colour = "grey", linetype = "dotted")) +
  xlab("\nDay of Week")+  ylab("Counts")+
  theme(legend.position="none") 
# p1 # uncomment to show single plot

p2<-ggplot(SFO2007, aes(SFO2007$ArrDelay, sort(SFO2007$DayOfWeek))) +
  geom_bar(stat = "identity", aes(fill=DayOfWeek)) +
  theme_gray() +
  coord_cartesian(ylim=c(0, 15000), xlim=c(-90, 360)) + 
  scale_x_continuous(breaks=seq(0, 540, 60)) + 
  scale_y_continuous(breaks=seq(0, 15000, 5000)) +    
  xlab("\nDelay (minutes)")+  ylab("Counts") +
  theme(legend.text = element_text(colour="blue", size = 8)) +
  theme(legend.justification=c(1,0), legend.position=c(0.9,0.2))
# p2  # uncomment to show single plot

# show the two plots
grid.arrange(p1, p2, ncol = 2, top = "SFO 2007 Inbound Flight Delays \nper day of week in 2007\n")

We see that the delay distribution is centered around 0 (so luckily planes tend to be on time) and is perhaps Poisson or negative binomial (we may want to dig in deeper on this later on).

[note that we invoked the interactive library plotly; should we wish to manually explore values, it suffices to enter the command ggplotly(p1) instead of just “p1” or “p2”“].

We may ask how many delays occur on a Sunday against Wednesday or Friday to discover that delays are more frequent on Wednesdays:

# subset (for visibility) to extract observations from a given day and whose delay is two
# or more hours. Note that in the US, 1=Sunday, 2=Monday, ...7 = Saturday
DF2007<-SFO2007  # change name as might want to use original later
cond1<-DF2007$DayOfWeek==1 # sunday  
cond4<-DF2007$DayOfWeek==4 # wednesday 
cond6<-DF2007$DayOfWeek==6 # friday 

cond120<-DF2007$ArrDelay>=120 # two hours or more (120 min, thus the name) 

suppressMessages(library(dplyr))
# using the conditions above, extract the relevant observations
Sunday<-filter(DF2007, cond1 & cond120)
Wednesday<-filter(DF2007, cond4 & cond120)
Friday<-filter(DF2007, cond6 & cond120)

Although to put things on an equal footing, these results should be normalized to the total number of flight for those days:

FlightSun<-nrow(filter(DF2007, cond1))
NSunday<-nrow(Sunday)/FlightSun*100  # about 2.26 %

FlightWed<-nrow(filter(DF2007, cond4))
NWednesday<-nrow(Wednesday)/FlightWed*100 # about 2.88 %

FlightFri<-nrow(filter(DF2007, cond6))
NFriday<-nrow(Friday)/FlightFri*100 #  about 1.83 %

So it would be best to fly to SFO on Fridays.

Questions, questions, questions

From the traveler point of view, more fine grained questions could be asked, like “What is the best time of the year to fly between NYC and San Francisco, given that I wish to organize a meeting for 20 people at our headquarters, in Cupertino, Ca. for the next Big Data conference?”, and many more similar questions.

A transportation operator might wonder about the efficiency of the system, in terms of distances and effective flight durations (“What is the dependency of the fraction of the in-flight time to the total flight time for different airlines: can we spot favored airlines or origin airports- e.g. do flights coming from NYC get precedence with respect to others? And further, in terms of delay distributions per: time of the day, of the week, of the month, for a specific airline (UniqueCarrier), for a specific plane (field TailNum) or airport (Origin), and so on.

Looking at the extreme values For distances, we note that values as low as 77 miles are possible:

# Large scale differences among features (II) - distances?
max(DF2007$Distance); min(DF2007$Distance)
## [1] 2704
## [1] 77

How common are these? Does it make sense to fetch a flight for such a short distance? We may ask what is the duration of flights whose distances are this low, by considering the values of the variable AirTime, or in fact just collecting all the interesting columns for such an observation:

SelCols<-select(DF2007, AirTime, Origin, Distance)
NotSoWorthFlying<-DF2007[DF2007$Distance==min(DF2007$Distance), ]

Which shows an hour and 7 minutes between departure from Redding, Ca. (RDD) and SFO (AirTime of 47 min plus 19 min of taxing).

Thus, aside of fuel considerations and geographical distance, a train becomes competitive when able to carry a comparable number of passengers between city centers at a speed greater or equal than about (77 miles / 47 minutes = 77*1.60934/0.783=158.3 Km/h) at a competitive cost. This is within the range of ‘fast’ american trains (~100 mph).

We may ask about terrestrial distances between airports, that could be serviced by a fast train. Thus, extract all Origin airports connected to San Francisco, whose distance to SFO is covered at average speeds below or equal the value found. This would give us a candidate list of train-connectable destinations. [Obviously, we would also need to know whether a railway infrastructure does exist between those destinations. Further, the result should then be generalized to ALL potential nodes of the network (which, in the chosen dataset is just SFO-centric). Along this line, we might, in increased order of granularity:

  1. plot the number of flights getting to SFO each week This can also give us a measure of the weekly passenger traffic, although passengers numbers are not reported in this dataset. But we may assume that planes tend to be pretty filled up these days, and estimate the passenger traffic: from the TailNumber, by accessing the plane information in the FAA dataset(manufacturer + model, obtain the number of built-in seats and estimate a filling ratio).

  2. rank the flights to SFO by distance, and/or duration of the flight …and perhaps invent a speed index = distance/duration and then rank for the ‘slowest’, so to restrict our list of “train connectable candidate routes”.

                 .  .  .  .

Some predictions

We now draw our attention to some predictive task. Above, we dropped a few thousands observation altogether on the basis of unavailable data (NA’s). We now attempt a prediction of those missing data, specifically the delays. In fact, all the other missing data are scheduled data, which are ‘predictable’ just by looking at the Operator, origin, etc. [We note that this issue of Imputation can be addressed with ML algorithms, several of which are available in R packages, like the mentioned MICE library, which offers a vast range of imputation functions].

We can do this in a number of ways.

  1. First, we can predict whether a flight will be delayed: this is a classification problem. A standard Random Forest can be run. To do this, we first define a dependent variable, our target, which is Delayed, of ‘logic’ type (yes = 1 or No = 0);

  2. Secondly, we can predict the amount of the delay, which is a continuous variable. We could do this outright from start with some regression algorithm, or directly with a random forest in ’regression mode, or as a step following the classification done in 1 (find delayed flights).

For the purpose of this exercise, I limit to the classification step. We may look at the delay amount (in minutes) later on.

From the exploration above, we know that some flights actually arrive ahead of scheduled time, so we introduce a second (potential target) variable, Ahead=1 (yes) or 0 (no). This allows for different predictions if we want to (flights which are ahead of schedule).

We first make sure that the factors we need are defined on the entire dataset before splitting and performing training; otherwise we may get into trouble if the levels differ, when doing prediction. Then we extract those missing value records and assign them to the Test set, before splitting the remainder into train and validation.

Data preparation (I): new target variable/s

# convert variables into factors: make sure all variables are numeric or factors, by appropriate conversion. 
new_DF <- as.data.frame(DT2007) # first the read table as a data frame

# Construct the target variable, logical delay = 1, versus 0. We add as a column of zero as initial value (No Delay)
new_DF[, "Delay"] <-0  # no delay

# Similarly for our Ahead (arrival ahead of scheduled will be true=1 when Arrdelay is less than zero)
new_DF[, "Ahead"] <-0  # no early (ahead of schedule) arrival

# Delay: attribute a '1' if Arrdelay is greater than zero
new_DF$Delay[which(new_DF$ArrDelay>0)]<-1

# Ahead: attribute a '1' if Arrdelay is less than zero
new_DF$Ahead[which(new_DF$ArrDelay<0)]<-1

### Data preparation: check / evaluate unbalance - overall counts
# Since we limit here to the Delays, we just count 
sum(new_DF$ArrDelay>0, na.rm=TRUE)
sum(new_DF$ArrDelay<0, na.rm=TRUE)
sum(new_DF$ArrDelay==0, na.rm=TRUE)
sum(is.na(new_DF$ArrDelay))

# just a check
if(nrow(new_DF)!=(sum(new_DF$ArrDelay>0, na.rm=TRUE))+ sum(new_DF$ArrDelay<0, na.rm=TRUE) +
          sum(new_DF$ArrDelay==0, na.rm=TRUE) + sum(is.na(new_DF$ArrDelay)))
{ print("things do not add up porperly, check your data!")}

What have we done here? In principle, the introduction of variables which are correlated to existing ones, may influence the goodness of our ‘fit’. This is a good reason to use tree/RF which are robust against variables correlations.

In passing, we also note that quite a minority of flight arrive exactly on schedule (ArrDelay=0), a mere 2.3%, whereas, somewhat unexpectedly, the number of flights arriving ahead of schedule, at 46.5% is comparable to the number of delayed flights, which is 49% of the total.

Data preparation (II): variable formats

# 1. FACTORS
new_DF$Month<-factor(new_DF$Month)
new_DF$DayOfWeek<-factor(new_DF$DayOfWeek)
new_DF$DayofMonth<-factor(new_DF$DayofMonth)          # the max < than max. Nr. of levels in RF
new_DF$UniqueCarrier<-factor(new_DF$UniqueCarrier)    # Carrier: 17 levels (IATA)
new_DF$Dest<-factor(new_DF$Dest)                      # Destination: just 1, 'us' (SFO)

# The new ones: since they are categorical, tell R so!
new_DF$Delay<-as.factor(new_DF$Delay)
new_DF$Ahead<-as.factor(new_DF$Ahead)

# (we might drop these two variables in the model, see below)
new_DF$FlightNum<-factor(new_DF$FlightNum)       # Flight number: 1142 levels ("flight code")
new_DF$TailNum<-factor(new_DF$TailNum)           # Tail Number: 3272 levels

# this column has no values
new_DF$CancellationCode <- NULL

We note that some variables have a large number of levels and we may need to ‘cut on those’, either because of the chosen algorithm - standard RF algorithm in R library randomforest is limited to 53 levels - as of May 2016 - or because of performances issues (memory, CPU time). For instance, FlightNumber and TailNumber have over 1000 levels. Origin has 66. [Here we are ignoring the problem altogether. Later, in Part II, we show how to deal with multi-level factors in a systematic way].

Data preparation (III): variables with high number of levels

For Origin (departure airport) we can split the levels based on airport Origin whose distance is higher than a threshold value. Let’s see first how the flights are distributed per airport of origin:

new_DF$Origin<-factor(new_DF$Origin)                  # Origin: 66 levels

Origs <-count(new_DF, Origin)   # count occurrences of each airport of Origin 
Dists<-count(new_DF, Distance)  # distances, how many are there? As many as Origin
# (so, btw, no route to SFO is covered by more than one operator)

suppressMessages(library(ggplot2))
suppressMessages(library(ggthemes))
ggplot(Origs, aes(Origin,sort(Origs$n,decreasing = TRUE)))+
geom_bar(stat = "identity", aes(fill=Origin))+
scale_x_discrete(breaks=NULL)+
coord_cartesian(ylim=c(0, 12000)) + 
scale_y_continuous (breaks=seq(0, 12500, 1000))+  # Ticks  
xlab("\nOrigin")+  ylab("Counts")+
annotate("text", x = 20, y = 6000, label = "San Francisco all inbound \n flights per 
         Departure Airport", colour="black", size = 5)+
theme(legend.text = element_text(colour="blue", size = 7))+
annotate("rect", xmin = 6, xmax = 32, ymin = 113000, ymax = 141000, alpha = .1)+
theme(legend.justification=c(1,0), legend.position=c(1,0))

We will split the levels so that those beyond a chosen threshold are called “Far”. We keep the distances (equivalently the origin) as they are until PMD (Palmdale, Ca.) airport included (to be safe with 50 levels for the factor), which corresponds to distances greater than 316 miles, see below:

# as all values are the same, we pipe multi-record selection to threshold value
suppressMessages(library(plyr))
suppressMessages(library(dplyr))
ThreshDist<-((new_DF[new_DF$Origin=="PMD", "Distance"])%>%unique())
ThreshDist  # we get 316 terrestrial miles (US)
## [1] 316

Then we assign the remaining levels to airport Origin that are more distant, by defining a new feature “FarOrigin”, assigned with all Origin airports beyond the threshold.

# observations with distances greater than 316 miles
new_DF$FarOrigin <- new_DF$Origin               # copy all values in new feature
new_DF$FarOrigin <- as.character(new_DF$FarOrigin)     # transform into char
new_DF$FarOrigin[new_DF$Distance >= ThreshDist] <- 'far'  # assign FAR when > threshold
new_DF$FarOrigin <- factor(new_DF$FarOrigin)  # re-factor

# drop the flight and tail in train, test and val
new_DF$FlightNum <- NULL
new_DF$TailNum <- NULL

# since all data refers to 2007, we drop the year column
new_DF$Year <- NULL

# since all data refers to SFO as detination, we drop destination
new_DF$Dest <- NULL

Data splitting on NA’s criterion

Finally, we can proceed with the data split into training, validation and testing. We take the data with no missing values for the temporary train, TrainTmp, the rest to the Test. We recall that we are trying to predict the missing values in arrival delays. Before the final split of TrainTmp in the true Train and Validation, we just show the two NA maps to check this split worked well. The first for the test (missing values in magenta), the second for the train (where of course, we expect none)

# 2. identify and extract records containing ArrDelay with NA values from initial data table
# and put these into a Test Dataframe. The complementary goes in the temporary training

suppressMessages(library(VIM))
# NA values to the Test
Test<-subset(new_DF, is.na(new_DF$ArrDelay))
aggr_plot <- aggr(Test, col=c('blue','magenta'), numbers=TRUE, sortVars=TRUE, 
                  labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram Missing Test data",
                                                                 "Test missing Pattern"))

# Not NA to the temporary train
TrainTmp<-subset(new_DF, !is.na(new_DF$ArrDelay))
aggr_plot <- aggr(TrainTmp, col=c('darkred','wheat'), numbers=TRUE, sortVars=TRUE, 
                  labels=names(data), cex.axis=.7, gap=3, ylab=c("Histogram Missing Train data",
                                                                 "Train missing Pattern"))

# Now Split remaining into Train and Validation with caret package
suppressMessages(library(caret))

# split index on ArrDelay, 70% to train, rest to validation
indxTrain<-createDataPartition(y=TrainTmp$ArrDelay, p=0.70,list=FALSE) 
Train<-TrainTmp[indxTrain,]
Valid<-TrainTmp[-indxTrain,] # - means, the remaining
dim(Train); dim(Valid); dim(Test); # check the files obtained
if ( (nrow(Train) + nrow(Test) + nrow(Valid)) != nrow(new_DF)) { print("dim not matching, check the splits")}
# for convenience and potentail further modification, we use three different file names Trn, Val, Tst:
Trn<-Train
Tst<-Test
Val<-Valid

Train the model

The presence of heterogeneous variables may call for the use of a cforest (package party), but in this version we limit to the popular RF within the randomForest package. Later we will compare with the results of the cforest [Further comparison among different RF implementations is delayed to a further post]. We will then look at the variable importance and for the performance we will use ROC curves.

# -------------- RF - initial run - TEST 1 
suppressMessages(library(party))    # contains conditiona inference trees algorithms (also cforest)
suppressMessages(library(partykit)) # toolkit to inspect trees
suppressMessages(library(randomForest)) # general RF - should we wish to compare

# following two for outcome/ROCs
suppressMessages(library(pROC))
suppressMessages(library(ROCR))
# library(wsrf) # weighted, for unbalanced data

For the choice of the number of variable per level to choose to do tree splitting, mtry, the default “rule of thumb” gives mtry=sqrt(p) for classification. Here p=27, so we take mtry~5.

# --------- RF from package random forest
set.seed(41235)
start.time <- Sys.time()
ftDelay = randomForest(Delay ~ 
    Month + DayofMonth + DayOfWeek + DepTime + CRSDepTime + ArrTime + 
    UniqueCarrier +   ActualElapsedTime + CRSElapsedTime + # ArrDelay + 
    AirTime + DepDelay + Distance + TaxiIn +   TaxiOut + Cancelled + 
    Diverted + CarrierDelay + WeatherDelay + NASDelay + SecurityDelay + 
    LateAircraftDelay, data = Trn, mtry=5, importance=TRUE, ntree=200) 

# record the execution time
end.time <- Sys.time()
time.takenD <- end.time - start.time
time.takenD

# looking at the variable importance
varImpPlot(ftDelay, main = "Variable importance RF\n Accuracy Decrease and Gini index", pch=16,col='blue')

plot(ftDelay,main='Error vs No. of trees plot: Base Model')

Note that we included the variable importance to get a plot of the mean accuracy decrease and the mean Gini index decrease for the variables included in the fit. As expected, the most important variable is the departure delay: clearly if a flight departed with some Delay, it is likely that its arrival will be delayed. Also, the TaxiOut ranks consistently among the first four. Three Delays, Cancelled and Diverted are the last four. Note the latter two have very few values in the dataset ~1% or less). For the model itself, we see that there might be room for error reduction by increasing the number of trees beyond the set 200, although the improvement seems only marginal, so we leave the values set and delay the systematics of the RF investigation to a separate post.

Something similar happens to the Ahead of schedule flights:

# --------- Std RF from package random forest
set.seed(41235)
start.time <- Sys.time()
ftAhead = randomForest(Ahead ~ 
    Month + DayofMonth + DayOfWeek + DepTime + CRSDepTime + ArrTime + 
    UniqueCarrier +   ActualElapsedTime + CRSElapsedTime + # ArrDelay + 
    AirTime + DepDelay + Distance + TaxiIn +   TaxiOut + Cancelled + 
    Diverted + CarrierDelay + WeatherDelay + NASDelay + SecurityDelay + 
    LateAircraftDelay, data = Trn, mtry=5, importance=TRUE, ntree=200) 

# record the execution time
end.time <- Sys.time()
time.takenA <- end.time - start.time
time.takenA

# looking at the variable importance
varImpPlot(ftAhead, main = "Variable importance RF\n Accuracy Decrease and Gini index", pch=16,col='blue')

plot(ftAhead,main='Error vs No. of trees plot: Base Model')

Apply model to Validation - Evaluation and ROC curves

We run the model to the validation data set (Val) and show the confusion matrix.

# run the model on validation
set.seed(41235)
forestVal = predict(ftDelay, newdata=Val)  # this is the prediction object and we seek the probability

table(forestVal, Val$Delay)                # confusion matrix for validation set
##          
## forestVal     0     1
##         0 19023  1516
##         1  1298 18813

We now run the ROC curves routines and evaluate the area under the curve

# ------------- the ROC curves
# First, produce the predicted class probabilites
forest.Val = predict(ftDelay, type='prob', newdata=Val)[,2]  # predict mode is 'prob'
# note, we select the second column of predicted probabilities

# the prediction object will be made passing the predicted class probabilities and the actual labels from Validation
forestpred = prediction(forest.Val, Val$Delay)   

# the predicted object is passed to performance, asking for TP rate and FP rate
forestperf = performance(forestpred, 'tpr', 'fpr')  

plot(forestperf, main='ROC - Random Forest Delay prediction', colorize=T)
abline(a=0,b=1,lwd=2,lty=2,col="gray")    # show the 50% diagonal 

# AUC (Area under the curve)
Pred.Val.AUC<-performance(forestpred, measure="auc")@y.values
Pred.Val.AUC
## [[1]]
## [1] 0.9851274

We obtain ~ 98.6 % of prediction accuracy. On this platform the run-time amounts to more than three minutes. We may want to see what happens when retaining only the most important variables in terms of both accuracy and execution time [A more systematic comparison among RF algorithms and varying feature selection is the subject of a separate oncoming post].

To select the ‘best variables’ we check-in arbitrarily the first 6 from both sets, namely the 8 features: DepDelay, TaxiOut, CRSElapsedTime, ActualElapsedTime, Month, NASDelay, CRSElapsed, DayofMonth. We re-run the fitDelay with only these variables (later we see how to implement this as a function), and reduce the mtry accordingly:

# --------- Std RF from package random forest (II): reduced variables
set.seed(41235)
start.time <- Sys.time()
ftDelay2 = randomForest(Delay ~ DepDelay + TaxiOut + CRSElapsedTime + ActualElapsedTime +                         Month + NASDelay + CRSElapsedTime + DayofMonth, 
                    data = Trn, mtry=3, importance=TRUE, ntree=200) 

# record the execution time
end.time <- Sys.time()
time.takenD2 <- end.time - start.time
time.takenD2

# looking at the variable importance
varImpPlot(ftDelay2, main = "Variable importance RF N.2 \n Accuracy Decrease and Gini
index", pch=16,col='blue')

# confusion matrix
ftDelay2$confusion

# plot the RF fit
plot(ftDelay2, main='Error vs No. of trees plot: Model 2')

Then re-run the model on the validation set…

# run the model on validation (II): reduced variables
set.seed(41235)
forestVal = predict(ftDelay2, newdata=Val)  # this is the prediction object and we seek the probability

table(forestVal, Val$Delay)                # confusion matrix for validation set
##          
## forestVal     0     1
##         0 18971  1266
##         1  1350 19063

…and run the ROC curves routines and evaluate the area under the curve

# ------------- the ROC curves (II): reduced variables
# First, produce the predicted class probabilites
forest.Val = predict(ftDelay2, type='prob', newdata=Val)[,2]  # predict mode is 'prob'
# note, we select the second column of predicted probabilities

# the prediction object will be made passing the predicted class probabilities and the actual labels from Validation
forestpred = prediction(forest.Val, Val$Delay)   

# the predicted object is passed to performance, asking for TP rate and FP rate
forestperf = performance(forestpred, 'tpr', 'fpr')  

plot(forestperf, main='ROC - Random Forest Delay prediction', colorize=T)
abline(a=0,b=1,lwd=2,lty=2,col="gray")    # show the 50% diagonal 

# AUC (Area under the curve)
Pred.Val.AUC<-performance(forestpred, measure="auc")@y.values
Pred.Val.AUC
## [[1]]
## [1] 0.9867611

The increase in accuracy revealed both by the confusion matrix and the AUC upon reduced features and execution time is evident. One would be encouraged to check if further improvements are possible, but for now, we consider the ~98.7% accuracy sufficient for our purpose.

Summary so far

In this exercise, I picked up a dataset of medium size to do analysis in a reasonable time in order to show the steps involved in a prediction (classification) problem. A dataset was subset to a manageable size for a personal laptop. The selection was based on the year and performed at the command line. Upon import in R (using the quicker data table format), an initial exploration, checks on correctness, and an examination of missing values with their mapping was done. Thereafter, I proposed a number of possible questions that the user may want to ask.

Subsequently, I proposed to predict whether the flights which did not have values for the delays (which are the observations with missing values or “NA”) could be assigned to a class ‘delayed’ or not and the corresponding target variable was constructed. The prediction has been done as a Random forest binary classification problem. In the process a few variables were dropped and variable importance was looked at. We ended up with a 98.7% prediction accuracy.

Where from here?

A number of possible modifications can be performed to avoid too many ad hoc operations, technically: - automating at least some of the exploration steps, including the tidying up of the data; - investigate ways to perform a model tuning (trees, mtry and errors), also automated ‘as much as we can’ and - manage the variable importance, with functions, by passing them the dataset/s (or a subset) and a formula.

For the problem at hand we may: 1. examine the feature correlations - although these should not influence the RF model- but we may wish to compare the results with alternative models (A question often asked in is whether we should or not drop variables on the basis of collinearity).

  1. The subsequent step would in fact to run a regression to predict the amount of the delay (in minutes). This could be done by subsetting the data on the condition Delay = true

  2. We have encountered a difficulty related to factors and their levels and we just ignored the problem by dropping values(!). There is a systematic approach to deal with this problem, and we will consider it in Part II.

Now then, we have a work-plan for subsequent posts in this series.