Project Delayed Flights

Alison Burzynski, Gil de Góes, Walter Pico

11/14/2020

Executive summary

The objective of this project is to classify flights as on time or delayed.

We chose an Artificial Neural Network (ANN) because ANNs are known for high predictive accuracy, even though they offer little explanatory value (they are black-box models).

An ANN suits this project because it is a data-driven method and we have more than 7 million flight records for 2018 alone. In future projects, datasets from other years could be added to investigate seasonality as a classification factor.

Problem statement

The problem addressed in this project is the high number of delayed flights relative to on-time flights.

The objective is to build an Artificial Neural Network (ANN) model that classifies flights as on time or delayed.

This algorithm can be useful for airlines, airports, and passengers.

Airlines and airports could use this algorithm to investigate common flight delays based on their characteristics. By doing so, they can track the most common causes of delays and find ways to eliminate or reduce them.

Passengers planning a business trip or a vacation could also use it to avoid the days or months with a greater chance of delay.

This is a classification problem because our outcome variable (dependent variable) is categorical (in this case binary: delayed vs on time).

Definition of delay for this project: any flight with ARR_DELAY > 15 minutes.

Data Preprocessing

Data Description

Loading libraries

library(h2o) # Package used for the ANN. It is open source, can connect to a remote server, and lets us use all available processor threads.

## Warning: package 'h2o' was built under R version 4.0.5

## 
## Attaching package: 'h2o'

## The following objects are masked from 'package:stats':
## 
##     cor, sd, var

## The following objects are masked from 'package:base':
## 
##     %*%, %in%, &&, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc

# The h2o library also gives us the importance of each variable for predicting the dependent variable
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.0.5

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.0     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.0.5

## Warning: package 'tidyr' was built under R version 4.0.5

## Warning: package 'dplyr' was built under R version 4.0.5

## Warning: package 'forcats' was built under R version 4.0.4

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(caTools)

## Warning: package 'caTools' was built under R version 4.0.5

library(lubridate)

## Warning: package 'lubridate' was built under R version 4.0.5

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:h2o':
## 
##     day, hour, month, week, year

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Reading dataset and creating a backup

flights.df <- read.csv("2018.csv")
flights.df.bkp <- flights.df #Backup

First steps to clean the dataset

flights.df <- flights.df.bkp # To restart from scratch, run from this block: it restores the backup instead of reloading the file.

set.seed(123)
flights.df <- sample_frac(flights.df, 0.1) # Using only 10% of the data frame

# Removing records that are not useful or have too much missing data
# Cancelled or diverted flights are not useful for this analysis
flights.df <- flights.df[flights.df$CANCELLED == 0,]
flights.df <- flights.df[flights.df$DIVERTED == 0,]

# Removing other records with no ARR_DELAY value (ARR_DELAY = NA)
flights.df <- flights.df[!is.na(flights.df$ARR_DELAY),]

Encoding the categorical variables as numeric codes. This step is important for the h2o package.

OP_CARRIER.unique <- unique(flights.df$OP_CARRIER, incomparables = FALSE)
flights.df$OP_CARRIER_FAC = as.numeric(factor(flights.df$OP_CARRIER, 
                               levels = c("UA", "AS", "9E", "B6", "EV", "F9", "G4", "HA", "MQ", "NK", "OH", "OO", "VX", "WN", "YV", "YX", "AA", "DL"),
                               labels = c(1:18)))


ORIGIN.unique <- unique(flights.df$ORIGIN, incomparables = FALSE)
flights.df$ORIGIN_FAC = as.numeric(factor(flights.df$ORIGIN, 
                               levels = c("EWR", "LAS", "SNA", "RSW", "ORD", "IAH", "DEN", "SMF", "RIC", "PDX", "MCO", "TYS", "SFO", "JAC", "BOS", "MSY", "MIA", "SEA", "SAT", "SLC", "RDU", "FLL", "IAD", "DFW", "ANC", "MSP", "ALB", "LAX", "IND", "SAN", "BNA", "BDL", "ABQ", "SAV", "PHX", "AUS", "PHL", "SJC", "ORF", "DCA", "LGA", "BWI", "PIT", "OGG", "CLE", "TPA", "MYR", "ROC", "SJU", "EGE", "HNL", "ONT", "PBI", "MKE", "HDN", "JAX", "OKC", "ATL", "SRQ", "BZN", "EUG", "BOI", "RNO", "GEG", "TUS", "LIH", "KOA", "ITO", "PSP", "DTW", "GUC", "OMA", "MTJ", "CLT", "CHS", "MCI", "BIL", "CVG", "CID", "MEM", "AVL", "BUF", "GUM", "JFK", "ADQ", "BET", "SCC", "BRW", "FAI", "JNU", "KTN", "SIT", "OME", "OTZ", "BUR", "OAK", "BLI", "SBA", "STL", "GFK", "SYR", "GSP", "FSD", "DSM", "ILM", "PWM", "BIS", "JAN", "GRB", "OAJ", "BTV", "TLH", "LAN", "MSN", "BMI", "BGR", "ABY", "MOT", "DHN", "MDT", "CMH", "FSM", "HSV", "FAR", "BQK", "GSO", "CHA", "AGS", "MOB", "MGM", "CWA", "TRI", "CSG", "TVC", "VLD", "PIA", "LFT", "GNV", "FAY", "AEX", "EWN", "CAE", "ELM", "GRR", "DAY", "ATW", "HRL", "LGB", "HOU", "STX", "HPN", "SWF", "PVD", "DAB", "BQN", "PSE", "STT", "ORH", "BTR", "ROW", "BHM", "MHK", "SAF", "MAF", "LBB", "COU", "GRK", "MLU", "AVP", "LIT", "LEX", "FNT", "SHV", "TUL", "EVV", "XNA", "VPS", "ICT", "SDF", "ELP", "ECP", "GPT", "CRW", "SBN", "CAK", "LCH", "AMA", "CRP", "PNS", "COS", "PHF", "RST", "HOB", "CLL", "SGF", "MFE", "MHT", "FWA", "LNK", "EYW", "ABE", "MLI", "ISP", "TTN", "SFB", "PIE", "IAG", "BLV", "TOL", "HGR", "ROA", "RFD", "PBG", "OWB", "HTS", "PGD", "USA", "LCK", "MFR", "AZA", "GRI", "FCA", "RAP", "OGD", "PVU", "YNG", "CPR", "MRY", "FAT", "SCK", "GJT", "SMX", "PPG", "SJT", "SUX", "JLN", "CMI", "ABI", "GCK", "TYR", "ALO", "SPI", "GGG", "SPS", "MQT", "DBQ", "SWO", "TXK", "AZO", "CHO", "ACT", "LAW", "BPT", "LBE", "ACY", "HVN", "MLB", "ASE", "LSE", "FLG", "RDM", "STS", "YUM", "DRO", "MEI", "PIB", "SGU", "IFP", "DAL", "PSC", "SBP", "SCE", "COD", "ERI", "ITH", "IMT", "DLH", "IDA", "SUN", 
"HLN", "TWF", "MBS", "GTF", "MDW", "ISN", "MSO", "GTR", "EAU", "CKB", "RKS", "PUB", "CGI", "UIN", "DVL", "JMS", "LAR", "OTH", "HYS", "GCC", "ACV", "RDD", "BFL", "MMH", "LWS", "PIH", "ABR", "APN", "ESC", "PLN", "BJI", "BTM", "CDC", "CIU", "EKO", "HIB", "BGM", "RHI", "BRD", "INL", "BRO", "LRD", "PSM", "CMX", "MKG", "PAH", "YAK", "CDV", "WRG", "PSG", "OGS", "STC", "ADK", "LYH", "BFF", "LBF", "LBL", "FLO", "PGV", "LWB", "SHD", "SLN", "CNY", "ACK", "MVY", "WYS", "HYA", "SPN", "GST", "AKN", "DLG", "BKG", "VEL", "HHH", "PRC", "EAR", "DRT", "CYS", "ART"),
                               labels = c(1:358)))


DEST.unique <- unique(flights.df$DEST, incomparables = FALSE)
flights.df$DEST_FAC = as.numeric(factor(flights.df$DEST, 
                               levels = c("DEN", "SFO", "ORD", "ALB", "OMA", "LAS", "CID", "EWR", "CLE", "PDX", "ATL", "BTV", "LAX", "SMF", "JAC", "TYS", "IAH", "HNL", "MIA", "IAD", "RIC", "RSW", "FLL", "SAN", "SLC", "ICT", "RNO", "EUG", "MCO", "MCI", "DTW", "CLT", "BOS", "PHX", "BZN", "MSY", "SJC", "MSP", "PBI", "DFW", "SAT", "AUS", "LGA", "IND", "TPA", "PHL", "BWI", "BNA", "DCA", "ONT", "SRQ", "SJU", "SNA", "RDU", "HDN", "SEA", "PSP", "BDL", "OKC", "ROC", "TUS", "ABQ", "PWM", "MYR", "KOA", "OGG", "JAX", "LIH", "BUF", "ORF", "EGE", "SAV", "CHS", "GUC", "AVL", "ANC", "PIT", "ITO", "GEG", "MTJ", "BIL", "CVG", "BOI", "MKE", "MFE", "GUM", "JFK", "BET", "ADQ", "SCC", "BRW", "FAI", "KTN", "JNU", "SIT", "PSG", "OME", "OTZ", "OAK", "BLI", "SBA", "BUR", "STL", "GFK", "MDT", "DSM", "JAN", "ABY", "LAN", "CWA", "GSO", "SYR", "AEX", "MOT", "BQK", "BMI", "CAE", "DHN", "EWN", "MSN", "ILM", "BGR", "FAR", "FSD", "GRB", "LFT", "TVC", "CMH", "LEX", "TLH", "GSP", "AGS", "MOB", "OAJ", "MGM", "TRI", "FSM", "PIA", "VLD", "BIS", "GNV", "FAY", "ELM", "CHA", "CSG", "HSV", "GRR", "ATW", "DAY", "HRL", "LGB", "STX", "DAB", "PVD", "SWF", "HPN", "HOU", "BQN", "PSE", "STT", "ORH", "BTR", "ROW", "BHM", "MHK", "SAF", "MAF", "LBB", "COU", "GRK", "MLU", "COS", "BRO", "VPS", "LCH", "AVP", "EVV", "SBN", "GPT", "SGF", "CLL", "SDF", "LRD", "CRW", "MEM", "TUL", "AMA", "LIT", "CAK", "CRP", "PNS", "MLI", "ECP", "RST", "XNA", "FWA", "ELP", "SHV", "FNT", "HOB", "EYW", "LNK", "PHF", "ABE", "TTN", "ISP", "SFB", "PIE", "IAG", "BLV", "TOL", "HGR", "ROA", "RFD", "PBG", "OWB", "HTS", "PGD", "USA", "LCK", "MFR", "GRI", "AZA", "FCA", "RAP", "OGD", "PVU", "YNG", "CPR", "MRY", "FAT", "SCK", "GJT", "SMX", "PPG", "MQT", "AZO", "TYR", "SJT", "SUX", "DBQ", "JLN", "ACT", "ABI", "GCK", "TXK", "ALO", "SPI", "CMI", "BPT", "GGG", "SWO", "LAW", "SPS", "CHO", "LBE", "ACY", "HVN", "MHT", "MLB", "ASE", "LSE", "DRO", "RDM", "FLG", "YUM", "STS", "MEI", "PIB", "SGU", "IFP", "DAL", "PSC", "SBP", "MBS", "GTR", "DLH", "SCE", "ERI", 
"ITH", "IMT", "HLN", "IDA", "GTF", "MDW", "SUN", "TWF", "ISN", "MSO", "CKB", "PUB", "CMX", "EAU", "PAH", "UIN", "RKS", "CGI", "JMS", "DVL", "LAR", "OTH", "GCC", "HYS", "RDD", "ACV", "MMH", "BFL", "LWS", "ABR", "COD", "APN", "ESC", "BJI", "BRD", "BTM", "CDC", "EKO", "HIB", "BGM", "CIU", "PLN", "RHI", "INL", "PIH", "PSM", "MKG", "YAK", "CDV", "WRG", "OGS", "STC", "ADK", "LYH", "BFF", "LBF", "LBL", "FLO", "PGV", "SHD", "LWB", "SLN", "CNY", "ACK", "MVY", "WYS", "HYA", "SPN", "GST", "DLG", "AKN", "BKG", "VEL", "HHH", "PRC", "EAR", "DRT", "CYS", "ART"),
                               labels = c(1:358)))

drop <- c("OP_CARRIER","ORIGIN","DEST")
flights.df <- flights.df[,!names(flights.df) %in% drop]
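
The hard-coded level lists above have to be kept in sync with the data by hand. As a minimal sketch, assuming the exact integer assigned to each code does not matter, the levels could instead be taken from the values present in the column. The helper name encode_codes and the toy vector are hypothetical, and the resulting numbering differs from the one used in this report.

```r
# Hypothetical helper: map each distinct code to an integer,
# numbering the codes in order of first appearance.
encode_codes <- function(x) {
  as.integer(factor(x, levels = unique(x)))
}

# Toy carrier column (illustrative values only)
carriers <- c("UA", "AS", "UA", "DL", "AS")
carrier_codes <- encode_codes(carriers)
carrier_codes
```

Because the numbering depends on the order in which codes appear, the same mapping would have to be reused if new data were encoded later.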

Transforming variable date to a date format

# Transform chr variable FL_DATE into a date variable, this will allow us to easily extract the months and days of each flight
flights.df$FL_DATE_dt <- as_date(flights.df$FL_DATE)
flights.df$month <- month(flights.df$FL_DATE_dt)
flights.df$mday <- mday(flights.df$FL_DATE_dt)
flights.df$wday <- wday(flights.df$FL_DATE_dt)

drop <- c("FL_DATE","FL_DATE_dt")
flights.df <- flights.df[,!names(flights.df) %in% drop]
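
The weekday numbering matters when these codes are later turned into labels. By default, lubridate's wday() counts from 1 = Sunday. The sketch below reproduces that numbering with base R only (as.POSIXlt()$wday counts from 0 = Sunday), using the date of the first record shown in the str() output of this report (month 4, day 16, wday 2). The helper name wday_sunday1 is hypothetical.

```r
# Base-R equivalent of lubridate::wday() with its default week start (1 = Sunday)
wday_sunday1 <- function(d) as.POSIXlt(d)$wday + 1L

d <- as.Date("2018-04-16")   # a Monday
wday_sunday1(d)              # Monday is day 2 when counting starts at Sunday
```

If Monday should be day 1 instead, lubridate's wday() accepts a week_start argument.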

Transforming the dependent variable into a binary variable (0 is on time, 1 is delayed)

flights.df$ARR_DELAY_15 <- ifelse(flights.df$ARR_DELAY <= 15,0,1)
flights.df <- subset(flights.df, select = names(flights.df) != "ARR_DELAY")

Deleting unused variables

drop <- c("OP_CARRIER_FL_NUM","CANCELLED","CANCELLATION_CODE","DIVERTED","CARRIER_DELAY","WEATHER_DELAY","NAS_DELAY","SECURITY_DELAY","LATE_AIRCRAFT_DELAY","Unnamed..27")
flights.df <- flights.df[,!names(flights.df) %in% drop]

Transforming NA values for DEP_DELAY to 0

# DEP_DELAY should not be NA; it should be 0, because it is a calculated variable (the difference in minutes between scheduled and actual departure time).
# Replacing missing DEP_DELAY values with 0 (selecting the column by name rather than by position)
flights.df$DEP_DELAY[is.na(flights.df$DEP_DELAY)] <- 0

Dataset information

Dataset Size

Number of rows: 707715 (nrow(flights.df))
Number of columns: 20 (ncol(flights.df))

Dataset Missing Data

Function to count the missing (NA) values per variable

count_na_per_column <- function(df,column_name="na.count",result.df=NULL){
  if (is.null(result.df)){
    result.df <- data.frame(matrix(0, ncol = 1, nrow = ncol(df)),row.names = colnames(df))
    colnames(result.df) <- column_name
  }
  else {
    if(!column_name %in% colnames(result.df)){
      new_col = rep(0, nrow(result.df))
      result.df <- cbind(new_col , result.df)
      colnames(result.df)[1] <- column_name
    }
  }
  for (row_name in colnames(df)){
    result.df[row_name,column_name] <- sum(is.na(df[,row_name]))
  }
return(result.df)
}

Number of missing data per variable in na.count1

# Calling function count_na_per_column
na.col.df <- count_na_per_column(flights.df, "na.count1")
na.col.df

##                     na.count1
## CRS_DEP_TIME                0
## DEP_TIME                    0
## DEP_DELAY                   0
## TAXI_OUT                    0
## WHEELS_OFF                  0
## WHEELS_ON                   0
## TAXI_IN                     0
## CRS_ARR_TIME                0
## ARR_TIME                    0
## CRS_ELAPSED_TIME            0
## ACTUAL_ELAPSED_TIME         0
## AIR_TIME                    0
## DISTANCE                    0
## OP_CARRIER_FAC              0
## ORIGIN_FAC                  0
## DEST_FAC                    0
## month                       0
## mday                        0
## wday                        0
## ARR_DELAY_15                0
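
For a single data frame, the same per-column counts can also be obtained with a base-R one-liner. This is a sketch on a toy data frame (the values are made up), not a replacement for the function above, which can also accumulate several count columns in one result table.

```r
# Toy data frame with a few missing values (illustrative only)
toy.df <- data.frame(DEP_DELAY = c(5, NA, -2, NA),
                     TAXI_OUT  = c(14, 40, NA, 13),
                     DISTANCE  = c(1216, 160, 431, 316))

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na.counts <- colSums(is.na(toy.df))
na.counts
```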

Dataset Outliers

Summary

summary(flights.df)

##   CRS_DEP_TIME     DEP_TIME      DEP_DELAY          TAXI_OUT       WHEELS_OFF  
##  Min.   :   1   Min.   :   1   Min.   :-122.00   Min.   :  1.0   Min.   :   1  
##  1st Qu.: 911   1st Qu.: 914   1st Qu.:  -5.00   1st Qu.: 11.0   1st Qu.: 930  
##  Median :1320   Median :1326   Median :  -2.00   Median : 15.0   Median :1340  
##  Mean   :1328   Mean   :1333   Mean   :   9.88   Mean   : 17.4   Mean   :1357  
##  3rd Qu.:1735   3rd Qu.:1744   3rd Qu.:   6.00   3rd Qu.: 20.0   3rd Qu.:1759  
##  Max.   :2359   Max.   :2400   Max.   :1861.00   Max.   :196.0   Max.   :2400  
##    WHEELS_ON       TAXI_IN       CRS_ARR_TIME     ARR_TIME    CRS_ELAPSED_TIME
##  Min.   :   1   Min.   :  1.0   Min.   :   1   Min.   :   1   Min.   :  1.0   
##  1st Qu.:1043   1st Qu.:  4.0   1st Qu.:1100   1st Qu.:1048   1st Qu.: 89.0   
##  Median :1501   Median :  6.0   Median :1515   Median :1505   Median :122.0   
##  Mean   :1462   Mean   :  7.6   Mean   :1485   Mean   :1466   Mean   :141.5   
##  3rd Qu.:1911   3rd Qu.:  9.0   3rd Qu.:1918   3rd Qu.:1916   3rd Qu.:171.0   
##  Max.   :2400   Max.   :242.0   Max.   :2400   Max.   :2400   Max.   :703.0   
##  ACTUAL_ELAPSED_TIME    AIR_TIME        DISTANCE      OP_CARRIER_FAC 
##  Min.   : 16.0       Min.   :  8.0   Min.   :  31.0   Min.   : 1.00  
##  1st Qu.: 83.0       1st Qu.: 60.0   1st Qu.: 364.0   1st Qu.: 7.00  
##  Median :118.0       Median : 93.0   Median : 636.0   Median :14.00  
##  Mean   :136.7       Mean   :111.7   Mean   : 803.6   Mean   :11.53  
##  3rd Qu.:167.0       3rd Qu.:141.0   3rd Qu.:1037.0   3rd Qu.:17.00  
##  Max.   :757.0       Max.   :675.0   Max.   :4983.0   Max.   :18.00  
##    ORIGIN_FAC       DEST_FAC          month             mday      
##  Min.   :  1.0   Min.   :  1.00   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 17.0   1st Qu.: 13.00   1st Qu.: 4.000   1st Qu.: 8.00  
##  Median : 36.0   Median : 36.00   Median : 7.000   Median :16.00  
##  Mean   : 58.4   Mean   : 57.14   Mean   : 6.586   Mean   :15.79  
##  3rd Qu.: 74.0   3rd Qu.: 63.00   3rd Qu.:10.000   3rd Qu.:23.00  
##  Max.   :358.0   Max.   :358.00   Max.   :12.000   Max.   :31.00  
##       wday        ARR_DELAY_15   
##  Min.   :1.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000  
##  Median :4.000   Median :0.0000  
##  Mean   :3.945   Mean   :0.1842  
##  3rd Qu.:6.000   3rd Qu.:0.0000  
##  Max.   :7.000   Max.   :1.0000

Boxplot of numeric variables

boxplot(flights.df[,-c(17:20)])

Most of the numeric variables (DEP_TIME, DISTANCE, etc.) show a significant number of outliers. These variables probably do not follow a linear pattern, so they are good candidates for transformations (exp, log, etc.). However, because we are using an ANN, which is designed to capture complex relationships, we decided not to apply any transformation.
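
To put a number on what the boxplot shows, the points drawn beyond the whiskers can be counted with boxplot.stats(), which by default uses the same 1.5 x IQR rule as boxplot(). This is a minimal sketch on made-up values; count_outliers is a hypothetical helper name.

```r
# Hypothetical helper: number of values beyond the boxplot whiskers
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Toy delays in minutes (illustrative only), with one extreme value
toy_delay <- c(-5, -3, -2, 0, 1, 2, 4, 6, 8, 300)
count_outliers(toy_delay)
```

Applied to the report's data, something like sapply(flights.df, count_outliers) would give one count per column.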

Variables information

flights.df$DEP_TIME <- as.integer(flights.df$DEP_TIME)
flights.df$DEP_DELAY <- as.integer(flights.df$DEP_DELAY)
flights.df$TAXI_OUT <- as.integer(flights.df$TAXI_OUT)
flights.df$WHEELS_OFF <- as.integer(flights.df$WHEELS_OFF)
flights.df$WHEELS_ON <- as.integer(flights.df$WHEELS_ON)
flights.df$TAXI_IN <- as.integer(flights.df$TAXI_IN)
flights.df$ARR_TIME <- as.integer(flights.df$ARR_TIME)
flights.df$CRS_ELAPSED_TIME <- as.integer(flights.df$CRS_ELAPSED_TIME)
flights.df$ACTUAL_ELAPSED_TIME <- as.integer(flights.df$ACTUAL_ELAPSED_TIME)
flights.df$AIR_TIME <- as.integer(flights.df$AIR_TIME)
flights.df$DISTANCE <- as.integer(flights.df$DISTANCE)
flights.df$month <- as.integer(flights.df$month)
flights.df$wday <- as.integer(flights.df$wday)


str(flights.df)

## 'data.frame':    707715 obs. of  20 variables:
##  $ CRS_DEP_TIME       : int  1136 1149 1417 1033 1109 1320 1306 1304 600 600 ...
##  $ DEP_TIME           : int  1143 1209 1412 1025 1058 1320 1306 1332 554 554 ...
##  $ DEP_DELAY          : int  7 20 -5 -8 -11 0 0 28 -6 -6 ...
##  $ TAXI_OUT           : int  14 40 13 13 11 16 11 11 12 15 ...
##  $ WHEELS_OFF         : int  1157 1249 1425 1038 1109 1336 1317 1343 606 609 ...
##  $ WHEELS_ON          : int  1442 1318 1527 1030 1139 1524 1421 1740 758 738 ...
##  $ TAXI_IN            : int  14 3 12 8 6 6 15 5 9 11 ...
##  $ CRS_ARR_TIME       : int  1446 1245 1601 1104 1225 1533 1501 1745 815 806 ...
##  $ ARR_TIME           : int  1456 1321 1539 1038 1145 1530 1436 1745 807 749 ...
##  $ CRS_ELAPSED_TIME   : int  190 56 104 91 136 133 235 221 75 126 ...
##  $ ACTUAL_ELAPSED_TIME: int  193 72 87 73 107 130 210 193 73 115 ...
##  $ AIR_TIME           : int  165 29 62 52 90 108 184 177 52 89 ...
##  $ DISTANCE           : int  1216 160 431 316 622 840 1485 1521 352 577 ...
##  $ OP_CARRIER_FAC     : num  2 12 17 12 9 7 6 18 18 18 ...
##  $ ORIGIN_FAC         : num  65 5 80 45 42 29 47 36 182 42 ...
##  $ DEST_FAC           : num  56 148 40 3 3 208 1 87 11 11 ...
##  $ month              : int  4 6 8 6 5 10 7 11 8 1 ...
##  $ mday               : int  16 21 7 18 25 24 9 3 25 22 ...
##  $ wday               : int  2 5 3 2 6 4 2 7 7 2 ...
##  $ ARR_DELAY_15       : num  0 1 0 0 0 0 0 0 0 0 ...

Description of the chosen variables

TAXI_OUT : Taxi Out Time, in Minutes
WHEELS_OFF : Wheels Off Time (local time: hhmm)
WHEELS_ON : Wheels On Time (local time: hhmm)
TAXI_IN : Taxi In Time, in Minutes
CRS_ARR_TIME : CRS Arrival Time (local time: hhmm)
ARR_TIME : Actual Arrival Time (local time: hhmm)
CRS_ELAPSED_TIME : CRS Elapsed Time of Flight, in Minutes
ACTUAL_ELAPSED_TIME: Elapsed Time of Flight, in Minutes
AIR_TIME : Flight Time, in Minutes
DISTANCE : Distance between airports (miles)
ARR_DELAY_15 : Delayed / On Time flight
OP_CARRIER_FAC : Carrier Identification Code
ORIGIN_FAC : Origin Airport
DEST_FAC : Destination Airport
month : Month of the flight
mday : Day of the month of the flight
wday : Day of the week of the flight

Organizing the Dataset

Correlation Matrix before Feature scaling

library(corrplot)

## corrplot 0.91 loaded

corrplot(cor(flights.df[,-c(20)]), type="upper", order="hclust", 
         tl.col="black", tl.srt=45, sig.level = 0.01)

Splitting the data frame into the training and test datasets

set.seed(123)
split = sample.split(flights.df$ARR_DELAY_15, SplitRatio = 0.75)
trainig.df = subset(flights.df, split == TRUE)
test.df = subset(flights.df, split == FALSE)
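
sample.split() keeps the ratio of the two classes roughly the same in both subsets. For readers without caTools, a stratified 75/25 split can be sketched in base R as follows. The helper stratified_split is hypothetical and the labels are made up; it is not the implementation used by caTools.

```r
# Hypothetical helper: returns a logical vector (TRUE = training row),
# sampling the requested ratio separately within each class
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(length(idx) * ratio)
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

set.seed(123)
y.toy <- c(rep(0, 80), rep(1, 20))   # 20% delayed, made-up labels
split.toy <- stratified_split(y.toy, 0.75)
table(y.toy, split.toy)
```

Both subsets keep the 80/20 class ratio of the toy labels.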

Feature scaling (normalization) is necessary for ANN

trainig.df[-20] <- scale(trainig.df[-20])
test.df[-20] <- scale(test.df[-20])
head(trainig.df)

##    CRS_DEP_TIME     DEP_TIME  DEP_DELAY   TAXI_OUT  WHEELS_OFF   WHEELS_ON
## 1   -0.38994768 -0.375914688 -0.0647989 -0.3440704 -0.39408430 -0.03594699
## 2   -0.36347820 -0.245168518  0.2237387  2.2949927 -0.21238690 -0.26843983
## 4   -0.59966737 -0.609672991 -0.3977269 -0.4455728 -0.62910593 -0.80842321
## 7   -0.04380838 -0.053011269 -0.2201653 -0.6485776 -0.07808882 -0.07532078
## 8   -0.04788061 -0.001505202  0.4013003 -0.6485776 -0.02673956  0.52278581
## 10  -1.48130452 -1.542725201 -0.3533365 -0.2425679 -1.47636880 -1.35590636
##        TAXI_IN CRS_ARR_TIME    ARR_TIME CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME
## 1   1.05341039  -0.07404063 -0.01853090        0.6585049           0.7688279
## 2  -0.75910456  -0.46181178 -0.26968077       -1.1636429          -0.8840199
## 4   0.06476587  -0.73383034 -0.79616530       -0.6877088          -0.8703600
## 7   1.21818448   0.03206590 -0.05573829        1.2704202           1.0010462
## 8  -0.42955639   0.50279306  0.51911585        1.0800466           0.7688279
## 10  0.55908813  -1.30873482 -1.33381205       -0.2117746          -0.2966443
##      AIR_TIME   DISTANCE OP_CARRIER_FAC ORIGIN_FAC    DEST_FAC      month
## 1   0.7486128  0.6868359    -1.68959434  0.1005189 -0.01696844 -0.7613265
## 2  -1.1622506 -1.0739797     0.08254487 -0.8169004  1.37179180 -0.1721135
## 4  -0.8390898 -0.8138592     0.08254487 -0.2052876 -0.81701509 -0.1721135
## 7   1.0155717  1.1353769    -0.98073866 -0.1747069 -0.84720553  0.1224930
## 8   0.9172184  1.1954047     1.14582839 -0.3429005  0.45098338  1.3009191
## 10 -0.3192226 -0.3786576     1.14582839 -0.2511585 -0.69625333 -1.6451460
##           mday       wday ARR_DELAY_15
## 1   0.02450626 -0.9899159            0
## 2   0.59387529  0.5359974            1
## 4   0.25225387 -0.9899159            0
## 7  -0.77261039 -0.9899159            0
## 8  -1.45585324  1.5532730            0
## 10  0.70774910 -0.9899159            0
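
The code above standardizes the test set with its own means and standard deviations. A common alternative, sketched here under the assumption that the test set should be treated as unseen data, is to reuse the centers and scales learned on the training set. The toy data frames and object names are hypothetical.

```r
# Toy training and test sets (illustrative values only)
train.toy <- data.frame(x1 = c(1, 2, 3, 4, 5), x2 = c(10, 20, 30, 40, 50))
test.toy  <- data.frame(x1 = c(3, 6),          x2 = c(30, 60))

# Standardize the training set and keep the parameters scale() used
train.scaled <- scale(train.toy)
centers <- attr(train.scaled, "scaled:center")
scales  <- attr(train.scaled, "scaled:scale")

# Apply the training parameters to the test set
test.scaled <- scale(test.toy, center = centers, scale = scales)
test.scaled
```

With this approach a test value equal to the training mean maps to 0, regardless of what else is in the test set.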

Correlation Matrix after Feature scaling

corrplot(cor(trainig.df[,-c(20)]), type="upper", order="hclust", 
         tl.col="black", tl.srt=45, sig.level = 0.01)

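The correlations between the variables are preserved after feature scaling (normalization). This is expected, because scale() applies a linear transformation to each column and Pearson correlation is invariant to such transformations.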
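
Pearson correlation does not change when each variable is shifted and rescaled, which is what scale() does. A small check on made-up numbers follows; stats::cor is called explicitly because h2o masks cor, as the package loading messages show.

```r
set.seed(1)                        # toy data, for reproducibility
x <- rnorm(100)
y <- 2 * x + rnorm(100)

cor_raw    <- stats::cor(x, y)
cor_scaled <- stats::cor(as.vector(scale(x)), as.vector(scale(y)))

# TRUE when the two values agree up to floating-point error
all.equal(cor_raw, cor_scaled)
```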

Data Visualization

Data transformation for the graphs

flights.df.fac <- flights.df
flights.df.fac$ARR_DELAY_15 <- factor(flights.df.fac$ARR_DELAY_15, levels = c(0,1), labels = c("On Time","Delayed"))
flights.df.fac$wday <- factor(flights.df.fac$wday, levels = c(1,2,3,4,5,6,7), labels = c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")) # lubridate::wday() numbers the days from 1 = Sunday by default
flights.df.fac$month <- factor(flights.df.fac$month, levels = c(1,2,3,4,5,6,7,8,9,10,11,12), labels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))

Barplot: Days of the Week vs flight delays (with absolute values)

bar <- ggplot(data = flights.df.fac, aes(x=wday, fill=ARR_DELAY_15)) +
  geom_bar() +
  labs(title="Days of the Week vs flight delays", y="Total flights",x="Days of the week") +
  scale_fill_discrete(name = "Arrived on time or delayed: ") +
  ggthemes::theme_fivethirtyeight()
bar

This graph shows delayed versus on-time flights for each day of the week. Monday, Thursday, and Friday have the most delays. Saturday has the fewest, so Saturday might be the best day to fly for passengers who want to avoid a delay.

Barplot: Days of the Week vs flight delays (with relative values)

bar <- ggplot(data = flights.df.fac, aes(x=wday, fill=ARR_DELAY_15)) +
  geom_bar(position = "fill") +
  labs(title="Days of the Week vs flight delays", y="Share of flights",x="Days of the week") +
  scale_fill_discrete(name = "Arrived on time or delayed: ") +
  ggthemes::theme_fivethirtyeight()
bar

Barplot: Months vs flight delays (with absolute values)

bar <- ggplot(data = flights.df.fac, aes(x=month, fill=ARR_DELAY_15)) +
  geom_bar() +
  labs(title="Months vs flight delays", y="Total flights",x="Months") +
  scale_fill_discrete(name = "Arrived on time or delayed: ") +
  ggthemes::theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
bar

This graph shows delayed and on-time flights for each month. June through August are the busiest months and the ones with the most delays, while September and October have the fewest delays. Note that the months differ in length: February, for example, had only 28 days in 2018, the year shown in the graph.

Barplot: Months vs flight delays (with relative values)

bar <- ggplot(data = flights.df.fac, aes(x=month, fill=ARR_DELAY_15)) +
  geom_bar(position = "fill") +
  labs(title="Months vs flight delays", y="Share of flights",x="Months") +
  scale_fill_discrete(name = "Arrived on time or delayed: ") +
  ggthemes::theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
bar

Barplot: Day of the month vs flight delays (with absolute values)

bar <- ggplot(data = flights.df.fac, aes(x=mday, fill=ARR_DELAY_15)) +
  geom_bar() +
  labs(title="Day of the month vs flight delays", y="Total flights",x="Days of the month") +
  scale_fill_discrete(name = "Arrived on time or delayed: ") +
  ggthemes::theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
bar

This graph shows the number of delayed versus on-time flights for each day of the month. Day 31 has less data than the other days because not every month has 31 days. Days 15 and 20 appear to have more delays than other days, at 50,000 or more.

Barplot: Day of the month vs flight delays (with relative values)

bar <- ggplot(data = flights.df.fac, aes(x=mday, fill=ARR_DELAY_15)) +
  geom_bar(position = "fill") +
  labs(title="Day of the month vs flight delays", y="Share of flights",x="Days of the month") +
  scale_fill_discrete(name = "Arrived on time or delayed: ") +
  ggthemes::theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
bar
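
The relative plots can be complemented with the underlying proportions. This is a minimal sketch, assuming a 0/1 delay indicator like ARR_DELAY_15 and a grouping column such as wday; the toy data frame is made up.

```r
# Toy data: day-of-week code and 0/1 delay indicator (illustrative only)
toy.df <- data.frame(
  wday         = c(1, 1, 2, 2, 2, 3, 3, 3, 3),
  ARR_DELAY_15 = c(0, 1, 0, 0, 1, 0, 0, 0, 1)
)

# The mean of a 0/1 indicator is the share of delayed flights in each group
delay_rate <- tapply(toy.df$ARR_DELAY_15, toy.df$wday, mean)
delay_rate
```

The same call with month or mday as the grouping column would give the proportions behind the other relative plots.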

Research Question

Question

Can we predict/classify whether a flight is delayed or on time using the variables below?

TAXI_OUT : Taxi Out Time, in Minutes
WHEELS_OFF : Wheels Off Time (local time: hhmm)
WHEELS_ON : Wheels On Time (local time: hhmm)
TAXI_IN : Taxi In Time, in Minutes
CRS_ARR_TIME : CRS Arrival Time (local time: hhmm)
ARR_TIME : Actual Arrival Time (local time: hhmm)
CRS_ELAPSED_TIME : CRS Elapsed Time of Flight, in Minutes
ACTUAL_ELAPSED_TIME: Elapsed Time of Flight, in Minutes
AIR_TIME : Flight Time, in Minutes
DISTANCE : Distance between airports (miles)
ARR_DELAY_15 : Delayed / On Time flight
OP_CARRIER_FAC : Carrier Identification Code
ORIGIN_FAC : Origin Airport
DEST_FAC : Destination Airport
month : Month of the flight
mday : Day of the month of the flight
wday : Day of the week of the flight

Method

Advantages:

ANN is an appropriate method for classification with large datasets
ANN is known for strong classification/prediction performance compared with other AI methods
It can capture complex relationships between inputs and outputs

Disadvantages:

Although ANN has strong predictive capabilities, it has no explanatory value (black box); however, the h2o package can help by calculating a variable importance table
There is no variable-selection mechanism, so variables must be selected with care
Heavy computational requirements if there are many variables

Precautions

Precautions that must be taken when using ANN:

Overfitting
Predictors must be carefully chosen
The dataset must be large for training purposes. Extra care is required when the class of interest (e.g., delayed flights) is rare compared to the other classes (oversampling may be necessary)
Converging to a local optimum instead of the global optimum
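
As an illustration of the oversampling precaution, here is a minimal base-R sketch of random oversampling: rows of the rare class are resampled with replacement until both classes are the same size. It is applied to made-up training data; the helper name oversample_minority is hypothetical, and this report's own pipeline does not include this step. For a categorical response, h2o.deeplearning() also offers a balance_classes argument as an alternative.

```r
# Hypothetical helper: random oversampling of the minority class (binary target)
oversample_minority <- function(df, target) {
  counts   <- table(df[[target]])
  minority <- names(counts)[which.min(counts)]
  n_extra  <- max(counts) - min(counts)
  minority_rows <- which(df[[target]] == minority)
  # Draw row positions explicitly so this also works with a single minority row
  extra <- minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
  df[c(seq_len(nrow(df)), extra), , drop = FALSE]
}

set.seed(123)
toy.train <- data.frame(DEP_DELAY    = c(-5, 0, 3, 40, -2, 1, 55, -1, 2, 0),
                        ARR_DELAY_15 = c( 0, 0, 0,  1,  0, 0,  1,  0, 0, 0))
balanced <- oversample_minority(toy.train, "ARR_DELAY_15")
table(balanced$ARR_DELAY_15)
```

Oversampling would normally be applied to the training set only, so that the test set keeps the real class proportions.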

Training the ANN with the training dataset

h2o.init(nthreads = -1) # Uses all available CPU threads

## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\ADMINI~1\AppData\Local\Temp\RtmpmGSEO7\file532c5d952af5/h2o_Administrador_started_from_r.out
##     C:\Users\ADMINI~1\AppData\Local\Temp\RtmpmGSEO7\file532c36ceff/h2o_Administrador_started_from_r.err
## 
## 
## Starting H2O JVM and connecting:  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 seconds 798 milliseconds 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.34.0.3 
##     H2O cluster version age:    1 year, 1 month and 15 days !!! 
##     H2O cluster name:           H2O_started_from_R_Administrador_rej699 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   15.98 GB 
##     H2O cluster total cores:    12 
##     H2O cluster allowed cores:  12 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.3 (2020-10-10)

## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (1 year, 1 month and 15 days)!
## Please download and install the latest version from http://h2o.ai/download/

classifier = h2o.deeplearning(y="ARR_DELAY_15", 
                              standardize = TRUE, # h2o automatically standardizes the data to mean 0 and variance 1
                              categorical_encoding = "OneHotInternal", # Creates a dummy variable for each category of the categorical variables
                              training_frame = as.h2o(trainig.df), 
                              activation = "Rectifier", # Activation function (ReLU)
                              hidden = c(10,10,10,10), # Four hidden layers with 10 neurons each
                              epochs = 50,
                              nfolds = 5, # Cross-validation with 5 folds
                              train_samples_per_iteration = -2) # -2 lets h2o tune this value automatically

Summary of the training

summary(classifier)

## Model Details:
## ==============
## 
## H2ORegressionModel: deeplearning
## Model Key:  DeepLearning_model_R_1669150598387_1 
## Status of Neuron Layers: predicting ARR_DELAY_15, regression, gaussian distribution, Quadratic loss, 541 weights/biases, 13.6 KB, 26,690,166 training samples, mini-batch size 1
##   layer units      type dropout       l1       l2 mean_rate rate_rms momentum
## 1     1    19     Input  0.00 %       NA       NA        NA       NA       NA
## 2     2    10 Rectifier  0.00 % 0.000000 0.000000  0.000666 0.000484 0.000000
## 3     3    10 Rectifier  0.00 % 0.000000 0.000000  0.006861 0.015110 0.000000
## 4     4    10 Rectifier  0.00 % 0.000000 0.000000  0.013645 0.068765 0.000000
## 5     5    10 Rectifier  0.00 % 0.000000 0.000000  0.027511 0.107091 0.000000
## 6     6     1    Linear      NA 0.000000 0.000000  0.037209 0.038230 0.000000
##   mean_weight weight_rms mean_bias bias_rms
## 1          NA         NA        NA       NA
## 2    0.014742   0.460029  0.885805 0.696050
## 3   -0.131421   0.697873  1.245598 0.752515
## 4   -0.319426   1.921777  0.951062 0.214939
## 5   -0.194372   0.698295  1.081703 0.439389
## 6   -0.099810   0.831445  0.378101 0.000000
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 10109 samples **
## 
## MSE:  0.0002928048
## RMSE:  0.01711154
## MAE:  0.005414779
## RMSLE:  0.01093758
## Mean Residual Deviance :  0.0002928048
## 
## 
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.0007267423
## RMSE:  0.02695816
## MAE:  0.003436231
## RMSLE:  0.01973107
## Mean Residual Deviance :  0.0007267423
## 
## 
## Cross-Validation Metrics Summary: 
##                            mean       sd cv_1_valid cv_2_valid cv_3_valid
## mae                    0.003435 0.001859   0.002444   0.001783   0.006350
## mean_residual_deviance 0.000726 0.000343   0.000954   0.000285   0.001038
## mse                    0.000726 0.000343   0.000954   0.000285   0.001038
## r2                     0.995168 0.002278   0.993660   0.998096   0.993101
## residual_deviance      0.000726 0.000343   0.000954   0.000285   0.001038
## rmse                   0.026227 0.006931   0.030884   0.016867   0.032218
## rmsle                  0.019072 0.005629   0.022459   0.010802   0.023811
##                        cv_4_valid cv_5_valid
## mae                      0.004187   0.002410
## mean_residual_deviance   0.000923   0.000432
## mse                      0.000923   0.000432
## r2                       0.993844   0.997141
## residual_deviance        0.000923   0.000432
## rmse                     0.030386   0.020778
## rmsle                    0.022638   0.015649
## 
## Scoring History: 
##              timestamp          duration training_speed   epochs iterations
## 1  2022-11-22 16:00:20         0.000 sec             NA  0.00000          0
## 2  2022-11-22 16:00:20  3 min 35.471 sec 378276 obs/sec  0.18815          1
## 3  2022-11-22 16:00:25  3 min 40.536 sec 375801 obs/sec  3.76873         20
## 4  2022-11-22 16:00:30  3 min 45.616 sec 413554 obs/sec  8.10067         43
## 5  2022-11-22 16:00:35  3 min 50.777 sec 437067 obs/sec 12.80686         68
## 6  2022-11-22 16:00:40  3 min 55.915 sec 454209 obs/sec 17.70161         94
## 7  2022-11-22 16:00:46  4 min  1.052 sec 468365 obs/sec 22.78092        121
## 8  2022-11-22 16:00:51  4 min  6.198 sec 481065 obs/sec 28.05804        149
## 9  2022-11-22 16:00:56  4 min 11.335 sec 495695 obs/sec 33.70313        179
## 10 2022-11-22 16:01:01  4 min 16.388 sec 507819 obs/sec 39.35607        209
## 11 2022-11-22 16:01:06  4 min 21.465 sec 517029 obs/sec 45.00940        239
## 12 2022-11-22 16:01:10  4 min 25.287 sec 533589 obs/sec 50.28423        267
##            samples training_rmse training_deviance training_mae training_r2
## 1         0.000000            NA                NA           NA          NA
## 2     99865.000000       0.12611           0.01590      0.04983     0.89416
## 3   2000391.000000       0.07894           0.00623      0.01193     0.95853
## 4   4299723.000000       0.04875           0.00238      0.00676     0.98419
## 5   6797704.000000       0.07165           0.00513      0.01532     0.96584
## 6   9395769.000000       0.04653           0.00216      0.00991     0.98559
## 7  12091795.000000       0.04472           0.00200      0.00454     0.98669
## 8  14892817.000000       0.04430           0.00196      0.00753     0.98694
## 9  17889152.000000       0.04822           0.00232      0.00530     0.98453
## 10 20889653.000000       0.03462           0.00120      0.00553     0.99202
## 11 23890359.000000       0.02960           0.00088      0.00616     0.99417
## 12 26690166.000000       0.01711           0.00029      0.00541     0.99805
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
## Variable Importances: 
##               variable relative_importance scaled_importance percentage
## 1     CRS_ELAPSED_TIME            1.000000          1.000000   0.286092
## 2  ACTUAL_ELAPSED_TIME            0.616536          0.616536   0.176386
## 3            DEP_DELAY            0.610545          0.610545   0.174672
## 4             AIR_TIME            0.372377          0.372377   0.106534
## 5             DISTANCE            0.138993          0.138993   0.039765
## 6           WHEELS_OFF            0.100363          0.100363   0.028713
## 7         CRS_DEP_TIME            0.088343          0.088343   0.025274
## 8             TAXI_OUT            0.079569          0.079569   0.022764
## 9            WHEELS_ON            0.074556          0.074556   0.021330
## 10            DEP_TIME            0.072232          0.072232   0.020665
## 11        CRS_ARR_TIME            0.071312          0.071312   0.020402
## 12            ARR_TIME            0.057389          0.057389   0.016418
## 13             TAXI_IN            0.051079          0.051079   0.014613
## 14                wday            0.042227          0.042227   0.012081
## 15               month            0.033502          0.033502   0.009585
## 16          ORIGIN_FAC            0.030228          0.030228   0.008648
## 17            DEST_FAC            0.022318          0.022318   0.006385
## 18                mday            0.016937          0.016937   0.004845
## 19      OP_CARRIER_FAC            0.016874          0.016874   0.004828

plot(classifier,
     timestep = "epochs",
     metric = "rmse")

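The summary shows a good model fit: the RMSE is 0.0171 on the training sample and 0.0270 on the combined cross-validation holdout predictions.
The scoring history shows the fit improving as the epochs accumulate: the training RMSE falls from 0.126 after the first iteration to 0.017 at the end, although not monotonically.
The fit is similar in every fold of the cross-validation (RMSE between 0.017 and 0.032, R2 above 0.99), which suggests that the ANN is not overfitting.
Note that H2O reports an H2ORegressionModel: the network was fitted as a regression on the 0/1 values of ARR_DELAY_15, which is why the fit is measured with RMSE and why the predictions are cut at 0.5 in the next step.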
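
As a quick check of the cross-validation claim, the per-fold RMSE values can be copied from the summary table into base R and compared with the training RMSE. This is only a sketch: the numbers are typed in by hand from the output above, and the names fold_rmse and train_rmse are introduced here for illustration.

```r
# Per-fold RMSE copied from the "Cross-Validation Metrics Summary" above
fold_rmse <- c(cv_1 = 0.030884, cv_2 = 0.016867, cv_3 = 0.032218,
               cv_4 = 0.030386, cv_5 = 0.020778)
# Final training RMSE copied from the last row of the scoring history
train_rmse <- 0.01711

cv_mean <- mean(fold_rmse)  # average holdout error across folds
cv_sd   <- sd(fold_rmse)    # spread of the holdout error across folds

print(round(c(mean = cv_mean, sd = cv_sd), 6))
# Ratio of holdout error to training error; values far above 1 would hint at overfitting
print(round(cv_mean / train_rmse, 2))
```

The holdout error is larger than the training error in relative terms, but both are small in absolute terms for a 0/1 response.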

Predicting the training and testing datasets

prob_pred.training = h2o.predict(classifier, newdata = as.h2o(trainig.df[-20]))

y_pred.training = (prob_pred.training > 0.5)
y_pred.training = as.vector(y_pred.training)

prob_pred = h2o.predict(classifier, newdata = as.h2o(test.df[-20]))

y_pred = (prob_pred > 0.5)
y_pred = as.vector(y_pred)
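
The cut-off step above can be illustrated on a plain R vector. The values in pred below are made up for illustration and are not model output. Because the network was fitted as a regression, predictions can fall slightly below 0 or above 1, and the 0.5 threshold still maps them to the two classes.

```r
# Hypothetical numeric predictions (not taken from the model)
pred <- c(-0.02, 0.10, 0.49, 0.51, 0.97, 1.03)

# Values above 0.5 are labelled 1 (delayed), the rest 0 (on time)
pred_class <- as.integer(pred > 0.5)

print(pred_class)
# Share of flights labelled as delayed
print(mean(pred_class))
```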

Model Accuracy

Confusion Matrix, accuracy and significance of the training dataset

library(caret)

## Warning: package 'caret' was built under R version 4.0.5

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(e1071)

## Warning: package 'e1071' was built under R version 4.0.5

cm.training <- confusionMatrix(as.factor(y_pred.training), as.factor(trainig.df$ARR_DELAY_15))
cm.training

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 433013    128
##          1     11  97634
##                                           
##                Accuracy : 0.9997          
##                  95% CI : (0.9997, 0.9998)
##     No Information Rate : 0.8158          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9991          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9987          
##          Pos Pred Value : 0.9997          
##          Neg Pred Value : 0.9999          
##              Prevalence : 0.8158          
##          Detection Rate : 0.8158          
##    Detection Prevalence : 0.8160          
##       Balanced Accuracy : 0.9993          
##                                           
##        'Positive' Class : 0               
##

Confusion Matrix, accuracy and significance of the testing dataset

cm.test <- confusionMatrix(as.factor(y_pred), as.factor(test.df$ARR_DELAY_15))
cm.test

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 144091    134
##          1    251  32453
##                                          
##                Accuracy : 0.9978         
##                  95% CI : (0.9976, 0.998)
##     No Information Rate : 0.8158         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9928         
##                                          
##  Mcnemar's Test P-Value : 3.382e-09      
##                                          
##             Sensitivity : 0.9983         
##             Specificity : 0.9959         
##          Pos Pred Value : 0.9991         
##          Neg Pred Value : 0.9923         
##              Prevalence : 0.8158         
##          Detection Rate : 0.8144         
##    Detection Prevalence : 0.8152         
##       Balanced Accuracy : 0.9971         
##                                          
##        'Positive' Class : 0              
##
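
The headline statistics in the output above can be recomputed by hand from the four counts of the testing confusion matrix. The sketch below uses base R only, and the object names are chosen here for illustration. It follows the output in treating class 0 (on time) as the positive class.

```r
# Counts copied from the testing confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(144091, 251, 134, 32453), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
# Class 0 (on time) is the positive class, as in the caret output
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # on-time flights classified as on time
specificity <- cm["1", "1"] / sum(cm[, "1"])  # delayed flights classified as delayed
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
```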

Results

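The accuracy was over 99% for both the training (99.97%) and testing (99.78%) datasets, and in both cases it was significantly higher than the no-information rate of 81.58% (P-Value [Acc > NIR] < 2.2e-16).
McNemar's test was also significant (p < 2.2e-16 for training, p = 3.382e-09 for testing), which indicates that the two types of misclassification are not equally frequent.
The sensitivity and specificity of the model were over 99% for the training and testing datasets. Because caret took 0 (on time) as the positive class, sensitivity here is the share of on-time flights classified correctly and specificity is the share of delayed flights classified correctly.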
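
The two p-values reported for the testing dataset can be reproduced with base R. The sketch assumes that they come from a one-sided exact binomial test against the no-information rate and from McNemar's test with continuity correction. The counts are copied from the testing confusion matrix.

```r
# Testing confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(144091, 251, 134, 32453), nrow = 2)

correct <- sum(diag(cm))             # correctly classified flights
total   <- sum(cm)                   # all flights in the testing dataset
nir     <- max(colSums(cm)) / total  # no-information rate: share of the largest class

# One-sided exact binomial test: is the accuracy greater than the NIR?
p_acc <- binom.test(correct, total, p = nir, alternative = "greater")$p.value

# McNemar's test: are the two off-diagonal error counts (134 vs 251) balanced?
p_mcnemar <- mcnemar.test(cm)$p.value

print(nir)
print(p_acc)
print(p_mcnemar)
```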

Conclusion

The training and testing results were similar, which suggests that the model is not overfitting.
The accuracy, sensitivity and specificity of the model were above 99% for both the training and testing datasets, which indicates excellent classification accuracy.
Based on the importance of the variables for training the ANN, airlines and airports can get insights into the causes of flight delays. For example:
- DEP_DELAY (difference in minutes between scheduled and actual departure time) is the third most important variable (17.5%), which shows that a late departure contributes to a late arrival.
- The day and month of the flight (wday, mday and month) contribute little to predicting whether the flight will be on time or delayed (about 1% or less each).
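
One possible reason why CRS_ELAPSED_TIME, ACTUAL_ELAPSED_TIME and DEP_DELAY lead the importance table is arithmetic. Assuming the columns follow their usual definitions (an assumption that was not checked against the dataset here), the arrival delay equals the departure delay plus the difference between actual and scheduled elapsed time, so these three inputs would nearly determine the outcome on their own. The flights below are invented for illustration.

```r
# Hypothetical flights (minutes); not rows from the dataset
flights <- data.frame(
  DEP_DELAY           = c(-3, 20, 40, 5),
  CRS_ELAPSED_TIME    = c(120, 95, 180, 60),
  ACTUAL_ELAPSED_TIME = c(118, 100, 150, 75)
)

# Assumed identity: arrival delay = departure delay + (actual - scheduled elapsed time)
flights$ARR_DELAY <- with(flights,
                          DEP_DELAY + ACTUAL_ELAPSED_TIME - CRS_ELAPSED_TIME)

# Project definition of a delayed flight: ARR_DELAY > 15 min
flights$ARR_DELAY_15 <- as.integer(flights$ARR_DELAY > 15)

print(flights)
```

If the identity holds in the data, these variables are only known once the flight has landed, so they explain a delay after the fact rather than predict it in advance.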
Unfortunately, the ANN cannot contribute much for passengers. The variables below are the ones available to a passenger who wants to predict whether a flight will be on time or delayed, but most of them scored very low in the variable importance table:
- DISTANCE : Distance between airports (miles)
- DEST_FAC : Destination Airport
- CRS_ARR_TIME : CRS Arrival Time (local time: hhmm)
- ORIGIN_FAC : Origin Airport
- wday : Day of the week of the flight
- mday : Day of the month of the flight
- month : Month of the flight
- OP_CARRIER_FAC : Carrier Identification Code
- AIR_TIME : Flight Time, in Minutes
In the appendix, an ANN was created using only the variables above. The results show acceptable accuracy (over 80%) and high sensitivity (over 90%). However, the specificity was very low (below 10%).
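
A high accuracy together with a very low specificity is the pattern a model produces when it labels almost every flight as on time. As a reference point, the sketch below scores a hypothetical baseline that always predicts 0 (on time), using the class counts of the testing dataset from the confusion matrix above. The baseline is an illustration and not part of the project.

```r
# Reference class counts of the testing dataset (column sums of the confusion matrix)
n_on_time <- 144091 + 251   # reference = 0
n_delayed <- 134 + 32453    # reference = 1

# Hypothetical baseline: every flight is predicted as 0 (on time)
accuracy    <- n_on_time / (n_on_time + n_delayed)
sensitivity <- n_on_time / n_on_time  # all on-time flights are labelled on time
specificity <- 0 / n_delayed          # no delayed flight is labelled delayed
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
```

The baseline already reaches an accuracy of about 81.6%, so for a dataset this imbalanced the specificity or the balanced accuracy is more informative than accuracy alone.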

Appendix