Comparing Flight Data with Predictive Analysis

This report attempted to determine a predictor metric for departure delays.
None of the variables tested showed a meaningful linear relationship with departure delay.


Data Dictionary:

FlightDate - Flight Date (m/d/yyyy in the source file)
Carrier - Carrier code for the airline
TailNum - Tail Number
FlightNum - Flight Number
Origin - Origin airport code
OriginCityName - Origin city name
OriginState - Origin state
Dest - Destination airport code
DestCityName - Destination city name
DestState - Destination state
CRSDepTime - Scheduled departure time (local time: hhmm)
DepTime - Actual departure time (local time: hhmm)
WheelsOff - Wheels off time (local time: hhmm)
WheelsOn - Wheels on time (local time: hhmm)
CRSArrTime - Scheduled arrival time (local time: hhmm)
ArrTime - Actual arrival time (local time: hhmm)
Cancelled - Cancelled flight indicator (1 = yes)
Diverted - Diverted flight indicator (1 = yes)
CRSElapsedTime - Scheduled elapsed time of flight, in minutes
ActualElapsedTime - Actual elapsed time of flight, in minutes
Distance - Distance between airports, in miles
DepDelay - Difference in minutes between scheduled and actual departure time. Early departures show negative numbers.
DepDelayMinutes - Difference in minutes between scheduled and actual departure time. Early departures set to 0.
DepDel30 - Departure Delay Indicator, 30 Minutes or More (1=Yes)
TaxiOut - Taxi out time, in minutes
TaxiIn - Taxi in time, in minutes
ArrDelay - Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers.
ArrDelayMinutes - Difference in minutes between scheduled and actual arrival time. Early arrivals set to 0.
ArrDel30 - Arrival Delay Indicator, 30 Minutes or More (1=Yes)
FlightTimeBuffer - The difference between scheduled elapsed time and actual elapsed time.
AirTime - The time spent in the air.
AirSpeed - Average speed of the plane in flight (mph)


Load The Data File

flights <- read.csv("domestic_flights_jan_2016.csv", header = TRUE, stringsAsFactors = FALSE)

Load Libraries

library(tidyr)
library(dplyr)
library(knitr)
library(ggvis)
library(xtable)
library(utils)
library(pander)
library(chron)
library(broom)

Check the Data Structures

Field Headings

names(flights)
##  [1] "FlightDate"        "Carrier"           "TailNum"          
##  [4] "FlightNum"         "Origin"            "OriginCityName"   
##  [7] "OriginState"       "Dest"              "DestCityName"     
## [10] "DestState"         "CRSDepTime"        "DepTime"          
## [13] "WheelsOff"         "WheelsOn"          "CRSArrTime"       
## [16] "ArrTime"           "Cancelled"         "Diverted"         
## [19] "CRSElapsedTime"    "ActualElapsedTime" "Distance"

Structure

str(flights)
## 'data.frame':    445827 obs. of  21 variables:
##  $ FlightDate       : chr  "1/6/2016" "1/7/2016" "1/8/2016" "1/9/2016" ...
##  $ Carrier          : chr  "AA" "AA" "AA" "AA" ...
##  $ TailNum          : chr  "N4YBAA" "N434AA" "N541AA" "N489AA" ...
##  $ FlightNum        : int  43 43 43 43 43 43 43 43 43 43 ...
##  $ Origin           : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ OriginCityName   : chr  "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" ...
##  $ OriginState      : chr  "TX" "TX" "TX" "TX" ...
##  $ Dest             : chr  "DTW" "DTW" "DTW" "DTW" ...
##  $ DestCityName     : chr  "Detroit, MI" "Detroit, MI" "Detroit, MI" "Detroit, MI" ...
##  $ DestState        : chr  "MI" "MI" "MI" "MI" ...
##  $ CRSDepTime       : int  1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 ...
##  $ DepTime          : int  1057 1056 1055 1102 1240 1107 1059 1055 1058 1056 ...
##  $ WheelsOff        : int  1112 1110 1116 1115 1300 1118 1113 1107 1110 1110 ...
##  $ WheelsOn         : int  1424 1416 1431 1424 1617 1426 1429 1419 1420 1423 ...
##  $ CRSArrTime       : int  1438 1438 1438 1438 1438 1438 1438 1438 1438 1438 ...
##  $ ArrTime          : int  1432 1426 1445 1433 1631 1435 1438 1431 1428 1434 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CRSElapsedTime   : int  158 158 158 158 158 158 158 158 158 158 ...
##  $ ActualElapsedTime: int  155 150 170 151 171 148 159 156 150 158 ...
##  $ Distance         : int  986 986 986 986 986 986 986 986 986 986 ...

Quick Check for records with Duplicate Tail Numbers.

pandoc.table(head(flights  %>% count(length(unique(TailNum))), style = "grid"))
## 
## --------------------------------
##  length(unique(TailNum))    n   
## ------------------------- ------
##           4239            445827
## --------------------------------

Confirmed records with duplicate tail numbers exist.

There are 445827 Records and 4239 Unique Tail Numbers

Unfortunately, this means I cannot join previously munged data with the new destination city and state fields.
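
A more direct way to see how often each tail number repeats is to count records per tail number. The sketch below uses base R on a small made-up vector; the tail numbers are placeholders rather than values from the data set, and the only assumption is that TailNum is a character vector.

```r
# Hypothetical tail numbers, for illustration only
tail_nums <- c("N001AA", "N002AA", "N001AA", "N003AA", "N001AA", "N002AA")

# Number of records and number of distinct tail numbers
n_records <- length(tail_nums)
n_unique  <- length(unique(tail_nums))

# Records per tail number, most frequent first
per_tail <- sort(table(tail_nums), decreasing = TRUE)

# TRUE when at least one tail number appears more than once
has_duplicates <- anyDuplicated(tail_nums) > 0

print(per_tail)
```

When n_records is larger than n_unique, at least one aircraft appears in more than one record, which is the situation described above.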


Check for Missing Data

Check the Number of Cancelled Flights

pandoc.table(flights %>% count(Cancelled == 1), style = 'grid')
## 
## 
## +------------------+--------+
## |  Cancelled == 1  |   n    |
## +==================+========+
## |      FALSE       | 434162 |
## +------------------+--------+
## |       TRUE       | 11665  |
## +------------------+--------+

There are 11665 cancelled flights

Because this study is only interested in completed flights, the records for cancelled flights will be removed.

flights <- flights %>% filter(Cancelled == 0)

Re-check the Number of Cancelled Flights

pandoc.table(flights %>% count(Cancelled == 1), style = 'grid')
## 
## 
## +------------------+--------+
## |  Cancelled == 1  |   n    |
## +==================+========+
## |      FALSE       | 434162 |
## +------------------+--------+

All cancelled flights have successfully been removed

Check the Number of Incomplete Cases

pandoc.table(flights %>% count(!complete.cases(.)), style = "grid")
## 
## 
## +----------------------+--------+
## |  !complete.cases(.)  |   n    |
## +======================+========+
## |        FALSE         | 433298 |
## +----------------------+--------+
## |         TRUE         |  864   |
## +----------------------+--------+

There are 864 incomplete cases

Sample the incomplete cases

head(flights %>% filter(!complete.cases(.)))
##   FlightDate Carrier TailNum FlightNum Origin        OriginCityName
## 1  1/15/2016      AA  N3ALAA        56    DEN            Denver, CO
## 2  1/15/2016      AA  N3GUAA       208    SFO     San Francisco, CA
## 3  1/10/2016      AA  N3BVAA       210    LAS         Las Vegas, NV
## 4  1/15/2016      AA  N3HFAA       217    LAS         Las Vegas, NV
## 5  1/10/2016      AA  N796AA        34    LAX       Los Angeles, CA
## 6  1/22/2016      AA  N480AA       248    DFW Dallas/Fort Worth, TX
##   OriginState Dest  DestCityName DestState CRSDepTime DepTime WheelsOff
## 1          CO  MIA     Miami, FL        FL       1045    1042      1059
## 2          CA  MIA     Miami, FL        FL        640     638       656
## 3          NV  JFK  New York, NY        NY        820     818       835
## 4          NV  MIA     Miami, FL        FL       2359    2356        15
## 5          CA  JFK  New York, NY        NY        800     759       813
## 6          TX  BNA Nashville, TN        TN        700     655       704
##   WheelsOn CRSArrTime ArrTime Cancelled Diverted CRSElapsedTime
## 1     1852       1639    1902         0        1            234
## 2     1719       1458    1728         0        1            318
## 3     1852       1615    1923         0        1            295
## 4     1630        725    1642         0        1            266
## 5     1906       1629    1923         0        1            329
## 6      730        847     736         0        1            107
##   ActualElapsedTime Distance
## 1                NA     1709
## 2                NA     2585
## 3                NA     2248
## 4                NA     2174
## 5                NA     2475
## 6                NA      631

It appears the incomplete cases are records with “NA” values for ActualElapsedTime
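
One way to confirm which fields are responsible is to count missing values per column instead of reading sampled rows. This is a minimal base R sketch on a made-up data frame; the column names mirror the report, but the values are illustrative only.

```r
# Toy data frame standing in for the flight records (values are made up)
toy <- data.frame(
  Diverted          = c(0, 1, 0, 1),
  CRSElapsedTime    = c(158, 234, 107, 318),
  ActualElapsedTime = c(155, NA, 101, NA)
)

# Count missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Cross-tabulate missingness against the Diverted flag
print(table(Diverted = toy$Diverted,
            MissingElapsed = is.na(toy$ActualElapsedTime)))
```

In the six sampled rows above, every record with a missing ActualElapsedTime also has Diverted = 1, so a cross-tabulation like this would show whether that pattern holds for all incomplete cases.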

Check the Number of Records with ActualElapsedTime == “NA”

pandoc.table(flights %>% count(ActualElapsedTime == "NA"), style = "grid")
## 
## 
## +-----------------------------+--------+
## |  ActualElapsedTime == "NA"  |   n    |
## +=============================+========+
## |            FALSE            | 433298 |
## +-----------------------------+--------+
## |             NA              |  864   |
## +-----------------------------+--------+

There are 864 Records with ActualElapsedTime entered as “NA”

Because this study is only interested in flights with ActualElapsedTime data, the records with “NA” values for ActualElapsedTime will be removed.

flights <- flights %>% filter(!is.na(ActualElapsedTime))
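
Missing values in R are usually tested with is.na(). Comparing a value with the string "NA" returns NA, not TRUE, and such a comparison only appears to work inside dplyr's filter() because filter() drops rows whose condition evaluates to NA. The sketch below shows the difference on a small made-up vector.

```r
# Illustrative elapsed times with one missing value
elapsed <- c(155, NA, 171)

# Comparing with the string "NA" never yields TRUE for a missing value;
# the comparison itself is missing
cmp <- elapsed == "NA"

# is.na() returns a clean logical vector
missing_flag <- is.na(elapsed)

# Keep only the entries with a recorded value
kept <- elapsed[!is.na(elapsed)]

print(cmp)
print(missing_flag)
print(kept)
```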

Re-check the Number of Incomplete Cases

pandoc.table(flights %>% count(!complete.cases(.)), style = "grid")
## 
## 
## +----------------------+--------+
## |  !complete.cases(.)  |   n    |
## +======================+========+
## |        FALSE         | 433298 |
## +----------------------+--------+

All records now have complete data

Data Summary

summary(flights, 6)
##   FlightDate          Carrier            TailNum            FlightNum   
##  Length:433298      Length:433298      Length:433298      Min.   :   1  
##  Class :character   Class :character   Class :character   1st Qu.: 701  
##  Mode  :character   Mode  :character   Mode  :character   Median :1591  
##                                                           Mean   :2077  
##                                                           3rd Qu.:2762  
##                                                           Max.   :7438  
##     Origin          OriginCityName     OriginState       
##  Length:433298      Length:433298      Length:433298     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##      Dest           DestCityName        DestState           CRSDepTime  
##  Length:433298      Length:433298      Length:433298      Min.   :   1  
##  Class :character   Class :character   Class :character   1st Qu.: 920  
##  Mode  :character   Mode  :character   Mode  :character   Median :1325  
##                                                           Mean   :1330  
##                                                           3rd Qu.:1730  
##                                                           Max.   :2359  
##     DepTime       WheelsOff       WheelsOn      CRSArrTime  
##  Min.   :   1   Min.   :   1   Min.   :   1   Min.   :   1  
##  1st Qu.: 924   1st Qu.: 939   1st Qu.:1104   1st Qu.:1118  
##  Median :1331   Median :1344   Median :1519   Median :1527  
##  Mean   :1334   Mean   :1357   Mean   :1483   Mean   :1503  
##  3rd Qu.:1737   3rd Qu.:1750   3rd Qu.:1914   3rd Qu.:1920  
##  Max.   :2400   Max.   :2400   Max.   :2400   Max.   :2359  
##     ArrTime       Cancelled    Diverted CRSElapsedTime  ActualElapsedTime
##  Min.   :   1   Min.   :0   Min.   :0   Min.   : 21.0   Min.   : 15.0    
##  1st Qu.:1108   1st Qu.:0   1st Qu.:0   1st Qu.: 90.0   1st Qu.: 85.0    
##  Median :1522   Median :0   Median :0   Median :128.0   Median :122.0    
##  Mean   :1488   Mean   :0   Mean   :0   Mean   :146.4   Mean   :140.1    
##  3rd Qu.:1919   3rd Qu.:0   3rd Qu.:0   3rd Qu.:180.0   3rd Qu.:173.0    
##  Max.   :2400   Max.   :0   Max.   :0   Max.   :705.0   Max.   :721.0    
##     Distance     
##  Min.   :  31.0  
##  1st Qu.: 391.0  
##  Median : 679.0  
##  Mean   : 843.8  
##  3rd Qu.:1089.0  
##  Max.   :4983.0

Check for (ArrTime > DepTime) to test for arrival times that occur before departure times.

#Checking for negative flight times
pandoc.table(flights %>% count(ArrTime > DepTime), style = "grid")
## 
## 
## +---------------------+--------+
## |  ArrTime > DepTime  |   n    |
## +=====================+========+
## |        FALSE        | 16172  |
## +---------------------+--------+
## |        TRUE         | 417126 |
## +---------------------+--------+
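
A FALSE here does not necessarily indicate bad data. The hhmm fields wrap at midnight, so a flight that leaves late in the evening can show a later event with a smaller clock time; one of the sampled incomplete cases above has DepTime 2356 and WheelsOff 15. The sketch below shows one possible way to handle this: if a later event appears to precede an earlier one, assume it happened on the following day. The helper names to_time and roll_forward are hypothetical, the UTC time zone is an assumption made to avoid daylight-saving effects, and time-zone differences between airports are ignored.

```r
# Build a timestamp from a date and an hhmm integer
# (tz = "UTC" is an assumption of this sketch; the data are in local time)
to_time <- function(date, hhmm) {
  as.POSIXct(paste(date, sprintf("%04d", as.integer(hhmm))),
             format = "%Y-%m-%d %H%M", tz = "UTC")
}

# If the later event appears to precede the earlier one, assume it
# happened on the following day and add 24 hours
roll_forward <- function(later, earlier) {
  later + ifelse(later < earlier, 24 * 60 * 60, 0)
}

flight_date <- as.Date("2016-01-15")
dep <- to_time(flight_date, 2356)  # departs 23:56
off <- to_time(flight_date, 15)    # wheels off 00:15, stamped with the same date

off_fixed  <- roll_forward(off, dep)
taxi_raw   <- as.integer(difftime(off, dep, units = "mins"))
taxi_fixed <- as.integer(difftime(off_fixed, dep, units = "mins"))

print(c(raw = taxi_raw, fixed = taxi_fixed))
```

Without the adjustment the taxi-out time comes out as a large negative number; with it, the result is a plausible 19 minutes.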

Data Preparation

Convert the FlightDate field to a Date class

#This code chunk was taken directly from the lecture notes.

#Converts FlightDate to a Date class
flights$FlightDate <- as.Date(flights$FlightDate, format = "%m/%d/%Y")
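
As a quick illustration of the format string, the sketch below parses a few date strings in the same m/d/yyyy layout as the raw file and shows what happens when a string does not match. The third input is a deliberately malformed value added for the example.

```r
# Dates in the same layout as the raw FlightDate column
raw_dates <- c("1/6/2016", "1/15/2016", "1/22/2016")
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# A string that does not match the format becomes NA, not an error
bad <- as.Date("Jan 6 2016", format = "%m/%d/%Y")

print(parsed)
print(bad)
```

Because mismatches turn into NA silently, counting NA values after the conversion is a cheap way to confirm that every date was parsed.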

Using the sprintf() function to add leading zeros to allow creation of date/time variables

#This code chunk was taken directly from the lecture notes.

#Add leading zeros to allow creation of date/time variables, then paste the Date information from FlightDate
flights <- flights %>% 
  mutate(new_CRSDepTime = paste(FlightDate, sprintf("%04d", CRSDepTime)))
flights$new_CRSDepTime <- as.POSIXct(flights$new_CRSDepTime, format="%Y-%m-%d %H%M")
pandoc.table(head(flights %>% select(CRSDepTime, new_CRSDepTime), style = "grid"))
## 
## --------------------------------
##  CRSDepTime    new_CRSDepTime   
## ------------ -------------------
##     1100     2016-01-06 11:00:00
## 
##     1100     2016-01-07 11:00:00
## 
##     1100     2016-01-08 11:00:00
## 
##     1100     2016-01-09 11:00:00
## 
##     1100     2016-01-10 11:00:00
## 
##     1100     2016-01-11 11:00:00
## --------------------------------

The sprintf() function is repeated for: DepTime, WheelsOff, WheelsOn, CRSArrTime, and ArrTime.
Note: It is not necessary to apply the sprintf() function to CRSElapsedTime and ActualElapsedTime. They are both stored as “minutes”.

#DepTime
flights <- flights %>% 
  mutate(new_DepTime = paste(FlightDate, sprintf("%04d", DepTime)))
flights$new_DepTime <- as.POSIXct(flights$new_DepTime, format="%Y-%m-%d %H%M")
pandoc.table(head(flights %>% select(DepTime, new_DepTime), style = "grid"))
## 
## -----------------------------
##  DepTime      new_DepTime    
## --------- -------------------
##   1057    2016-01-06 10:57:00
## 
##   1056    2016-01-07 10:56:00
## 
##   1055    2016-01-08 10:55:00
## 
##   1102    2016-01-09 11:02:00
## 
##   1240    2016-01-10 12:40:00
## 
##   1107    2016-01-11 11:07:00
## -----------------------------
#WheelsOff
flights <- flights %>% 
  mutate(new_WheelsOff = paste(FlightDate, sprintf("%04d", WheelsOff)))
flights$new_WheelsOff <- as.POSIXct(flights$new_WheelsOff, format="%Y-%m-%d %H%M")
pandoc.table(head(flights %>% select(WheelsOff, new_WheelsOff), style = "grid"))
## 
## -------------------------------
##  WheelsOff     new_WheelsOff   
## ----------- -------------------
##    1112     2016-01-06 11:12:00
## 
##    1110     2016-01-07 11:10:00
## 
##    1116     2016-01-08 11:16:00
## 
##    1115     2016-01-09 11:15:00
## 
##    1300     2016-01-10 13:00:00
## 
##    1118     2016-01-11 11:18:00
## -------------------------------
#WheelsOn
flights <- flights %>% 
  mutate(new_WheelsOn = paste(FlightDate, sprintf("%04d", WheelsOn)))
flights$new_WheelsOn <- as.POSIXct(flights$new_WheelsOn, format="%Y-%m-%d %H%M")
pandoc.table(head(flights %>% select(WheelsOn, new_WheelsOn), style = "grid"))
## 
## ------------------------------
##  WheelsOn     new_WheelsOn    
## ---------- -------------------
##    1424    2016-01-06 14:24:00
## 
##    1416    2016-01-07 14:16:00
## 
##    1431    2016-01-08 14:31:00
## 
##    1424    2016-01-09 14:24:00
## 
##    1617    2016-01-10 16:17:00
## 
##    1426    2016-01-11 14:26:00
## ------------------------------
#CRSArrTime
flights <- flights %>% 
  mutate(new_CRSArrTime = paste(FlightDate, sprintf("%04d", CRSArrTime)))
flights$new_CRSArrTime <- as.POSIXct(flights$new_CRSArrTime, format="%Y-%m-%d %H%M")
pandoc.table(head(flights %>% select(CRSArrTime, new_CRSArrTime), style = "grid"))
## 
## --------------------------------
##  CRSArrTime    new_CRSArrTime   
## ------------ -------------------
##     1438     2016-01-06 14:38:00
## 
##     1438     2016-01-07 14:38:00
## 
##     1438     2016-01-08 14:38:00
## 
##     1438     2016-01-09 14:38:00
## 
##     1438     2016-01-10 14:38:00
## 
##     1438     2016-01-11 14:38:00
## --------------------------------
#ArrTime
flights <- flights %>% 
  mutate(new_ArrTime = paste(FlightDate, sprintf("%04d", ArrTime)))
flights$new_ArrTime <- as.POSIXct(flights$new_ArrTime, format="%Y-%m-%d %H%M")
pandoc.table(head(flights %>% select(ArrTime, new_ArrTime), style = "grid"))
## 
## -----------------------------
##  ArrTime      new_ArrTime    
## --------- -------------------
##   1432    2016-01-06 14:32:00
## 
##   1426    2016-01-07 14:26:00
## 
##   1445    2016-01-08 14:45:00
## 
##   1433    2016-01-09 14:33:00
## 
##   1631    2016-01-10 16:31:00
## 
##   1435    2016-01-11 14:35:00
## -----------------------------
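
The same paste/sprintf/as.POSIXct pattern is repeated for each time field above. One way to reduce the repetition is a small helper applied in a loop. This is only a sketch: hhmm_to_posix is a hypothetical name, the toy data frame holds a few values in the report's layout, and tz = "UTC" is an assumption (the code above uses the session's default time zone).

```r
# Hypothetical helper: combine a Date and an hhmm integer into POSIXct
hhmm_to_posix <- function(date, hhmm, tz = "UTC") {
  as.POSIXct(paste(date, sprintf("%04d", as.integer(hhmm))),
             format = "%Y-%m-%d %H%M", tz = tz)
}

# Toy records in the same layout as the flights data
toy <- data.frame(
  FlightDate = as.Date(c("2016-01-06", "2016-01-07")),
  CRSDepTime = c(1100L, 640L),
  DepTime    = c(1057L, 638L)
)

# Create a new_<field> column for every hhmm field
time_cols <- c("CRSDepTime", "DepTime")
for (col in time_cols) {
  toy[[paste0("new_", col)]] <- hhmm_to_posix(toy$FlightDate, toy[[col]])
}

print(toy)
```

Extending time_cols to the remaining fields would produce the other new_ columns in one pass.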

Data Calculations

In this section, new fields are created using the following calculations:

DepDelay = new_DepTime - new_CRSDepTime
DepDelayMinutes = ifelse(DepDelay < 0, 0, DepDelay)
DepDel30 = ifelse(DepDelay >= 30, 1, 0)
TaxiOut = new_WheelsOff - new_DepTime
TaxiIn = new_ArrTime - new_WheelsOn
ArrDelay = new_ArrTime - new_CRSArrTime
ArrDelayMinutes = ifelse(ArrDelay < 0, 0, ArrDelay)
ArrDel30 = ifelse(ArrDelay >= 30, 1, 0)
FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime
AirTime = ActualElapsedTime - TaxiOut - TaxiIn
AirSpeed = Distance / (AirTime / 60)
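
To make the definitions concrete, the sketch below applies them to the first record shown in the str(flights) output (DFW to DTW on 2016-01-06). The helper names mk and mins are hypothetical, and tz = "UTC" is an assumption of the sketch.

```r
# Helpers for this sketch only
mk <- function(hhmm) {
  as.POSIXct(paste("2016-01-06", sprintf("%04d", as.integer(hhmm))),
             format = "%Y-%m-%d %H%M", tz = "UTC")
}
mins <- function(later, earlier) {
  as.integer(difftime(later, earlier, units = "mins"))
}

# Values of the first record
new_CRSDepTime <- mk(1100); new_DepTime <- mk(1057)
new_WheelsOff  <- mk(1112); new_WheelsOn <- mk(1424)
new_CRSArrTime <- mk(1438); new_ArrTime  <- mk(1432)
CRSElapsedTime <- 158; ActualElapsedTime <- 155; Distance <- 986

# Derived fields, following the definitions listed above
DepDelay         <- mins(new_DepTime, new_CRSDepTime)
DepDelayMinutes  <- ifelse(DepDelay < 0, 0, DepDelay)
DepDel30         <- ifelse(DepDelay >= 30, 1, 0)
TaxiOut          <- mins(new_WheelsOff, new_DepTime)
TaxiIn           <- mins(new_ArrTime, new_WheelsOn)
ArrDelay         <- mins(new_ArrTime, new_CRSArrTime)
FlightTimeBuffer <- CRSElapsedTime - ActualElapsedTime
AirTime          <- ActualElapsedTime - TaxiOut - TaxiIn
AirSpeed         <- Distance / (AirTime / 60)

print(c(DepDelay = DepDelay, TaxiOut = TaxiOut, TaxiIn = TaxiIn,
        ArrDelay = ArrDelay, AirTime = AirTime, AirSpeed = AirSpeed))
```

For this record the flight left 3 minutes early, spent 15 and 8 minutes taxiing, and was airborne for 132 minutes, giving an average speed of about 448 mph.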


Calculating Date/Time Objects

Using the difftime() function to calculate the difference between two date/time objects.
The as.integer() function is used to store the result as an integer.

flights <- flights %>% mutate(DepDelay = as.integer(difftime(new_DepTime, new_CRSDepTime, units = "mins")))

pandoc.table(head(flights %>% select(CRSDepTime, DepTime, DepDelay), style = "grid"))
## 
## ---------------------------------
##  CRSDepTime   DepTime   DepDelay 
## ------------ --------- ----------
##     1100       1057        -3    
## 
##     1100       1056        -4    
## 
##     1100       1055        -5    
## 
##     1100       1102        2     
## 
##     1100       1240       100    
## 
##     1100       1107        7     
## ---------------------------------

Calculating Delay Indicators

Using the ifelse() function to create two delay variables.
DepDelayMinutes holds the departure delay with early departures set to 0, and DepDel30 flags delays of 30 minutes or more (1 = yes).

flights <- flights %>% mutate(DepDelayMinutes = ifelse(DepDelay < 0, 0, DepDelay), 
         DepDel30 = ifelse(DepDelay >= 30, 1, 0))

pandoc.table(head(flights %>% select(DepDelay, DepDelayMinutes, DepDel30), style = "grid"))
## 
## ---------------------------------------
##  DepDelay   DepDelayMinutes   DepDel30 
## ---------- ----------------- ----------
##     -3             0             0     
## 
##     -4             0             0     
## 
##     -5             0             0     
## 
##     2              2             0     
## 
##    100            100            1     
## 
##     7              7             0     
## ---------------------------------------
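
The same two variables can also be built without ifelse(). The sketch below is an alternative formulation, not the method used in this report: pmax() clamps negative delays to zero, and a logical comparison converted with as.integer() gives the 0/1 flag. The input is the six DepDelay values shown above.

```r
# First six DepDelay values from the table above
DepDelay <- c(-3, -4, -5, 2, 100, 7)

# Early departures set to 0
DepDelayMinutes <- pmax(DepDelay, 0)

# 1 when the delay is 30 minutes or more, otherwise 0
DepDel30 <- as.integer(DepDelay >= 30)

print(data.frame(DepDelay, DepDelayMinutes, DepDel30))
```

The results match the ifelse() version; the only difference is that the flag is stored as an integer, not a double.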

Check the number of flights with a delay of 30 minutes or more.

pandoc.table(flights %>% count(DepDel30 == 1), style = 'grid')
## 
## 
## +-----------------+--------+
## |  DepDel30 == 1  |   n    |
## +=================+========+
## |      FALSE      | 391307 |
## +-----------------+--------+
## |      TRUE       | 41991  |
## +-----------------+--------+

There are 41991 records for flights with a delay of 30 minutes or more

Calculating the inbound and outbound taxi times in minutes.

#TaxiOut
#TaxiIn
flights <- flights %>% mutate(TaxiOut = as.integer(difftime(new_WheelsOff, new_DepTime, units = "mins")), 
                           TaxiIn = as.integer(difftime(new_ArrTime, new_WheelsOn, units = "mins")))

pandoc.table(head(flights %>% select(TaxiIn, TaxiOut), style = "grid"))
## 
## ------------------
##  TaxiIn   TaxiOut 
## -------- ---------
##    8        15    
## 
##    10       14    
## 
##    14       21    
## 
##    9        13    
## 
##    14       20    
## 
##    9        11    
## ------------------

Calculating the difference in minutes between scheduled and actual arrival time.
Early arrivals show negative numbers.

#ArrDelay
flights <- flights %>% mutate(ArrDelay = as.integer(difftime(new_ArrTime, new_CRSArrTime, units = "mins")))

pandoc.table(head(flights %>% select(ArrDelay), style = "grid"))
## 
## ----------
##  ArrDelay 
## ----------
##     -6    
## 
##    -12    
## 
##     7     
## 
##     -5    
## 
##    113    
## 
##     -3    
## ----------

Calculating the difference in minutes between scheduled and actual arrival time with early arrivals set to 0.
Creating an Arrival Delay Indicator, 30 Minutes or More (1=Yes)

#ArrDelayMinutes
#ArrDel30
flights <- flights %>% mutate( ArrDelayMinutes = ifelse(ArrDelay < 0, 0, ArrDelay), 
         ArrDel30 = ifelse(ArrDelay >= 30, 1, 0))

pandoc.table(head(flights %>% select(ArrDelayMinutes, ArrDel30), style = "grid"))
## 
## ----------------------------
##  ArrDelayMinutes   ArrDel30 
## ----------------- ----------
##         0             0     
## 
##         0             0     
## 
##         7             0     
## 
##         0             0     
## 
##        113            1     
## 
##         0             0     
## ----------------------------

Arithmetic Calculations

FlightTimeBuffer: The difference between scheduled elapsed time and actual elapsed time.
AirTime: The time spent in the air.
AirSpeed: Average speed of the plane in flight (mph)

#FlightTimeBuffer
flights <- flights %>% mutate(FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime)

#AirTime
flights <- flights %>% mutate(AirTime = ActualElapsedTime - TaxiOut - TaxiIn)

#AirSpeed
flights <- flights %>% mutate(AirSpeed = Distance / (AirTime / 60))

pandoc.table(head(flights %>% select(FlightTimeBuffer, AirTime, AirSpeed), style = "grid"))
## 
## ---------------------------------------
##  FlightTimeBuffer   AirTime   AirSpeed 
## ------------------ --------- ----------
##         3             132      448.2   
## 
##         8             126      469.5   
## 
##        -12            135      438.2   
## 
##         7             129      458.6   
## 
##        -13            137      431.8   
## 
##         10            128      462.2   
## ---------------------------------------

Analysis

Compare Taxi Time, Air Time, and Air Speed for departure delays.


Data Query

Query Delayed Flights for Analysis

# Create the new data frame:  delays
# Query flights with 30 minute or more departure delays
delays <- flights %>% select(FlightDate, Carrier, Origin, OriginState, Dest, TaxiOut, TaxiIn, AirTime, AirSpeed, DestState, DepDelay, DepDel30) %>%
  filter(DepDel30 == 1) %>% arrange(FlightDate)

kable(head(delays[1:9]), align = 'c')
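| FlightDate | Carrier | Origin | OriginState | Dest | TaxiOut | TaxiIn | AirTime | AirSpeed |
|:----------:|:-------:|:------:|:-----------:|:----:|:-------:|:------:|:-------:|:--------:|
| 2016-01-01 |   AA    |  LAX   |     CA      | DCA  |   17    |   4    |   253   | 548.0632 |
| 2016-01-01 |   AA    |  DFW   |     TX      | MIA  |   16    |   4    |   124   | 542.4194 |
| 2016-01-01 |   AA    |  MIA   |     FL      | DFW  |   12    |   8    |   159   | 423.0189 |
| 2016-01-01 |   AA    |  MCO   |     FL      | JFK  |   18    |   5    |   117   | 484.1026 |
| 2016-01-01 |   AA    |  SLC   |     UT      | DFW  |   10    |   5    |   122   | 486.3934 |
| 2016-01-01 |   AA    |  DFW   |     TX      | SLC  |   11    |   5    |   137   | 433.1387 |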

More Data Preparation

After reviewing the data queries, I realized some flights have negative taxi time and will need to be removed.
Records with airtimes greater than 600 minutes will also be removed.

negTaxiOut <- flights %>% filter(TaxiOut <= 0)
nrow(negTaxiOut)
## [1] 1088
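# As written, this repeats the TaxiOut filter, so the count equals the one above;
# filter(TaxiIn <= 0) would count non-positive inbound taxi times instead
negTaxiIn <- flights %>% filter(TaxiOut <= 0)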
nrow(negTaxiIn)
## [1] 1088
longAirTime <- flights %>% filter(AirTime > 600)
nrow(longAirTime)
## [1] 2512

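Removing records with non-positive taxi-out times and records with air times of 600 minutes or more. In the full flights data, 1088 records have a TaxiOut of zero or less and 2512 have an AirTime above 600 minutes; the filter below is applied to the delays subset, so the number of records dropped from it may be smaller.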

delays <- delays %>% filter(TaxiOut > 0) %>%  filter(TaxiOut > 0) %>%  filter(AirTime < 600)
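
The filter above tests TaxiOut twice, so records with a non-positive TaxiIn are not excluded by it. The sketch below shows, on a made-up data frame, how a filter that covers both taxi times differs from the one as written. It uses base R subsetting, and all values are illustrative. The regression results reported below were produced with the filter as written, so applying the stricter version to the real data could change them.

```r
# Toy records (values are made up)
toy <- data.frame(
  TaxiOut = c(15, -1421, 12, 20),
  TaxiIn  = c(8, 6, -2, 9),
  AirTime = c(132, 1570, 140, 650)
)

# Filter as written above: TaxiOut is tested twice, TaxiIn not at all
as_written <- toy[toy$TaxiOut > 0 & toy$TaxiOut > 0 & toy$AirTime < 600, ]

# Filter covering both taxi times
clean <- toy[toy$TaxiOut > 0 & toy$TaxiIn > 0 & toy$AirTime < 600, ]

print(nrow(as_written))
print(nrow(clean))
```

The third toy record has a negative TaxiIn but passes the filter as written, which is why the two row counts differ.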

Regression Models

With the suspicious data removed, I can revisit my regression models.
The plot comparisons look much better than before.

# Plot matrix of all variables.
plot(delays %>% select(TaxiOut, TaxiIn, AirTime),
       pch=9, col="blue", main="Matrix Scatterplot of TaxiOut, TaxiIn, and AirTime")


Linear regression models to estimate the expected departure delay as a linear function of:
Taxi Time and Air Time

#TaxiOut
TaxiOut.lm1 <- lm(DepDelay ~TaxiOut, data = delays)

#TaxiIn
TaxiIn.lm1 <- lm(DepDelay ~TaxiIn, data = delays)

#AirTime
AirTime.lm1 <- lm(DepDelay ~AirTime, data = delays)

Significance of the coefficient and goodness of fit for each variable.

Create a data frame to view the results:

Predictor <- c("TaxiOut", "TaxiIn", "AirTime")

rSquared <- c(glance(TaxiOut.lm1)$r.squared,
               glance(TaxiIn.lm1)$r.squared,
               glance(AirTime.lm1)$r.squared)

pValue <- c(glance(TaxiOut.lm1)$p.value,
               glance(TaxiIn.lm1)$p.value,
               glance(AirTime.lm1)$p.value)

FitTable <- data.frame(Predictor, rSquared, pValue)

pandoc.table(FitTable, style = "grid")
## 
## 
## +-------------+------------+-----------+
## |  Predictor  |  rSquared  |  pValue   |
## +=============+============+===========+
## |   TaxiOut   |  0.003515  | 1.529e-33 |
## +-------------+------------+-----------+
## |   TaxiIn    |  0.001346  | 8.29e-14  |
## +-------------+------------+-----------+
## |   AirTime   | 3.318e-05  |  0.2413   |
## +-------------+------------+-----------+

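The R^2 values are all below 0.004, so none of these predictors explains a meaningful share of the variation in departure delay. The small p-values for TaxiOut and TaxiIn reflect the large sample size and do not indicate a useful fit.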
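
A very small p-value together with a near-zero R^2 is common with large samples. The sketch below illustrates this on simulated data (not the flight data): the true slope is small relative to the noise, yet with 50000 observations the overall F test is highly significant. It uses base R only; the p-value is computed from the F statistic, which is the quantity broom::glance() reports as p.value for an lm fit.

```r
set.seed(42)

# Simulated data: a weak linear signal buried in noise
n <- 50000
x <- rnorm(n)
y <- 0.05 * x + rnorm(n)

fit <- lm(y ~ x)
s <- summary(fit)

# Share of variance explained by the model
r_squared <- s$r.squared

# p-value of the overall F test
f <- s$fstatistic
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

print(c(r_squared = r_squared, p_value = p_value))
```

The p-value only says the slope is unlikely to be exactly zero; R^2 is what indicates how much of the variation the predictor accounts for.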

We can plot the TaxiOut time and clearly see it is not a good predictor of Departure Delays.

delays %>%
  ggvis(~TaxiOut, ~DepDelay) %>% 
  layer_points(fill := "blue", size := 10) %>%
  layer_model_predictions(model = "lm", se = TRUE, stroke := "red")

Now I will check whether departure delay predicts air speed.

But first I am going to restrict the data to delays of less than 200 minutes.

delays2 <- delays %>% filter(DepDelay < 200)

Linear Regression Model

#AirSpeed
AirSpeed.lm2 <- lm(AirSpeed ~DepDelay, data = delays2)

Significance of the coefficient and goodness of fit.

Create a data frame to view the results:

Predictor <- c("AirSpeed")

rSquared <- c(glance(AirSpeed.lm2)$r.squared)

pValue <- c(glance(AirSpeed.lm2)$p.value)

FitTable2 <- data.frame(Predictor, rSquared, pValue)

pandoc.table(FitTable2, style = "grid")
## 
## 
## +-------------+------------+----------+
## |  Predictor  |  rSquared  |  pValue  |
## +=============+============+==========+
## |  AirSpeed   |  0.001414  | 8.66e-14 |
## +-------------+------------+----------+

The R^2 value again indicates a very poor fit.

We can plot AirSpeed against DepDelay and see that departure delay is not a good predictor of air speed.

delays2 %>%
  ggvis(~DepDelay, ~AirSpeed) %>% 
  layer_points(fill := "blue", size := 10) %>%
  layer_model_predictions(model = "lm", se = TRUE, stroke := "red")

Still not a good fit.