1. Understanding and preparing the data

To answer the questions above, we are going to analyze the provided dataset, which contains 1936758 domestic US flights from 2008 together with their causes of delay, diversion and cancellation.

The data comes from the U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS).

Metadata explanations

This dataset is composed of the following variables:

Year: 2008
Month: 1-12
DayofMonth: 1-31
DayOfWeek: 1 (Monday) - 7 (Sunday)
DepTime: actual departure time (local, hhmm)
CRSDepTime: scheduled departure time (local, hhmm)
ArrTime: actual arrival time (local, hhmm)
CRSArrTime: scheduled arrival time (local, hhmm)
UniqueCarrier: unique carrier code
FlightNum: flight number
TailNum: plane tail number (aircraft registration, unique aircraft identifier)
ActualElapsedTime: in minutes
CRSElapsedTime: in minutes
AirTime: in minutes
ArrDelay: arrival delay, in minutes. A flight is counted as “on time” if it operated less than 15 minutes later than the scheduled time shown in the carriers’ Computerized Reservations Systems (CRS).
DepDelay: departure delay, in minutes
Origin: origin IATA airport code
Dest: destination IATA airport code
Distance: in miles
TaxiIn: taxi in time, in minutes
TaxiOut: taxi out time, in minutes
Cancelled: was the flight cancelled (1 = yes, 0 = no)
CancellationCode: reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
Diverted: 1 = yes, 0 = no
CarrierDelay: in minutes. Carrier delay is within the control of the air carrier. Examples of occurrences that may determine carrier delay are: aircraft cleaning, aircraft damage, awaiting the arrival of connecting passengers or crew, baggage, bird strike, cargo loading, catering, computer, outage-carrier equipment, crew legality (pilot or attendant rest), damage by hazardous goods, engineering inspection, fueling, handling disabled passengers, late crew, lavatory servicing, maintenance, oversales, potable water servicing, removal of unruly passenger, slow boarding or seating, stowing carry-on baggage, weight and balance delays.
WeatherDelay: in minutes. Weather delay is caused by extreme or hazardous weather conditions that are forecasted or manifest themselves on point of departure, enroute, or on point of arrival.
NASDelay: in minutes. Delay that is within the control of the National Airspace System (NAS) may include: non-extreme weather conditions, airport operations, heavy traffic volume, air traffic control, etc.
SecurityDelay: in minutes. Security delay is caused by evacuation of a terminal or concourse, re-boarding of aircraft because of security breach, inoperative screening equipment and/or long lines in excess of 29 minutes at screening areas.
LateAircraftDelay: in minutes. Arrival delay at an airport due to the late arrival of the same aircraft at a previous airport. The ripple effect of an earlier delay at downstream airports is referred to as delay propagation.

Load Packages

library(tidyr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(scales) # Percentage calculation
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
data <- read_csv("~/Desktop/Flight Delays 2008/DelayedFlights.csv") # Data input
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_integer(),
##   Year = col_integer(),
##   Month = col_integer(),
##   DayofMonth = col_integer(),
##   DayOfWeek = col_integer(),
##   CRSDepTime = col_integer(),
##   CRSArrTime = col_integer(),
##   UniqueCarrier = col_character(),
##   FlightNum = col_integer(),
##   TailNum = col_character(),
##   Origin = col_character(),
##   Dest = col_character(),
##   Distance = col_integer(),
##   Cancelled = col_integer(),
##   CancellationCode = col_character(),
##   Diverted = col_integer()
## )
## See spec(...) for full column specifications.
summary(data)
##        X1               Year          Month          DayofMonth   
##  Min.   :      0   Min.   :2008   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.:1517452   1st Qu.:2008   1st Qu.: 3.000   1st Qu.: 8.00  
##  Median :3242558   Median :2008   Median : 6.000   Median :16.00  
##  Mean   :3341651   Mean   :2008   Mean   : 6.111   Mean   :15.75  
##  3rd Qu.:4972467   3rd Qu.:2008   3rd Qu.: 9.000   3rd Qu.:23.00  
##  Max.   :7009727   Max.   :2008   Max.   :12.000   Max.   :31.00  
##                                                                   
##    DayOfWeek        DepTime       CRSDepTime      ArrTime    
##  Min.   :1.000   Min.   :   1   Min.   :   0   Min.   :   1  
##  1st Qu.:2.000   1st Qu.:1203   1st Qu.:1135   1st Qu.:1316  
##  Median :4.000   Median :1545   Median :1510   Median :1715  
##  Mean   :3.985   Mean   :1519   Mean   :1467   Mean   :1610  
##  3rd Qu.:6.000   3rd Qu.:1900   3rd Qu.:1815   3rd Qu.:2030  
##  Max.   :7.000   Max.   :2400   Max.   :2359   Max.   :2400  
##                                                NA's   :7110  
##    CRSArrTime   UniqueCarrier        FlightNum      TailNum         
##  Min.   :   0   Length:1936758     Min.   :   1   Length:1936758    
##  1st Qu.:1325   Class :character   1st Qu.: 610   Class :character  
##  Median :1705   Mode  :character   Median :1543   Mode  :character  
##  Mean   :1634                      Mean   :2184                     
##  3rd Qu.:2014                      3rd Qu.:3422                     
##  Max.   :2400                      Max.   :9742                     
##                                                                     
##  ActualElapsedTime CRSElapsedTime     AirTime          ArrDelay     
##  Min.   :  14.0    Min.   :-25.0   Min.   :   0.0   Min.   :-109.0  
##  1st Qu.:  80.0    1st Qu.: 82.0   1st Qu.:  58.0   1st Qu.:   9.0  
##  Median : 116.0    Median :116.0   Median :  90.0   Median :  24.0  
##  Mean   : 133.3    Mean   :134.3   Mean   : 108.3   Mean   :  42.2  
##  3rd Qu.: 165.0    3rd Qu.:165.0   3rd Qu.: 137.0   3rd Qu.:  56.0  
##  Max.   :1114.0    Max.   :660.0   Max.   :1091.0   Max.   :2461.0  
##  NA's   :8387      NA's   :198     NA's   :8387     NA's   :8387    
##     DepDelay          Origin              Dest              Distance     
##  Min.   :   6.00   Length:1936758     Length:1936758     Min.   :  11.0  
##  1st Qu.:  12.00   Class :character   Class :character   1st Qu.: 338.0  
##  Median :  24.00   Mode  :character   Mode  :character   Median : 606.0  
##  Mean   :  43.19                                         Mean   : 765.7  
##  3rd Qu.:  53.00                                         3rd Qu.: 998.0  
##  Max.   :2467.00                                         Max.   :4962.0  
##                                                                          
##      TaxiIn           TaxiOut         Cancelled         CancellationCode  
##  Min.   :  0.000   Min.   :  0.00   Min.   :0.0000000   Length:1936758    
##  1st Qu.:  4.000   1st Qu.: 10.00   1st Qu.:0.0000000   Class :character  
##  Median :  6.000   Median : 14.00   Median :0.0000000   Mode  :character  
##  Mean   :  6.813   Mean   : 18.23   Mean   :0.0003268                     
##  3rd Qu.:  8.000   3rd Qu.: 21.00   3rd Qu.:0.0000000                     
##  Max.   :240.000   Max.   :422.00   Max.   :1.0000000                     
##  NA's   :7110      NA's   :455                                            
##     Diverted         CarrierDelay     WeatherDelay       NASDelay     
##  Min.   :0.000000   Min.   :   0.0   Min.   :   0.0   Min.   :   0    
##  1st Qu.:0.000000   1st Qu.:   0.0   1st Qu.:   0.0   1st Qu.:   0    
##  Median :0.000000   Median :   2.0   Median :   0.0   Median :   2    
##  Mean   :0.004004   Mean   :  19.2   Mean   :   3.7   Mean   :  15    
##  3rd Qu.:0.000000   3rd Qu.:  21.0   3rd Qu.:   0.0   3rd Qu.:  15    
##  Max.   :1.000000   Max.   :2436.0   Max.   :1352.0   Max.   :1357    
##                     NA's   :689270   NA's   :689270   NA's   :689270  
##  SecurityDelay    LateAircraftDelay
##  Min.   :  0.0    Min.   :   0.0   
##  1st Qu.:  0.0    1st Qu.:   0.0   
##  Median :  0.0    Median :   8.0   
##  Mean   :  0.1    Mean   :  25.3   
##  3rd Qu.:  0.0    3rd Qu.:  33.0   
##  Max.   :392.0    Max.   :1316.0   
##  NA's   :689270   NA's   :689270
head(data)
## # A tibble: 6 x 30
##      X1  Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime
##   <int> <int> <int>      <int>     <int>   <dbl>      <int>   <dbl>
## 1     0  2008     1          3         4    2003       1955    2211
## 2     1  2008     1          3         4     754        735    1002
## 3     2  2008     1          3         4     628        620     804
## 4     4  2008     1          3         4    1829       1755    1959
## 5     5  2008     1          3         4    1940       1915    2121
## 6     6  2008     1          3         4    1937       1830    2037
## # ... with 22 more variables: CRSArrTime <int>, UniqueCarrier <chr>,
## #   FlightNum <int>, TailNum <chr>, ActualElapsedTime <dbl>,
## #   CRSElapsedTime <dbl>, AirTime <dbl>, ArrDelay <dbl>, DepDelay <dbl>,
## #   Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <dbl>, TaxiOut <dbl>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>,
## #   CarrierDelay <dbl>, WeatherDelay <dbl>, NASDelay <dbl>,
## #   SecurityDelay <dbl>, LateAircraftDelay <dbl>

The delay-cause columns are only populated when the arrival delay is 15 minutes or more; for those flights, ArrDelay is the sum of CarrierDelay, WeatherDelay, NASDelay, SecurityDelay and LateAircraftDelay. More often than not, airports and carriers schedule a CRSElapsedTime that is longer than the time actually spent on taxi out, air time and taxi in (ActualElapsedTime). This is why planes that take off on time usually land ahead of schedule. The same buffer also helps absorb the delay that a late aircraft would otherwise pass down a chain of flights.
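The relationship between ArrDelay and the cause columns can be checked row by row. The following is a minimal sketch on a small invented data frame (the `toy` rows, `CauseTotal` and `Matches` are illustrative names, not part of the dataset); the same check could be run on `data` once it is loaded.

```r
# Toy rows shaped like the dataset; the values are invented for illustration.
toy <- data.frame(
  ArrDelay          = c(34, 12, 57),
  CarrierDelay      = c(2, NA, 10),
  WeatherDelay      = c(0, NA, 0),
  NASDelay          = c(0, NA, 47),
  SecurityDelay     = c(0, NA, 0),
  LateAircraftDelay = c(32, NA, 0)
)

cause_cols <- c("CarrierDelay", "WeatherDelay", "NASDelay",
                "SecurityDelay", "LateAircraftDelay")

# Sum the five causes per flight; rows without a breakdown stay NA.
toy$CauseTotal <- rowSums(toy[, cause_cols])

# TRUE where the breakdown adds up to the arrival delay, NA where it is missing.
toy$Matches <- toy$CauseTotal == toy$ArrDelay
print(toy[, c("ArrDelay", "CauseTotal", "Matches")])
```

The second toy row mimics a flight that arrived less than 15 minutes late: it has no cause breakdown, so the comparison yields NA rather than FALSE.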

2. Exploratory Data analysis

2.1 Which airline should we avoid?

# Keep readable copies of the weekday and month before converting to factors.
# DayOfWeek runs from 1 (Monday) to 7 (Sunday); Month runs from 1 to 12.
weekday_names <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday',
                   'Friday', 'Saturday', 'Sunday')
data$WeekDay <- weekday_names[data$DayOfWeek]
data$MonthQual <- month.name[data$Month]

# Define levels
data$UniqueCarrier <- factor(data$UniqueCarrier)
data$Year <- factor(data$Year)
data$Month <- factor(data$Month)
data$Origin <- factor(data$Origin)
data$Dest <- factor(data$Dest)

ggplot(data, aes(Month, fill = UniqueCarrier)) + 
  geom_bar(width=0.8, position="dodge", color="black")+
  labs(x = "Month", y = "Count",title = "Flight Delay Counts by Airline Carriers", fill = "Airline Carriers")+
  theme(legend.text = element_text(colour="blue", size=10, 
                                   face="bold"))

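The most frequent delays come from WN (Southwest Airlines) and AA (American Airlines). These are raw counts, so carriers that operate more flights will naturally show more delays.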
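The bar chart can be backed up with a plain count per carrier. Below is a minimal base R sketch on an invented vector of carrier codes standing in for `data$UniqueCarrier`; it ranks carriers by number of delayed flights and shows each carrier's share of all delays in the sample.

```r
# Invented carrier codes standing in for data$UniqueCarrier.
carriers <- c("WN", "AA", "WN", "UA", "WN", "AA", "DL", "WN")

# Count delayed flights per carrier and sort from most to least frequent.
delay_counts <- sort(table(carriers), decreasing = TRUE)

# Express each count as a share (in percent) of all delayed flights in the sample.
delay_share <- round(100 * prop.table(delay_counts), 1)

print(delay_counts)
print(delay_share)
```

A share of delays is still not a delay rate per flight flown: that would need each carrier's total number of flights, which does not appear to be available in this file.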

2.2 When do flight delays happen?

2.2.1 Which day of the week has the most flight delays?

data %>% group_by(WeekDay) %>% 
tally %>% arrange(desc(n)) 
## # A tibble: 7 x 2
##     WeekDay      n
##       <chr>  <int>
## 1    Friday 323259
## 2    Monday 290933
## 3  Thursday 289451
## 4    Sunday 286111
## 5 Wednesday 262805
## 6   Tuesday 260943
## 7  Saturday 223256

The most common day for flight delays is Friday, which isn’t surprising as it’s also one of the most common days to fly.

2.2.2 Which month of the year has the most flight delays?

data %>% group_by(MonthQual) %>% 
  tally %>% arrange(desc(n))
## # A tibble: 12 x 2
##    MonthQual      n
##        <chr>  <int>
##  1  December 203385
##  2      June 200914
##  3     March 200842
##  4  February 189534
##  5   January 183527
##  6      July 182945
##  7    August 162648
##  8     April 155264
##  9       May 153072
## 10  November 105563
## 11   October 103525
## 12 September  95539

The most common month for flight delays is December.

2.2.3 Which day of the month has the most flight delays?

ggplot(data, aes(x=DayofMonth, fill=DayofMonth))+geom_bar()

The most common day of the month for flight delays is the 22nd.

2.3 How likely is a flight cancellation?

data$Cancelled[data$Cancelled == 0] = 'No'
data$Cancelled[data$Cancelled == 1] = 'Yes'

qplot(factor(Cancelled), data=data, geom="bar", fill=factor(Cancelled))

# Flight cancellation comparison

data %>% group_by(Cancelled) %>%
  tally %>% arrange(desc(n))
## # A tibble: 2 x 2
##   Cancelled       n
##       <chr>   <int>
## 1        No 1936125
## 2       Yes     633
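percent(633/1936758)
## [1] "0.0327%"
# 0.0327% of the flights in this dataset were cancelled.

Cancellations are very rare here: 633 of 1936758 flights, or 0.0327%. Because the file only contains flights that departed late (the minimum DepDelay is 6 minutes), this is the rate within this dataset and not the cancellation rate for all flights.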
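The rate follows directly from the two counts in the tally. This is a minimal sketch that recomputes it in base R, using `sprintf` instead of `scales::percent` so that it runs without extra packages (the variable names are illustrative).

```r
# Counts taken from the tally of the Cancelled column.
n_cancelled <- 633
n_not_cancelled <- 1936125
n_total <- n_cancelled + n_not_cancelled  # total rows in the dataset

# Proportion of cancelled flights, then formatted as a percentage.
rate <- n_cancelled / n_total
rate_label <- sprintf("%.4f%%", 100 * rate)
print(rate_label)
```

Deriving the denominator from the two counts avoids typing the total by hand, where a single extra digit shifts the result by a factor of ten.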

2.4 Why are flights cancelled?

data$CancellationCode[data$CancellationCode == 'A'] = 'Carrier'
data$CancellationCode[data$CancellationCode == 'B'] = 'Weather'
data$CancellationCode[data$CancellationCode == 'C'] = 'NAS'
data$CancellationCode[data$CancellationCode == 'D'] = 'Security'

data %>% filter(CancellationCode != 'N') %>%
  group_by(CancellationCode) %>%
  tally %>%
  arrange(desc(n))
## # A tibble: 3 x 2
##   CancellationCode     n
##              <chr> <int>
## 1          Weather   307
## 2          Carrier   246
## 3              NAS    80
# Weather is the biggest reason

CancelledSubset <- subset(data, CancellationCode != 'N')
ggplot(CancelledSubset, aes(MonthQual, fill=CancellationCode)) + geom_bar()

Weather is the most common reason for a flight cancellation. The majority of cancellations are in November and December. Considering the timing of the cancellations, it appears that some cancellations related to the carrier could also be weather related.

2.5 How extreme can a flight delay be?

In this script, we’ll try to identify extreme cases of features for flight delays. For example, what do the top 1 in a million arrival delays (in minutes) look like?

We’ll be increasing in factors of 10. In other words, we’ll calculate the top 1/10, 1/100, 1/1,000, etc… all the way up to 1/1,000,000,000.

It’s important to note that these are “A given B” estimates. In other words, each value is the probability of a delay of a given length, given that a delay of that type has occurred. It is not the overall probability that any given flight will experience such a delay.

Setup and Functions

library(knitr)
flights <- read.csv("~/Desktop/Flight Delays 2008/DelayedFlights.csv")

# Build a table of thresholds for 1-in-10^k tail events (k = 1..9),
# computed as sample mean + t-quantile * sample standard deviation.
oneinlog <- function(tmpset, precision){ 
  tmean <- mean(tmpset)
  tsd <- sd(tmpset)
  tn <- length(tmpset)
  t1 <- tmean + (gettval(1:9, tn) * tsd)
  pdf <- as.data.frame(cbind(paste0("1 in 10^",1:9), round(t1, precision)))
  names(pdf) <- c("Probability", "Value (minutes)")
  pdf
}

# Upper-tail quantile of the t distribution for a 1-in-10^x probability.
gettval <- function(x, n)
{
  qt(1/10^x, df=(n-1), lower.tail=FALSE) 
}

# Summary sentence: row 1 of the table is the 1-in-10 threshold (top 10%),
# row 9 is the 1-in-a-billion threshold.
explainer <- function(label, tmpdf, tmpdata){
  paste0("The top 10% of flights have a ", label, " of over ", tmpdf[1,2], " minutes. And there is less than a 1 in a Billion chance of finding a flight with a ", label, " of over ", tmpdf[9,2], " minutes. <br><br>Mean is ", round(mean(tmpdata), 2), " minutes, with a Standard Deviation of ", round(sd(tmpdata), 2), " minutes.")
}


tmpArrDelay <- flights[!is.na(flights$ArrDelay),]$ArrDelay
tmpArrDelay <- tmpArrDelay[tmpArrDelay > 0]

tmpDepDelay <- flights[!is.na(flights$DepDelay),]$DepDelay
tmpDepDelay <- tmpDepDelay[tmpDepDelay > 0]

tmpWeatherDelay <- flights[!is.na(flights$WeatherDelay),]$WeatherDelay
tmpWeatherDelay <- tmpWeatherDelay[tmpWeatherDelay > 0]

tmpCarrierDelay <- flights[!is.na(flights$CarrierDelay),]$CarrierDelay
tmpCarrierDelay <- tmpCarrierDelay[tmpCarrierDelay > 0]

tmpNASDelay <- flights[!is.na(flights$NASDelay),]$NASDelay
tmpNASDelay <- tmpNASDelay[tmpNASDelay > 0]

tmpSecurityDelay <- flights[!is.na(flights$SecurityDelay),]$SecurityDelay
tmpSecurityDelay <- tmpSecurityDelay[tmpSecurityDelay > 0]

tmpLateAircraftDelay <- flights[!is.na(flights$LateAircraftDelay),]$LateAircraftDelay
tmpLateAircraftDelay <- tmpLateAircraftDelay[tmpLateAircraftDelay > 0]
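The thresholds produced by oneinlog come from a t-distribution approximation (mean plus a t-quantile times the standard deviation), which implicitly treats the delays as roughly symmetric. The sketch below compares that approximation with plain empirical quantiles on simulated right-skewed data. The exponential sample and the helper names `t_threshold` and `empirical_threshold` are assumptions for illustration, not part of the original analysis.

```r
set.seed(42)

# Simulated right-skewed "delays" in minutes (exponential, mean 45); invented data.
delays <- rexp(100000, rate = 1 / 45)

# Threshold for a 1-in-10^k tail under the t-based approximation.
t_threshold <- function(x, k) {
  mean(x) + qt(1 / 10^k, df = length(x) - 1, lower.tail = FALSE) * sd(x)
}

# Empirical counterpart: the value actually exceeded by 1 in 10^k observations.
empirical_threshold <- function(x, k) {
  unname(quantile(x, probs = 1 - 1 / 10^k))
}

comparison <- data.frame(
  Probability = paste0("1 in 10^", 1:3),
  TBased      = round(sapply(1:3, function(k) t_threshold(delays, k)), 1),
  Empirical   = round(sapply(1:3, function(k) empirical_threshold(delays, k)), 1)
)
print(comparison)
```

On skewed data like this, the t-based thresholds tend to understate the far tail. The flight data hints at the same effect: the summary shows a maximum ArrDelay of 2461 minutes, far above the 1 in 10^9 value reported for arrivals, so the tables that follow are best read as rough approximations.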

Extreme Delays for Arrivals

Chances of Arrival Delay of ____ minutes
Probability Value (minutes)
1 in 10^1 121.49
1 in 10^2 181.46
1 in 10^3 225.3
1 in 10^4 261.39
1 in 10^5 292.73
1 in 10^6 320.77
1 in 10^7 346.36
1 in 10^8 370.05
1 in 10^9 392.19
The top 10% of flights have an Arrival Delay of over 121.49 minutes. And there is less than a 1 in a Billion chance of finding a flight with an Arrival Delay of over 392.19 minutes.

Mean is 47.93 minutes, with a Standard Deviation of 57.4 minutes.

Extreme Delays for Departures

Chances of Departure Delay of ____ minutes
Probability Value (minutes)
1 in 10^1 111.62
1 in 10^2 167.42
1 in 10^3 208.21
1 in 10^4 241.79
1 in 10^5 270.94
1 in 10^6 297.03
1 in 10^7 320.84
1 in 10^8 342.88
1 in 10^9 363.48
The top 10% of flights have a Departure Delay of over 111.62 minutes. And there is less than a 1 in a Billion chance of finding a flight with a Departure Delay of over 363.48 minutes.

Mean is 43.19 minutes, with a Standard Deviation of 53.4 minutes.

Extreme Delays in Cases of Weather Delays

Chances of Weather Delay of ____ minutes
Probability Value (minutes)
1 in 10^1 125.89
1 in 10^2 190.44
1 in 10^3 237.64
1 in 10^4 276.49
1 in 10^5 310.22
1 in 10^6 340.41
1 in 10^7 367.96
1 in 10^8 393.47
1 in 10^9 417.31
The top 10% of flights have a Weather Delay of over 125.89 minutes. And there is less than a 1 in a Billion chance of finding a flight with a Weather Delay of over 417.31 minutes.

Mean is 46.71 minutes, with a Standard Deviation of 61.78 minutes.

Extreme Delays for Carrier Delays

Chances of Carrier Delay of ____ minutes
Probability Value (minutes)
1 in 10^1 106.34
1 in 10^2 163.32
1 in 10^3 204.98
1 in 10^4 239.27
1 in 10^5 269.04
1 in 10^6 295.69
1 in 10^7 320.01
1 in 10^8 342.51
1 in 10^9 363.55
The top 10% of flights have a Carrier Delay of over 106.34 minutes. And there is less than a 1 in a Billion chance of finding a flight with a Carrier Delay of over 363.55 minutes.

Mean is 36.45 minutes, with a Standard Deviation of 54.54 minutes.

Extreme Delays for NAS Delays

Chances of NAS Delay of ____ minutes
Probability Value (minutes)
1 in 10^1 82.62
1 in 10^2 126.78
1 in 10^3 159.06
1 in 10^4 185.64
1 in 10^5 208.71
1 in 10^6 229.35
1 in 10^7 248.2
1 in 10^8 265.64
1 in 10^9 281.95
The top 10% of flights have a NAS Delay of over 82.62 minutes. And there is less than a 1 in a Billion chance of finding a flight with a NAS Delay of over 281.95 minutes.

Mean is 28.46 minutes, with a Standard Deviation of 42.26 minutes.

Extreme Delays for Security Delays

Chances of Security Delay of ____ minutes
Probability Value (minutes)
1 in 10^1 47.42
1 in 10^2 70.81
1 in 10^3 87.93
1 in 10^4 102.03
1 in 10^5 114.27
1 in 10^6 125.23
1 in 10^7 135.25
1 in 10^8 144.52
1 in 10^9 153.19
The top 10% of flights have a Security Delay of over 47.42 minutes. And there is less than a 1 in a Billion chance of finding a flight with a Security Delay of over 153.19 minutes.

Mean is 18.73 minutes, with a Standard Deviation of 22.38 minutes.

Extreme Delays for Late Aircraft Delays

Chances of Late Aircraft Delay of ____ minutes
Probability Value (minutes)
1 in 10^1 NA
1 in 10^2 NA
1 in 10^3 NA
1 in 10^4 NA
1 in 10^5 NA
1 in 10^6 NA
1 in 10^7 NA
1 in 10^8 NA
1 in 10^9 NA
No Late Aircraft Delay values could be computed in this run. The NA results arise because the setup code that produced this output subset tmpSecurityDelay with the Late Aircraft Delay mask, which leaves NA entries in the vector, so its mean and standard deviation are NA as well. Subsetting tmpLateAircraftDelay by its own mask is needed to obtain these figures.

Multiple regression analysis

# Convert hhmm clock times to minutes after midnight (each column exactly once).
data$DepTime = floor(data$DepTime/100)*60 + (data$DepTime%%100)
data$CRSDepTime = floor(data$CRSDepTime/100)*60 + (data$CRSDepTime%%100)
data$ArrTime = floor(data$ArrTime/100)*60 + (data$ArrTime%%100)
data$CRSArrTime = floor(data$CRSArrTime/100)*60 + (data$CRSArrTime%%100)
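The hhmm-to-minutes conversion used for the time columns can be wrapped in a small helper so that every column goes through the same formula. A minimal sketch follows; `hhmm_to_minutes` is a hypothetical name introduced here, not a function from the original script.

```r
# Hypothetical helper: convert a clock time stored as hhmm
# (e.g. 1955 for 19:55) to minutes after midnight.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

# 0005 -> 5, 0754 -> 474, 1955 -> 1195, 2400 -> 1440
print(hhmm_to_minutes(c(5, 754, 1955, 2400)))
```

The conversion must be applied exactly once per column: feeding an already converted value such as 1195 back into the helper would treat it as 11:95 and return 755.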

(NROW(data) - NROW(data[complete.cases(data),]))/NROW(data)
## [1] 0.3558896
data <- data[complete.cases(data),]
sdata <- data[sample(NROW(data), 10000),]

ggplot(sdata, aes(x=CRSElapsedTime, y=ActualElapsedTime))+geom_point(alpha=0.3, colour="blue")+geom_abline(colour="red", intercept=0, slope=1)

var <- c("DepDelay", "Distance", "CRSDepTime", "CRSElapsedTime")
toReg <- sdata[,var]
reg <- lm(DepDelay~., data= toReg)
summary(reg)
## 
## Call:
## lm(formula = DepDelay ~ ., data = toReg)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -75.42 -35.21 -17.60  16.08 961.72 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    32.886144   2.953042  11.136  < 2e-16 ***
## Distance       -0.043635   0.005380  -8.110 5.65e-16 ***
## CRSDepTime      0.018898   0.003827   4.938 8.00e-07 ***
## CRSElapsedTime  0.371102   0.043276   8.575  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 58.42 on 9996 degrees of freedom
## Multiple R-squared:  0.01005,    Adjusted R-squared:  0.009754 
## F-statistic: 33.83 on 3 and 9996 DF,  p-value: < 2.2e-16

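Although all three coefficients are statistically significant, the model explains only about 1% of the variance in departure delay (multiple R-squared of 0.01). Distance, scheduled departure time and scheduled elapsed time therefore have almost no predictive value here, and the delay looks mostly random with respect to these variables.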
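A tiny p-value together with a tiny R-squared is common with large samples. The following sketch reproduces the pattern on simulated data (the variables `x` and `y` and their parameters are invented for illustration) and shows how to pull both numbers out of the `lm` summary.

```r
set.seed(1)

# Simulated data: the response depends only weakly on x relative to the noise.
n <- 10000
x <- runif(n, 0, 100)
y <- 40 + 0.2 * x + rnorm(n, sd = 55)

fit <- lm(y ~ x)
fit_summary <- summary(fit)

# p-value of the slope and share of variance explained by the model.
slope_p   <- coef(fit_summary)["x", "Pr(>|t|)"]
r_squared <- fit_summary$r.squared

print(c(slope_p = slope_p, r_squared = r_squared))
```

With 10000 observations even a weak effect is detected as significant, while R-squared stays close to zero because almost all of the variation in `y` is noise.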

Conclusions