To answer the questions above, we analyze the provided dataset, which contains 1,936,758 domestic US flights from 2008 together with their causes of delay, diversion and cancellation.
The data comes from the U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS).
Metadata explanations
This dataset is composed of the following variables:
Year: 2008
Month: 1-12
DayofMonth: 1-31
DayOfWeek: 1 (Monday) - 7 (Sunday)
DepTime: actual departure time (local, hhmm)
CRSDepTime: scheduled departure time (local, hhmm)
ArrTime: actual arrival time (local, hhmm)
CRSArrTime: scheduled arrival time (local, hhmm)
UniqueCarrier: unique carrier code
FlightNum: flight number
TailNum: plane tail number (aircraft registration, unique aircraft identifier)
ActualElapsedTime: in minutes
CRSElapsedTime: in minutes
AirTime: in minutes
ArrDelay: arrival delay, in minutes. A flight is counted as “on time” if it operated less than 15 minutes later than the scheduled time shown in the carriers’ Computerized Reservations Systems (CRS).
DepDelay: departure delay, in minutes
Origin: origin IATA airport code
Dest: destination IATA airport code
Distance: in miles
TaxiIn: taxi in time, in minutes
TaxiOut: taxi out time, in minutes
Cancelled: was the flight cancelled (1 = yes, 0 = no)
CancellationCode: reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
Diverted: 1 = yes, 0 = no
CarrierDelay: in minutes. Carrier delay is within the control of the air carrier. Examples of occurrences that may determine carrier delay are: aircraft cleaning, aircraft damage, awaiting the arrival of connecting passengers or crew, baggage, bird strike, cargo loading, catering, computer, outage-carrier equipment, crew legality (pilot or attendant rest), damage by hazardous goods, engineering inspection, fueling, handling disabled passengers, late crew, lavatory servicing, maintenance, oversales, potable water servicing, removal of unruly passenger, slow boarding or seating, stowing carry-on baggage, weight and balance delays.
WeatherDelay: in minutes. Weather delay is caused by extreme or hazardous weather conditions that are forecasted or manifest themselves on point of departure, enroute, or on point of arrival.
NASDelay: in minutes. Delay that is within the control of the National Airspace System (NAS) may include: non-extreme weather conditions, airport operations, heavy traffic volume, air traffic control, etc.
SecurityDelay: in minutes. Security delay is caused by evacuation of a terminal or concourse, re-boarding of aircraft because of security breach, inoperative screening equipment and/or long lines in excess of 29 minutes at screening areas.
LateAircraftDelay: in minutes. Arrival delay at an airport due to the late arrival of the same aircraft at a previous airport. The ripple effect of an earlier delay at downstream airports is referred to as delay propagation.
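The “on time” rule in the ArrDelay definition can be sketched in a few lines. This is only an illustration: the delay values below are made up rather than taken from the dataset, and `arr_delay` and `on_time` are hypothetical names.

```r
# Toy arrival delays in minutes (made-up values, not from the dataset)
arr_delay <- c(-5, 0, 14, 15, 42)

# Rule quoted in the ArrDelay definition: a flight is "on time" when it
# arrives less than 15 minutes after the scheduled time
on_time <- arr_delay < 15

data.frame(ArrDelay = arr_delay, OnTime = on_time)
```

Under this rule a flight that arrives 14 minutes late still counts as on time, while one arriving exactly 15 minutes late does not.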
Load Packages
library(tidyr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(scales) # Percentage calculation
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
data <- read_csv("~/Desktop/Flight Delays 2008/DelayedFlights.csv") # Data input
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## X1 = col_integer(),
## Year = col_integer(),
## Month = col_integer(),
## DayofMonth = col_integer(),
## DayOfWeek = col_integer(),
## CRSDepTime = col_integer(),
## CRSArrTime = col_integer(),
## UniqueCarrier = col_character(),
## FlightNum = col_integer(),
## TailNum = col_character(),
## Origin = col_character(),
## Dest = col_character(),
## Distance = col_integer(),
## Cancelled = col_integer(),
## CancellationCode = col_character(),
## Diverted = col_integer()
## )
## See spec(...) for full column specifications.
summary(data)
## X1 Year Month DayofMonth
## Min. : 0 Min. :2008 Min. : 1.000 Min. : 1.00
## 1st Qu.:1517452 1st Qu.:2008 1st Qu.: 3.000 1st Qu.: 8.00
## Median :3242558 Median :2008 Median : 6.000 Median :16.00
## Mean :3341651 Mean :2008 Mean : 6.111 Mean :15.75
## 3rd Qu.:4972467 3rd Qu.:2008 3rd Qu.: 9.000 3rd Qu.:23.00
## Max. :7009727 Max. :2008 Max. :12.000 Max. :31.00
##
## DayOfWeek DepTime CRSDepTime ArrTime
## Min. :1.000 Min. : 1 Min. : 0 Min. : 1
## 1st Qu.:2.000 1st Qu.:1203 1st Qu.:1135 1st Qu.:1316
## Median :4.000 Median :1545 Median :1510 Median :1715
## Mean :3.985 Mean :1519 Mean :1467 Mean :1610
## 3rd Qu.:6.000 3rd Qu.:1900 3rd Qu.:1815 3rd Qu.:2030
## Max. :7.000 Max. :2400 Max. :2359 Max. :2400
## NA's :7110
## CRSArrTime UniqueCarrier FlightNum TailNum
## Min. : 0 Length:1936758 Min. : 1 Length:1936758
## 1st Qu.:1325 Class :character 1st Qu.: 610 Class :character
## Median :1705 Mode :character Median :1543 Mode :character
## Mean :1634 Mean :2184
## 3rd Qu.:2014 3rd Qu.:3422
## Max. :2400 Max. :9742
##
## ActualElapsedTime CRSElapsedTime AirTime ArrDelay
## Min. : 14.0 Min. :-25.0 Min. : 0.0 Min. :-109.0
## 1st Qu.: 80.0 1st Qu.: 82.0 1st Qu.: 58.0 1st Qu.: 9.0
## Median : 116.0 Median :116.0 Median : 90.0 Median : 24.0
## Mean : 133.3 Mean :134.3 Mean : 108.3 Mean : 42.2
## 3rd Qu.: 165.0 3rd Qu.:165.0 3rd Qu.: 137.0 3rd Qu.: 56.0
## Max. :1114.0 Max. :660.0 Max. :1091.0 Max. :2461.0
## NA's :8387 NA's :198 NA's :8387 NA's :8387
## DepDelay Origin Dest Distance
## Min. : 6.00 Length:1936758 Length:1936758 Min. : 11.0
## 1st Qu.: 12.00 Class :character Class :character 1st Qu.: 338.0
## Median : 24.00 Mode :character Mode :character Median : 606.0
## Mean : 43.19 Mean : 765.7
## 3rd Qu.: 53.00 3rd Qu.: 998.0
## Max. :2467.00 Max. :4962.0
##
## TaxiIn TaxiOut Cancelled CancellationCode
## Min. : 0.000 Min. : 0.00 Min. :0.0000000 Length:1936758
## 1st Qu.: 4.000 1st Qu.: 10.00 1st Qu.:0.0000000 Class :character
## Median : 6.000 Median : 14.00 Median :0.0000000 Mode :character
## Mean : 6.813 Mean : 18.23 Mean :0.0003268
## 3rd Qu.: 8.000 3rd Qu.: 21.00 3rd Qu.:0.0000000
## Max. :240.000 Max. :422.00 Max. :1.0000000
## NA's :7110 NA's :455
## Diverted CarrierDelay WeatherDelay NASDelay
## Min. :0.000000 Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.:0.000000 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0
## Median :0.000000 Median : 2.0 Median : 0.0 Median : 2
## Mean :0.004004 Mean : 19.2 Mean : 3.7 Mean : 15
## 3rd Qu.:0.000000 3rd Qu.: 21.0 3rd Qu.: 0.0 3rd Qu.: 15
## Max. :1.000000 Max. :2436.0 Max. :1352.0 Max. :1357
## NA's :689270 NA's :689270 NA's :689270
## SecurityDelay LateAircraftDelay
## Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.0 Median : 8.0
## Mean : 0.1 Mean : 25.3
## 3rd Qu.: 0.0 3rd Qu.: 33.0
## Max. :392.0 Max. :1316.0
## NA's :689270 NA's :689270
head(data)
## # A tibble: 6 x 30
## X1 Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime
## <int> <int> <int> <int> <int> <dbl> <int> <dbl>
## 1 0 2008 1 3 4 2003 1955 2211
## 2 1 2008 1 3 4 754 735 1002
## 3 2 2008 1 3 4 628 620 804
## 4 4 2008 1 3 4 1829 1755 1959
## 5 5 2008 1 3 4 1940 1915 2121
## 6 6 2008 1 3 4 1937 1830 2037
## # ... with 22 more variables: CRSArrTime <int>, UniqueCarrier <chr>,
## # FlightNum <int>, TailNum <chr>, ActualElapsedTime <dbl>,
## # CRSElapsedTime <dbl>, AirTime <dbl>, ArrDelay <dbl>, DepDelay <dbl>,
## # Origin <chr>, Dest <chr>, Distance <int>, TaxiIn <dbl>, TaxiOut <dbl>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>,
## # CarrierDelay <dbl>, WeatherDelay <dbl>, NASDelay <dbl>,
## # SecurityDelay <dbl>, LateAircraftDelay <dbl>
Data about what caused a delay is only available when the arrival delay is 15 minutes or more. In those cases, ArrDelay is the sum of CarrierDelay, WeatherDelay, NASDelay, SecurityDelay and LateAircraftDelay. More often than not, airports and carriers schedule a CRSElapsedTime that is longer than the actual time spent on taxi out, air time and taxi in (ActualElapsedTime). This is why a plane that takes off on time usually lands before its scheduled arrival time. The padding also helps absorb the delay that a late aircraft passes down a chain of flights.
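The two relationships described above can be checked on a few toy rows. This is a sketch under assumptions: the values are invented, the column names follow the dataset, and `CauseTotal`, `CausesMatch` and `SchedulePadding` are hypothetical helper columns. It also assumes that all five cause columns, SecurityDelay included, add up to ArrDelay.

```r
# Toy rows (made-up values) that reuse the dataset's column names
toy <- data.frame(
  ArrDelay          = c(34, 57, 20),
  CarrierDelay      = c(2, 10, 0),
  WeatherDelay      = c(0, 0, 0),
  NASDelay          = c(0, 0, 20),
  SecurityDelay     = c(0, 0, 0),
  LateAircraftDelay = c(32, 47, 0),
  CRSElapsedTime    = c(150, 145, 90),
  ActualElapsedTime = c(128, 128, 96)
)

cause_cols <- c("CarrierDelay", "WeatherDelay", "NASDelay",
                "SecurityDelay", "LateAircraftDelay")

# Sum the cause columns per flight and compare with the arrival delay
toy$CauseTotal  <- rowSums(toy[, cause_cols])
toy$CausesMatch <- toy$CauseTotal == toy$ArrDelay

# Positive padding means the schedule allowed more time than the flight used
toy$SchedulePadding <- toy$CRSElapsedTime - toy$ActualElapsedTime

toy
```

On the real data the same comparison would have to be restricted to rows where the cause columns are not NA.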
data$WeekDay <- data$DayOfWeek
data$MonthQual <- data$Month
# Define levels
data$UniqueCarrier <- factor(data$UniqueCarrier)
data$Year <- factor(data$Year)
data$Month <- factor(data$Month)
data$Origin <- factor(data$Origin)
data$Dest <- factor(data$Dest)
# Define numbers by actual days and months.
data$WeekDay[data$WeekDay == 1] <- 'Monday'
data$WeekDay[data$WeekDay == 2] <- 'Tuesday'
data$WeekDay[data$WeekDay == 3] <- 'Wednesday'
data$WeekDay[data$WeekDay == 4] <- 'Thursday'
data$WeekDay[data$WeekDay == 5] <- 'Friday'
data$WeekDay[data$WeekDay == 6] <- 'Saturday'
data$WeekDay[data$WeekDay == 7] <- 'Sunday'
data$MonthQual[data$MonthQual == 1] <- 'January'
data$MonthQual[data$MonthQual == 2] <- 'February'
data$MonthQual[data$MonthQual == 3] <- 'March'
data$MonthQual[data$MonthQual == 4] <- 'April'
data$MonthQual[data$MonthQual == 5] <- 'May'
data$MonthQual[data$MonthQual == 6] <- 'June'
data$MonthQual[data$MonthQual == 7] <- 'July'
data$MonthQual[data$MonthQual == 8] <- 'August'
data$MonthQual[data$MonthQual == 9] <- 'September'
data$MonthQual[data$MonthQual == 10] <- 'October'
data$MonthQual[data$MonthQual == 11] <- 'November'
data$MonthQual[data$MonthQual == 12] <- 'December'
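The same recoding can be written more compactly with lookup vectors. This is a sketch of an alternative, not the code used for the results in this analysis. It uses R's built-in `month.name` constant, and `day_names`, `day_of_week` and `month_num` are hypothetical names with made-up values.

```r
# Day names in the dataset's order: 1 = Monday ... 7 = Sunday
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# Toy codes standing in for data$DayOfWeek and data$Month
day_of_week <- c(4L, 1L, 7L, 5L)
month_num   <- c(1L, 12L, 7L)

# Index the lookup vectors by the integer codes
week_day   <- day_names[day_of_week]
month_qual <- month.name[month_num]   # month.name ships with base R

week_day
month_qual
```

On the real data this would read `data$WeekDay <- day_names[data$DayOfWeek]`, assuming DayOfWeek is still an integer column at that point.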
ggplot(data, aes(Month, fill = UniqueCarrier)) +
geom_bar(width=0.8, position="dodge",color="black")+
labs(x = "Month", y = "Count",title = "Flight Delay Counts by Airline Carriers", fill = "Airline Carriers")+
theme(legend.text = element_text(colour="blue", size=10,
face="bold"))
The most frequent delays come from WN (Southwest Airlines) and AA (American Airlines).
data %>% group_by(WeekDay) %>%
tally %>% arrange(desc(n))
## # A tibble: 7 x 2
## WeekDay n
## <chr> <int>
## 1 Friday 323259
## 2 Monday 290933
## 3 Thursday 289451
## 4 Sunday 286111
## 5 Wednesday 262805
## 6 Tuesday 260943
## 7 Saturday 223256
The most common day for flight delays is Friday, which isn’t surprising as it’s also one of the most common days to fly.
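To put the weekday counts on a common scale, they can be turned into shares of all delayed flights. The sketch below copies the counts from the table above; `weekday_counts` and `share` are hypothetical names. It only describes how the delayed flights in this file split across weekdays. A true delay rate per weekday would need the total number of flights flown each day, which this file does not contain.

```r
# Counts copied from the tally above
weekday_counts <- c(Friday = 323259, Monday = 290933, Thursday = 289451,
                    Sunday = 286111, Wednesday = 262805, Tuesday = 260943,
                    Saturday = 223256)

total <- sum(weekday_counts)

# Share of all delayed flights per weekday, in percent
share <- round(100 * weekday_counts / total, 1)

share
```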
data %>% group_by(MonthQual) %>%
tally %>% arrange(desc(n))
## # A tibble: 12 x 2
## MonthQual n
## <chr> <int>
## 1 December 203385
## 2 June 200914
## 3 March 200842
## 4 February 189534
## 5 January 183527
## 6 July 182945
## 7 August 162648
## 8 April 155264
## 9 May 153072
## 10 November 105563
## 11 October 103525
## 12 September 95539
The most common month for flight delays is December.
ggplot(data, aes(x=DayofMonth, fill=DayofMonth))+geom_bar()
The most common date in a month for flight delays is the 22nd.
data$Cancelled[data$Cancelled == 0] = 'No'
data$Cancelled[data$Cancelled == 1] = 'Yes'
qplot(factor(Cancelled), data=data, geom="bar", fill=factor(Cancelled))
# Flight Cancelation comparisons
data %>% group_by(Cancelled) %>%
tally %>% arrange(desc(n))
## # A tibble: 2 x 2
## Cancelled n
## <chr> <int>
## 1 No 1936125
## 2 Yes 633
percent(633/1936758)
## [1] "0.0327%"
Cancellation is very unlikely: only about 0.0327% of the flights in this dataset (633 of 1,936,758) were cancelled.
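The cancellation share can be derived directly from the two counts in the tally, which avoids retyping the denominator by hand. This is a small sketch using base R only; `cancelled_counts` and `rate` are hypothetical names.

```r
# Counts copied from the tally above
cancelled_counts <- c(No = 1936125, Yes = 633)

# Share of cancelled flights among all flights in the file
rate <- cancelled_counts[["Yes"]] / sum(cancelled_counts)

# Format as a percentage with four decimal places
sprintf("%.4f%%", 100 * rate)
```

On the full data frame, `mean(data$Cancelled == 1)` would give the same ratio before the column is recoded to 'Yes'/'No'.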
data$CancellationCode[data$CancellationCode == 'A'] = 'Carrier'
data$CancellationCode[data$CancellationCode == 'B'] = 'Weather'
data$CancellationCode[data$CancellationCode == 'C'] = 'NAS'
data$CancellationCode[data$CancellationCode == 'D'] = 'Security'
data %>% filter(CancellationCode != 'N') %>%
group_by(CancellationCode) %>%
tally %>%
arrange(desc(n))
## # A tibble: 3 x 2
## CancellationCode n
## <chr> <int>
## 1 Weather 307
## 2 Carrier 246
## 3 NAS 80
# Weather is the biggest reason
CancelledSubset = subset(data, CancellationCode != 'N')
#ggplot(diamonds, aes(color, fill=cut)) + geom_bar() + coord_flip()
ggplot(CancelledSubset,aes(MonthQual,fill=CancellationCode)) + geom_bar()
#qplot(factor(MonthQual),data=CancelledSubset)
Weather is the most common reason for a flight cancellation. The majority of cancellations are in November and December. Considering the timing of the cancellations, it appears that some cancellations related to the carrier could also be weather related.
In this script, we’ll try to identify extreme cases of features for flight delays. For example, what do the top 1 in a million arrival delays (in minutes) look like?
We’ll be increasing in factors of 10. In other words, we’ll calculate the top 1/10, 1/100, 1/1,000, etc… all the way up to 1/1,000,000,000.
It’s important to note that these are “A given B” estimates. In other words, it’s the probability of the ____ delay length, given that such a ____ delay has occurred. It is not the overall probability that any given flight will experience such a delay.
Setup and Functions
library(knitr)
flights <- read.csv("~/Desktop/Flight Delays 2008/DelayedFlights.csv")
oneinlog <- function(tmpset, precision){
tmean <- mean(tmpset)
tsd <- sd(tmpset)
tn <- length(tmpset)
t1 <- tmean + (gettval(1:9, tn) * tsd)
pdf <- as.data.frame(cbind(paste0("1 in 10^",1:9), round(t1, precision)))
names(pdf) <- c("Probability", "Value (minutes)")
pdf
}
gettval <- function(x, n)
{
qt(1/10^x, df=(n-1), lower.tail=FALSE)
}
explainer <- function(label, tmpdf, tmpdata){
paste0("The top 10% of flights have a ", label, " of over ", tmpdf[1,2], " minutes. And there is less than a 1 in a billion chance of finding a flight with a ", label, " of over ", tmpdf[9,2], " minutes. <br><br>Mean is ", round(mean(tmpdata), 2), " minutes, with a Standard Deviation of ", round(sd(tmpdata), 2), " minutes.")
}
tmpArrDelay <- flights[!is.na(flights$ArrDelay),]$ArrDelay
tmpArrDelay <- tmpArrDelay[tmpArrDelay > 0]
tmpDepDelay <- flights[!is.na(flights$DepDelay),]$DepDelay
tmpDepDelay <- tmpDepDelay[tmpDepDelay > 0]
tmpWeatherDelay <- flights[!is.na(flights$WeatherDelay),]$WeatherDelay
tmpWeatherDelay <- tmpWeatherDelay[tmpWeatherDelay > 0]
tmpCarrierDelay <- flights[!is.na(flights$CarrierDelay),]$CarrierDelay
tmpCarrierDelay <- tmpCarrierDelay[tmpCarrierDelay > 0]
tmpNASDelay <- flights[!is.na(flights$NASDelay),]$NASDelay
tmpNASDelay <- tmpNASDelay[tmpNASDelay > 0]
tmpSecurityDelay <- flights[!is.na(flights$SecurityDelay),]$SecurityDelay
tmpSecurityDelay <- tmpSecurityDelay[tmpSecurityDelay > 0]
tmpLateAircraftDelay <- flights[!is.na(flights$LateAircraftDelay),]$LateAircraftDelay
tmpLateAircraftDelay <- tmpLateAircraftDelay[tmpLateAircraftDelay > 0]

ArrDelay:

| Probability | Value (minutes) |
|---|---|
| 1 in 10^1 | 121.49 |
| 1 in 10^2 | 181.46 |
| 1 in 10^3 | 225.3 |
| 1 in 10^4 | 261.39 |
| 1 in 10^5 | 292.73 |
| 1 in 10^6 | 320.77 |
| 1 in 10^7 | 346.36 |
| 1 in 10^8 | 370.05 |
| 1 in 10^9 | 392.19 |

DepDelay:

| Probability | Value (minutes) |
|---|---|
| 1 in 10^1 | 111.62 |
| 1 in 10^2 | 167.42 |
| 1 in 10^3 | 208.21 |
| 1 in 10^4 | 241.79 |
| 1 in 10^5 | 270.94 |
| 1 in 10^6 | 297.03 |
| 1 in 10^7 | 320.84 |
| 1 in 10^8 | 342.88 |
| 1 in 10^9 | 363.48 |

WeatherDelay:

| Probability | Value (minutes) |
|---|---|
| 1 in 10^1 | 125.89 |
| 1 in 10^2 | 190.44 |
| 1 in 10^3 | 237.64 |
| 1 in 10^4 | 276.49 |
| 1 in 10^5 | 310.22 |
| 1 in 10^6 | 340.41 |
| 1 in 10^7 | 367.96 |
| 1 in 10^8 | 393.47 |
| 1 in 10^9 | 417.31 |

CarrierDelay:

| Probability | Value (minutes) |
|---|---|
| 1 in 10^1 | 106.34 |
| 1 in 10^2 | 163.32 |
| 1 in 10^3 | 204.98 |
| 1 in 10^4 | 239.27 |
| 1 in 10^5 | 269.04 |
| 1 in 10^6 | 295.69 |
| 1 in 10^7 | 320.01 |
| 1 in 10^8 | 342.51 |
| 1 in 10^9 | 363.55 |

NASDelay:

| Probability | Value (minutes) |
|---|---|
| 1 in 10^1 | 82.62 |
| 1 in 10^2 | 126.78 |
| 1 in 10^3 | 159.06 |
| 1 in 10^4 | 185.64 |
| 1 in 10^5 | 208.71 |
| 1 in 10^6 | 229.35 |
| 1 in 10^7 | 248.2 |
| 1 in 10^8 | 265.64 |
| 1 in 10^9 | 281.95 |

SecurityDelay:

| Probability | Value (minutes) |
|---|---|
| 1 in 10^1 | 47.42 |
| 1 in 10^2 | 70.81 |
| 1 in 10^3 | 87.93 |
| 1 in 10^4 | 102.03 |
| 1 in 10^5 | 114.27 |
| 1 in 10^6 | 125.23 |
| 1 in 10^7 | 135.25 |
| 1 in 10^8 | 144.52 |
| 1 in 10^9 | 153.19 |

LateAircraftDelay:

| Probability | Value (minutes) |
|---|---|
| 1 in 10^1 | NA |
| 1 in 10^2 | NA |
| 1 in 10^3 | NA |
| 1 in 10^4 | NA |
| 1 in 10^5 | NA |
| 1 in 10^6 | NA |
| 1 in 10^7 | NA |
| 1 in 10^8 | NA |
| 1 in 10^9 | NA |
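
The `oneinlog` function defined earlier computes each threshold as the mean plus a t-distribution quantile times the standard deviation, which implicitly treats delays as roughly bell-shaped. Delay data is usually right-skewed, so it may be worth comparing that approximation with empirical quantiles. The sketch below does this on synthetic exponential data, not on the flight data; `delays`, `t_threshold` and `emp_threshold` are hypothetical names.

```r
set.seed(1)

# Synthetic right-skewed "delays" in minutes (an assumption, not real data)
delays <- rexp(100000, rate = 1 / 40)

k <- 1:3  # 1 in 10, 1 in 100, 1 in 1,000

# Same formula as oneinlog: mean + t quantile * standard deviation
t_threshold <- mean(delays) +
  qt(1 / 10^k, df = length(delays) - 1, lower.tail = FALSE) * sd(delays)

# Empirical counterpart: the observed upper quantiles of the sample
emp_threshold <- quantile(delays, probs = 1 - 1 / 10^k, names = FALSE)

data.frame(Probability = paste0("1 in 10^", k),
           TApprox     = round(t_threshold, 2),
           Empirical   = round(emp_threshold, 2))
```

For skewed data like this, the approximation tends to understate the rarer thresholds, so the larger exponents in the tables above are best read as rough indications.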
# Convert hhmm clock times to minutes after midnight (each column exactly once)
data$DepTime = floor(data$DepTime/100)*60 + (data$DepTime%%100)
data$CRSDepTime = floor(data$CRSDepTime/100)*60 + (data$CRSDepTime%%100)
data$ArrTime = floor(data$ArrTime/100)*60 + (data$ArrTime%%100)
data$CRSArrTime = floor(data$CRSArrTime/100)*60 + (data$CRSArrTime%%100)
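The hhmm-to-minutes conversion can be wrapped in a small helper so that every time column goes through the same formula. This is a minimal sketch; `hhmm_to_minutes` is a hypothetical name that does not appear in the original analysis.

```r
# Convert clock times stored as hhmm numbers (e.g. 1955 for 19:55)
# into minutes after midnight
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + (hhmm %% 100)
}

# 00:05, 07:54, 19:55 and 24:00
hhmm_to_minutes(c(5, 754, 1955, 2400))

# The conversion is not idempotent: a column must be converted only once,
# because a second pass reinterprets the minutes as hhmm
hhmm_to_minutes(hhmm_to_minutes(1955))
```

With such a helper, each of the four time columns would be converted with one call, for example `data$DepTime <- hhmm_to_minutes(data$DepTime)`.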
(NROW(data) - NROW(data[complete.cases(data),]))/NROW(data)
## [1] 0.3558896
data <- data[complete.cases(data),]
sdata <- data[sample(NROW(data), 10000),]
ggplot(sdata, aes(x=CRSElapsedTime, y=ActualElapsedTime))+geom_point(alpha=0.3, colour="blue")+geom_abline(colour="red", intercept=0, slope=1)
var <- c("DepDelay", "Distance", "CRSDepTime", "CRSElapsedTime")
toReg <- sdata[,var]
reg <- lm(DepDelay~., data= toReg)
summary(reg)
##
## Call:
## lm(formula = DepDelay ~ ., data = toReg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75.42 -35.21 -17.60 16.08 961.72
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.886144 2.953042 11.136 < 2e-16 ***
## Distance -0.043635 0.005380 -8.110 5.65e-16 ***
## CRSDepTime 0.018898 0.003827 4.938 8.00e-07 ***
## CRSElapsedTime 0.371102 0.043276 8.575 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.42 on 9996 degrees of freedom
## Multiple R-squared: 0.01005, Adjusted R-squared: 0.009754
## F-statistic: 33.83 on 3 and 9996 DF, p-value: < 2.2e-16
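Although every coefficient is statistically significant, the model explains only about 1% of the variance in departure delay (R-squared of 0.01). Distance, scheduled departure time and scheduled elapsed time therefore have almost no predictive power here: with respect to these variables, the delay is mostly random.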
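The regression summary above combines very small p-values with an R-squared near 0.01. A large sample can make even a weak relationship statistically significant while it still explains almost none of the variance. The sketch below reproduces that pattern on synthetic data, not on the flight data; `x`, `y` and `fit` are hypothetical names and the slope of 0.1 is an arbitrary assumption.

```r
set.seed(42)

n <- 10000
x <- rnorm(n)
# Weak signal buried in noise: the slope is small relative to the noise sd
y <- 0.1 * x + rnorm(n)

fit <- lm(y ~ x)
fit_summary <- summary(fit)

# p-value of the slope and share of variance explained
p_value   <- coef(fit_summary)["x", "Pr(>|t|)"]
r_squared <- fit_summary$r.squared

c(p_value = p_value, r_squared = r_squared)
```

Significance stars answer whether an effect is distinguishable from zero, while R-squared answers how much of the variation the predictors account for.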