This report attempts to identify a predictor of departure delays.
None of the variables tested showed a meaningful linear relationship.
FlightDate - Flight Date (m/d/yyyy in the raw file)
Carrier - Carrier code for the airline
TailNum - Tail Number
FlightNum - Flight Number
Origin - The airport code, PWM = Portland, BGR = Bangor
Dest - Destination airport code
DestCityName - Destination city name
DestState - Destination state
CRSDepTime - Scheduled departure time (local time: hhmm)
DepTime - Actual departure time (local time: hhmm)
WheelsOff - Wheels off time (local time: hhmm)
WheelsOn - Wheels on time (local time: hhmm)
CRSArrTime - Scheduled arrival time (local time: hhmm)
ArrTime - Actual arrival time (local time: hhmm)
Cancelled - Cancelled flight indicator (1 = yes)
Diverted - Diverted flight indicator (1 = yes)
CRSElapsedTime - Scheduled elapsed time of flight, in minutes
ActualElapsedTime - Actual elapsed time of flight, in minutes
Distance - Distance between airports, in miles
DepDelay - Difference in minutes between scheduled and actual departure time. Early departures show negative numbers.
DepDelayMinutes - Difference in minutes between scheduled and actual departure time. Early departures set to 0.
DepDel30 - Departure Delay Indicator, 30 Minutes or More (1=Yes)
TaxiOut - Taxi out time, in minutes
TaxiIn - Taxi in time, in minutes
ArrDelay - Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers.
ArrDelayMinutes - Difference in minutes between scheduled and actual arrival time. Early arrivals set to 0.
ArrDel30 - Arrival Delay Indicator, 30 Minutes or More (1=Yes)
FlightTimeBuffer - The difference between scheduled elapsed time and actual elapsed time.
AirTime - The time spent in the air.
AirSpeed - Average speed of the plane in flight (mph)
flights <- read.csv("domestic_flights_jan_2016.csv", header = TRUE, stringsAsFactors = FALSE)
library(tidyr)
library(dplyr)
library(knitr)
library(ggvis)
library(xtable)
library(utils)
library(pander)
library(chron)
library(broom)
Field Headings
names(flights)
## [1] "FlightDate" "Carrier" "TailNum"
## [4] "FlightNum" "Origin" "OriginCityName"
## [7] "OriginState" "Dest" "DestCityName"
## [10] "DestState" "CRSDepTime" "DepTime"
## [13] "WheelsOff" "WheelsOn" "CRSArrTime"
## [16] "ArrTime" "Cancelled" "Diverted"
## [19] "CRSElapsedTime" "ActualElapsedTime" "Distance"
Structure
str(flights)
## 'data.frame': 445827 obs. of 21 variables:
## $ FlightDate : chr "1/6/2016" "1/7/2016" "1/8/2016" "1/9/2016" ...
## $ Carrier : chr "AA" "AA" "AA" "AA" ...
## $ TailNum : chr "N4YBAA" "N434AA" "N541AA" "N489AA" ...
## $ FlightNum : int 43 43 43 43 43 43 43 43 43 43 ...
## $ Origin : chr "DFW" "DFW" "DFW" "DFW" ...
## $ OriginCityName : chr "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" ...
## $ OriginState : chr "TX" "TX" "TX" "TX" ...
## $ Dest : chr "DTW" "DTW" "DTW" "DTW" ...
## $ DestCityName : chr "Detroit, MI" "Detroit, MI" "Detroit, MI" "Detroit, MI" ...
## $ DestState : chr "MI" "MI" "MI" "MI" ...
## $ CRSDepTime : int 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 ...
## $ DepTime : int 1057 1056 1055 1102 1240 1107 1059 1055 1058 1056 ...
## $ WheelsOff : int 1112 1110 1116 1115 1300 1118 1113 1107 1110 1110 ...
## $ WheelsOn : int 1424 1416 1431 1424 1617 1426 1429 1419 1420 1423 ...
## $ CRSArrTime : int 1438 1438 1438 1438 1438 1438 1438 1438 1438 1438 ...
## $ ArrTime : int 1432 1426 1445 1433 1631 1435 1438 1431 1428 1434 ...
## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CRSElapsedTime : int 158 158 158 158 158 158 158 158 158 158 ...
## $ ActualElapsedTime: int 155 150 170 151 171 148 159 156 150 158 ...
## $ Distance : int 986 986 986 986 986 986 986 986 986 986 ...
Quick Check for records with Duplicate Tail Numbers.
pandoc.table(head(flights %>% count(length(unique(TailNum))), style = "grid"))
##
## --------------------------------
## length(unique(TailNum)) n
## ------------------------- ------
## 4239 445827
## --------------------------------
This confirms that tail numbers repeat across records.
There are 445827 records but only 4239 unique tail numbers.
Unfortunately, this means TailNum cannot serve as a unique key, so I cannot join previously munged data with the new destination city and state fields.
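The record count and the number of distinct tail numbers can also be computed separately, which makes the amount of repetition explicit. The following is a minimal base R sketch on a made-up data frame; the tail numbers are hypothetical and are not taken from the flights file.

```r
# Toy data: hypothetical tail numbers, not taken from the flights file
toy <- data.frame(TailNum = c("N1AA", "N1AA", "N2BB", "N3CC", "N3CC", "N3CC"),
                  stringsAsFactors = FALSE)

n_records <- nrow(toy)                     # total rows
n_unique  <- length(unique(toy$TailNum))   # distinct tail numbers
n_repeat  <- sum(duplicated(toy$TailNum))  # rows that repeat an earlier tail number

# Tail numbers that occur more than once, with their frequencies
tail_counts <- table(toy$TailNum)
repeated <- tail_counts[tail_counts > 1]
```

If `n_repeat` is greater than zero, the column cannot be used on its own as a join key.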
Check the Number of Cancelled Flights
pandoc.table(flights %>% count(Cancelled == 1), style = 'grid')
##
##
## +------------------+--------+
## | Cancelled == 1 | n |
## +==================+========+
## | FALSE | 434162 |
## +------------------+--------+
## | TRUE | 11665 |
## +------------------+--------+
There are 11665 cancelled flights
Because this study is only interested in completed flights, the records for cancelled flights will be removed.
flights <- flights %>% filter(Cancelled == 0)
Re-check the Number of Cancelled Flights
pandoc.table(flights %>% count(Cancelled == 1), style = 'grid')
##
##
## +------------------+--------+
## | Cancelled == 1 | n |
## +==================+========+
## | FALSE | 434162 |
## +------------------+--------+
All cancelled flights have successfully been removed
Check the Number of Incomplete Cases
pandoc.table(flights %>% count(!complete.cases(.)), style = "grid")
##
##
## +----------------------+--------+
## | !complete.cases(.) | n |
## +======================+========+
## | FALSE | 433298 |
## +----------------------+--------+
## | TRUE | 864 |
## +----------------------+--------+
There are 864 incomplete cases
Sample the incomplete cases
head(flights %>% filter(!complete.cases(.)))
## FlightDate Carrier TailNum FlightNum Origin OriginCityName
## 1 1/15/2016 AA N3ALAA 56 DEN Denver, CO
## 2 1/15/2016 AA N3GUAA 208 SFO San Francisco, CA
## 3 1/10/2016 AA N3BVAA 210 LAS Las Vegas, NV
## 4 1/15/2016 AA N3HFAA 217 LAS Las Vegas, NV
## 5 1/10/2016 AA N796AA 34 LAX Los Angeles, CA
## 6 1/22/2016 AA N480AA 248 DFW Dallas/Fort Worth, TX
## OriginState Dest DestCityName DestState CRSDepTime DepTime WheelsOff
## 1 CO MIA Miami, FL FL 1045 1042 1059
## 2 CA MIA Miami, FL FL 640 638 656
## 3 NV JFK New York, NY NY 820 818 835
## 4 NV MIA Miami, FL FL 2359 2356 15
## 5 CA JFK New York, NY NY 800 759 813
## 6 TX BNA Nashville, TN TN 700 655 704
## WheelsOn CRSArrTime ArrTime Cancelled Diverted CRSElapsedTime
## 1 1852 1639 1902 0 1 234
## 2 1719 1458 1728 0 1 318
## 3 1852 1615 1923 0 1 295
## 4 1630 725 1642 0 1 266
## 5 1906 1629 1923 0 1 329
## 6 730 847 736 0 1 107
## ActualElapsedTime Distance
## 1 NA 1709
## 2 NA 2585
## 3 NA 2248
## 4 NA 2174
## 5 NA 2475
## 6 NA 631
It appears the incomplete cases are records with NA values for ActualElapsedTime; every sampled row is also a diverted flight (Diverted = 1).
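Rather than reading the sample by eye, the columns that contain missing values can be counted directly. This is a small base R sketch on a made-up data frame (the values are illustrative only):

```r
# Toy data with one missing elapsed time
toy <- data.frame(Dest = c("MIA", "JFK", "BNA"),
                  ActualElapsedTime = c(155L, NA, 107L),
                  stringsAsFactors = FALSE)

n_incomplete  <- sum(!complete.cases(toy))  # rows with at least one NA
na_per_column <- colSums(is.na(toy))        # number of NAs in each column
```

A column whose entry in `na_per_column` equals `n_incomplete` accounts for every incomplete row.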
Check the Number of Records with ActualElapsedTime == “NA”
pandoc.table(flights %>% count(ActualElapsedTime == "NA"), style = "grid")
##
##
## +-----------------------------+--------+
## | ActualElapsedTime == "NA" | n |
## +=============================+========+
## | FALSE | 433298 |
## +-----------------------------+--------+
## | NA | 864 |
## +-----------------------------+--------+
There are 864 Records with ActualElapsedTime entered as “NA”
Because this study is only interested in flights with ActualElapsedTime data, the records with NA values for ActualElapsedTime will be removed.
flights <- flights %>% filter(!is.na(ActualElapsedTime))
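The comparison with the string "NA" in the count above returns NA rather than TRUE for missing values, which is why that table shows an NA row instead of a TRUE row. The removal still works because dplyr's filter() drops rows where the condition evaluates to NA, but is.na() is the direct test. A short sketch on a made-up vector:

```r
x <- c(155L, NA, 170L)   # hypothetical elapsed times with one missing value

cmp <- x == "NA"         # FALSE NA FALSE: never TRUE for a missing value
mis <- is.na(x)          # FALSE TRUE FALSE: the direct test for missing values

kept <- x[!is.na(x)]     # keeps 155 and 170
```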
Check the Number of Incomplete Cases
pandoc.table(flights %>% count(!complete.cases(.)), style = "grid")
##
##
## +----------------------+--------+
## | !complete.cases(.) | n |
## +======================+========+
## | FALSE | 433298 |
## +----------------------+--------+
All records now have complete data
Data Summary
summary(flights, 6)
## FlightDate Carrier TailNum FlightNum
## Length:433298 Length:433298 Length:433298 Min. : 1
## Class :character Class :character Class :character 1st Qu.: 701
## Mode :character Mode :character Mode :character Median :1591
## Mean :2077
## 3rd Qu.:2762
## Max. :7438
## Origin OriginCityName OriginState
## Length:433298 Length:433298 Length:433298
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Dest DestCityName DestState CRSDepTime
## Length:433298 Length:433298 Length:433298 Min. : 1
## Class :character Class :character Class :character 1st Qu.: 920
## Mode :character Mode :character Mode :character Median :1325
## Mean :1330
## 3rd Qu.:1730
## Max. :2359
## DepTime WheelsOff WheelsOn CRSArrTime
## Min. : 1 Min. : 1 Min. : 1 Min. : 1
## 1st Qu.: 924 1st Qu.: 939 1st Qu.:1104 1st Qu.:1118
## Median :1331 Median :1344 Median :1519 Median :1527
## Mean :1334 Mean :1357 Mean :1483 Mean :1503
## 3rd Qu.:1737 3rd Qu.:1750 3rd Qu.:1914 3rd Qu.:1920
## Max. :2400 Max. :2400 Max. :2400 Max. :2359
## ArrTime Cancelled Diverted CRSElapsedTime ActualElapsedTime
## Min. : 1 Min. :0 Min. :0 Min. : 21.0 Min. : 15.0
## 1st Qu.:1108 1st Qu.:0 1st Qu.:0 1st Qu.: 90.0 1st Qu.: 85.0
## Median :1522 Median :0 Median :0 Median :128.0 Median :122.0
## Mean :1488 Mean :0 Mean :0 Mean :146.4 Mean :140.1
## 3rd Qu.:1919 3rd Qu.:0 3rd Qu.:0 3rd Qu.:180.0 3rd Qu.:173.0
## Max. :2400 Max. :0 Max. :0 Max. :705.0 Max. :721.0
## Distance
## Min. : 31.0
## 1st Qu.: 391.0
## Median : 679.0
## Mean : 843.8
## 3rd Qu.:1089.0
## Max. :4983.0
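Check ArrTime > DepTime to find arrival times that appear to occur before departure times. Because both fields are local clock times, a FALSE here can also mean an overnight flight or a time-zone change rather than bad data.
#Checking for negative flight times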
pandoc.table(flights %>% count(ArrTime > DepTime), style = "grid")
##
##
## +---------------------+--------+
## | ArrTime > DepTime | n |
## +=====================+========+
## | FALSE | 16172 |
## +---------------------+--------+
## | TRUE | 417126 |
## +---------------------+--------+
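One likely reason for many of the FALSE rows is a flight that crosses midnight: the clock time of the later event is smaller than the departure time even though it happened afterwards. If clock times are later combined with the flight date, such events would need to be moved to the following day. The sketch below shows one way to do that, assuming that any event earlier than the departure belongs to the next calendar day; `roll_over_midnight` is a hypothetical helper, the timestamps are made up, and time-zone differences between airports are ignored.

```r
# Hypothetical helper: if an event's timestamp is earlier than the departure's,
# assume it happened on the following calendar day.
roll_over_midnight <- function(event, departure) {
  next_day <- !is.na(event) & !is.na(departure) & event < departure
  event[next_day] <- event[next_day] + 24 * 60 * 60   # POSIXct counts seconds
  event
}

# A 23:56 departure followed by a 00:15 wheels-off, both stamped with the same date
dep <- as.POSIXct("2016-01-15 23:56", format = "%Y-%m-%d %H:%M", tz = "UTC")
off <- as.POSIXct("2016-01-15 00:15", format = "%Y-%m-%d %H:%M", tz = "UTC")

fixed    <- roll_over_midnight(off, dep)
taxi_out <- as.integer(difftime(fixed, dep, units = "mins"))
```

Without the adjustment, the same pair of timestamps would give a taxi-out time of roughly minus one day.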
Convert the FlightDate field to a Date class
#This code chunk was taken directly from the lecture notes.
#Converts FlightDate to a Date class
flights$FlightDate <- as.Date(flights$FlightDate, format = "%m/%d/%Y")
Using the sprintf() function to add leading zeros to allow creation of date/time variables
#This code chunk was taken directly from the lecture notes.
#Add leading zeros to allow creation of date/time variables, then paste the Date information from FlightDate
flights <- flights %>%
mutate(new_CRSDepTime = paste(FlightDate, sprintf("%04d", CRSDepTime)))
flights$new_CRSDepTime <- as.POSIXct(flights$new_CRSDepTime, format="%Y-%m-%d %H%M")
pandoc.table(head(flights %>% select(CRSDepTime, new_CRSDepTime), style = "grid"))
##
## --------------------------------
## CRSDepTime new_CRSDepTime
## ------------ -------------------
## 1100 2016-01-06 11:00:00
##
## 1100 2016-01-07 11:00:00
##
## 1100 2016-01-08 11:00:00
##
## 1100 2016-01-09 11:00:00
##
## 1100 2016-01-10 11:00:00
##
## 1100 2016-01-11 11:00:00
## --------------------------------
The sprintf() function is repeated for: DepTime, WheelsOff, WheelsOn, CRSArrTime, and ArrTime
Note: It is not necessary to apply the sprintf() function to CRSElapsedTime and ActualElapsedTime. They are both stored as minutes.
#DepTime
flights <- flights %>%
mutate(new_DepTime = paste(FlightDate, sprintf("%04d", DepTime)))
flights$new_DepTime <- as.POSIXct(flights$new_DepTime, format="%Y-%m-%d %H%M")
pandoc.table(head(flights %>% select(DepTime, new_DepTime), style = "grid"))
##
## -----------------------------
## DepTime new_DepTime
## --------- -------------------
## 1057 2016-01-06 10:57:00
##
## 1056 2016-01-07 10:56:00
##
## 1055 2016-01-08 10:55:00
##
## 1102 2016-01-09 11:02:00
##
## 1240 2016-01-10 12:40:00
##
## 1107 2016-01-11 11:07:00
## -----------------------------
#WheelsOff
flights <- flights %>%
mutate(new_WheelsOff = paste(FlightDate, sprintf("%04d", WheelsOff)))
flights$new_WheelsOff <- as.POSIXct(flights$new_WheelsOff, format="%Y-%m-%d %H%M")
pandoc.table(head(flights %>% select(WheelsOff, new_WheelsOff), style = "grid"))
##
## -------------------------------
## WheelsOff new_WheelsOff
## ----------- -------------------
## 1112 2016-01-06 11:12:00
##
## 1110 2016-01-07 11:10:00
##
## 1116 2016-01-08 11:16:00
##
## 1115 2016-01-09 11:15:00
##
## 1300 2016-01-10 13:00:00
##
## 1118 2016-01-11 11:18:00
## -------------------------------
#WheelsOn
flights <- flights %>%
mutate(new_WheelsOn = paste(FlightDate, sprintf("%04d", WheelsOn)))
flights$new_WheelsOn <- as.POSIXct(flights$new_WheelsOn, format="%Y-%m-%d %H%M")
pandoc.table(head(flights %>% select(WheelsOn, new_WheelsOn), style = "grid"))
##
## ------------------------------
## WheelsOn new_WheelsOn
## ---------- -------------------
## 1424 2016-01-06 14:24:00
##
## 1416 2016-01-07 14:16:00
##
## 1431 2016-01-08 14:31:00
##
## 1424 2016-01-09 14:24:00
##
## 1617 2016-01-10 16:17:00
##
## 1426 2016-01-11 14:26:00
## ------------------------------
#CRSArrTime
flights <- flights %>%
mutate(new_CRSArrTime = paste(FlightDate, sprintf("%04d", CRSArrTime)))
flights$new_CRSArrTime <- as.POSIXct(flights$new_CRSArrTime, format="%Y-%m-%d %H%M")
pandoc.table(head(flights %>% select(CRSArrTime, new_CRSArrTime), style = "grid"))
##
## --------------------------------
## CRSArrTime new_CRSArrTime
## ------------ -------------------
## 1438 2016-01-06 14:38:00
##
## 1438 2016-01-07 14:38:00
##
## 1438 2016-01-08 14:38:00
##
## 1438 2016-01-09 14:38:00
##
## 1438 2016-01-10 14:38:00
##
## 1438 2016-01-11 14:38:00
## --------------------------------
#ArrTime
flights <- flights %>%
mutate(new_ArrTime = paste(FlightDate, sprintf("%04d", ArrTime)))
flights$new_ArrTime <- as.POSIXct(flights$new_ArrTime, format="%Y-%m-%d %H%M")
pandoc.table(head(flights %>% select(ArrTime, new_ArrTime), style = "grid"))
##
## -----------------------------
## ArrTime new_ArrTime
## --------- -------------------
## 1432 2016-01-06 14:32:00
##
## 1426 2016-01-07 14:26:00
##
## 1445 2016-01-08 14:45:00
##
## 1433 2016-01-09 14:33:00
##
## 1631 2016-01-10 16:31:00
##
## 1435 2016-01-11 14:35:00
## -----------------------------
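The six conversions above repeat the same pattern, so they could be wrapped in one function and applied in a loop. This is a sketch in base R, not the code used for the results shown; `hhmm_to_datetime` is a hypothetical helper, the toy data frame is made up, and the fixed `tz = "UTC"` is an assumption (the report's code uses the session's local time zone).

```r
# Hypothetical helper: combine a Date with an integer hhmm clock time
hhmm_to_datetime <- function(date, hhmm, tz = "UTC") {
  stamp <- paste(format(date, "%Y-%m-%d"), sprintf("%04d", hhmm))
  as.POSIXct(stamp, format = "%Y-%m-%d %H%M", tz = tz)
}

toy <- data.frame(FlightDate = as.Date(c("2016-01-06", "2016-01-07")),
                  CRSDepTime = c(1100L, 640L),
                  DepTime    = c(1057L, 638L))

# Apply the same conversion to every clock-time column
for (col in c("CRSDepTime", "DepTime")) {
  toy[[paste0("new_", col)]] <- hhmm_to_datetime(toy$FlightDate, toy[[col]])
}
```

Extending the vector of column names to WheelsOff, WheelsOn, CRSArrTime, and ArrTime would cover the remaining fields.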
In this section, new fields are created using the following calculations:
DepDelay = new_DepTime - new_CRSDepTime
DepDelayMinutes = ifelse(DepDelay < 0, 0, DepDelay)
DepDel30 = ifelse(DepDelay >= 30, 1, 0)
TaxiOut = new_WheelsOff - new_DepTime
TaxiIn = new_ArrTime - new_WheelsOn
ArrDelay = new_ArrTime - new_CRSArrTime
ArrDelayMinutes = ifelse(ArrDelay < 0, 0, ArrDelay)
ArrDel30 = ifelse(ArrDelay >= 30, 1, 0)
FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime
AirTime = ActualElapsedTime - TaxiOut - TaxiIn
AirSpeed = Distance / (AirTime / 60)
Using the difftime() function to calculate the difference between two date/time objects.
The as.integer() function is used to store the result as an integer.
flights <- flights %>% mutate(DepDelay = as.integer(difftime(new_DepTime, new_CRSDepTime, units = "mins")))
pandoc.table(head(flights %>% select(CRSDepTime, DepTime, DepDelay), style = "grid"))
##
## ---------------------------------
## CRSDepTime DepTime DepDelay
## ------------ --------- ----------
## 1100 1057 -3
##
## 1100 1056 -4
##
## 1100 1055 -5
##
## 1100 1102 2
##
## 1100 1240 100
##
## 1100 1107 7
## ---------------------------------
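The role of `units = "mins"` is easiest to see on a single pair of timestamps. The sketch below reproduces the 11:00 scheduled and 12:40 actual departure from the table; the `tz = "UTC"` argument is an assumption made so the example does not depend on the local time zone.

```r
sched  <- as.POSIXct("2016-01-10 11:00", format = "%Y-%m-%d %H:%M", tz = "UTC")
actual <- as.POSIXct("2016-01-10 12:40", format = "%Y-%m-%d %H:%M", tz = "UTC")

d <- difftime(actual, sched, units = "mins")   # a difftime object, in minutes
dep_delay <- as.integer(d)                     # plain integer number of minutes

# Without units = "mins", difftime() picks a unit automatically (hours here)
auto_units <- units(difftime(actual, sched))
```

Fixing the unit matters because as.integer() drops the unit label, so an automatically chosen unit would silently mix hours and minutes in one column.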
Using the ifelse() function to derive two fields from DepDelay.
DepDelayMinutes holds the delay in minutes with early departures set to 0, and DepDel30 flags delays of 30 minutes or more (1 = yes).
flights <- flights %>% mutate(DepDelayMinutes = ifelse(DepDelay < 0, 0, DepDelay),
DepDel30 = ifelse(DepDelay >= 30, 1, 0))
pandoc.table(head(flights %>% select(DepDelay, DepDelayMinutes, DepDel30), style = "grid"))
##
## ---------------------------------------
## DepDelay DepDelayMinutes DepDel30
## ---------- ----------------- ----------
## -3 0 0
##
## -4 0 0
##
## -5 0 0
##
## 2 2 0
##
## 100 100 1
##
## 7 7 0
## ---------------------------------------
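The same two fields can be written without ifelse(). As an alternative sketch in base R, pmax() clamps negative delays to zero and a logical comparison converted with as.integer() gives the indicator. The input vector repeats the six DepDelay values shown in the table.

```r
dep_delay <- c(-3L, -4L, -5L, 2L, 100L, 7L)   # DepDelay values from the table

dep_delay_minutes <- pmax(dep_delay, 0L)      # early departures set to 0
dep_del30 <- as.integer(dep_delay >= 30)      # 1 when the delay is 30 minutes or more
```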
Check the number of flights with a delay of 30 minutes or more.
pandoc.table(flights %>% count(DepDel30 == 1), style = 'grid')
##
##
## +-----------------+--------+
## | DepDel30 == 1 | n |
## +=================+========+
## | FALSE | 391307 |
## +-----------------+--------+
## | TRUE | 41991 |
## +-----------------+--------+
There are 41991 records for flights with a delay of 30 minutes or more
Calculating the inbound and outbound taxi times in minutes.
#TaxiOut
#TaxiIn
flights <- flights %>% mutate(TaxiOut = as.integer(difftime(new_WheelsOff, new_DepTime, units = "mins")),
TaxiIn = as.integer(difftime(new_ArrTime, new_WheelsOn, units = "mins")))
pandoc.table(head(flights %>% select(TaxiIn, TaxiOut), style = "grid"))
##
## ------------------
## TaxiIn TaxiOut
## -------- ---------
## 8 15
##
## 10 14
##
## 14 21
##
## 9 13
##
## 14 20
##
## 9 11
## ------------------
Calculating the difference in minutes between scheduled and actual arrival time.
Early arrivals show negative numbers.
#ArrDelay
flights <- flights %>% mutate(ArrDelay = as.integer(difftime(new_ArrTime, new_CRSArrTime, units = "mins")))
pandoc.table(head(flights %>% select(ArrDelay), style = "grid"))
##
## ----------
## ArrDelay
## ----------
## -6
##
## -12
##
## 7
##
## -5
##
## 113
##
## -3
## ----------
Calculating the difference in minutes between scheduled and actual arrival time with early arrivals set to 0.
Creating an Arrival Delay Indicator, 30 Minutes or More (1=Yes)
#ArrDelayMinutes
#ArrDel30
flights <- flights %>% mutate( ArrDelayMinutes = ifelse(ArrDelay < 0, 0, ArrDelay),
ArrDel30 = ifelse(ArrDelay >= 30, 1, 0))
pandoc.table(head(flights %>% select(ArrDelayMinutes, ArrDel30), style = "grid"))
##
## ----------------------------
## ArrDelayMinutes ArrDel30
## ----------------- ----------
## 0 0
##
## 0 0
##
## 7 0
##
## 0 0
##
## 113 1
##
## 0 0
## ----------------------------
FlightTimeBuffer: The difference between scheduled elapsed time and actual elapsed time.
AirTime: The time spent in the air.
AirSpeed: Average speed of the plane in flight (mph)
#FlightTimeBuffer
flights <- flights %>% mutate(FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime)
#AirTime
flights <- flights %>% mutate(AirTime = ActualElapsedTime - TaxiOut - TaxiIn)
#AirSpeed
flights <- flights %>% mutate(AirSpeed = Distance / (AirTime / 60))
pandoc.table(head(flights %>% select(FlightTimeBuffer, AirTime, AirSpeed), style = "grid"))
##
## ---------------------------------------
## FlightTimeBuffer AirTime AirSpeed
## ------------------ --------- ----------
## 3 132 448.2
##
## 8 126 469.5
##
## -12 135 438.2
##
## 7 129 458.6
##
## -13 137 431.8
##
## 10 128 462.2
## ---------------------------------------
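As a worked check of the three formulas, the first row of the data can be recomputed by hand. The inputs below are the values printed earlier for that record (scheduled and actual elapsed time, taxi times, and distance).

```r
crs_elapsed    <- 158   # scheduled elapsed time, minutes
actual_elapsed <- 155   # actual elapsed time, minutes
taxi_out <- 15
taxi_in  <- 8
distance <- 986         # miles

buffer    <- crs_elapsed - actual_elapsed          # minutes of schedule padding left over
air_time  <- actual_elapsed - taxi_out - taxi_in   # minutes in the air
air_speed <- distance / (air_time / 60)            # miles per hour
```

Dividing AirTime by 60 converts minutes to hours, so the result is in miles per hour. An AirTime of zero would produce an infinite speed, which is one more reason to inspect the taxi and air times before modelling.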
Compare Taxi Time, Air Time, and Air Speed for departure delays.
Query Delayed Flights for Analysis
# Create the new data frame: delays
# Query flights with 30 minute or more departure delays
delays <- flights %>% select(FlightDate, Carrier, Origin, OriginState, Dest, TaxiOut, TaxiIn, AirTime, AirSpeed, DestState, DepDelay, DepDel30) %>%
filter(DepDel30 == 1) %>% arrange(FlightDate)
kable(head(delays[1:9]), align = 'c')
| FlightDate | Carrier | Origin | OriginState | Dest | TaxiOut | TaxiIn | AirTime | AirSpeed |
|---|---|---|---|---|---|---|---|---|
| 2016-01-01 | AA | LAX | CA | DCA | 17 | 4 | 253 | 548.0632 |
| 2016-01-01 | AA | DFW | TX | MIA | 16 | 4 | 124 | 542.4194 |
| 2016-01-01 | AA | MIA | FL | DFW | 12 | 8 | 159 | 423.0189 |
| 2016-01-01 | AA | MCO | FL | JFK | 18 | 5 | 117 | 484.1026 |
| 2016-01-01 | AA | SLC | UT | DFW | 10 | 5 | 122 | 486.3934 |
| 2016-01-01 | AA | DFW | TX | SLC | 11 | 5 | 137 | 433.1387 |
After reviewing the data queries, I realized some flights have zero or negative taxi times and will need to be removed.
Records with air times greater than 600 minutes will also be removed.
negTaxiOut <- flights %>% filter(TaxiOut <= 0)
nrow(negTaxiOut)
## [1] 1088
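# Note: as written, this filter tests TaxiOut again rather than TaxiIn, so the count below repeats the TaxiOut count.
negTaxiIn <- flights %>% filter(TaxiOut <= 0)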
nrow(negTaxiIn)
## [1] 1088
longAirTime <- flights %>% filter(AirTime > 600)
nrow(longAirTime)
## [1] 2512
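Removing the flagged records from the delays data frame: rows with TaxiOut <= 0 and rows with AirTime of 600 minutes or more. Across all flights, the checks above found 1088 records with TaxiOut <= 0 and 2512 records with AirTime greater than 600 minutes.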
delays <- delays %>% filter(TaxiOut > 0) %>% filter(TaxiOut > 0) %>% filter(AirTime < 600)
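The filter above applies TaxiOut > 0 twice and never tests TaxiIn. A minimal sketch of a filter that covers both taxi columns is shown below in base R on a made-up data frame; the values are illustrative and are not taken from the flights file.

```r
# Toy data: one clean row, one bad TaxiOut, one bad TaxiIn, one long AirTime
toy <- data.frame(TaxiOut = c(15, -1425, 12, 20),
                  TaxiIn  = c(8, 6, -1431, 5),
                  AirTime = c(132, 1600, 1550, 650))

# Keep a row only when both taxi times are positive and the air time is plausible
keep  <- toy$TaxiOut > 0 & toy$TaxiIn > 0 & toy$AirTime < 600
clean <- toy[keep, ]
```

Applying a filter like this to delays would change the row count, so the regression results that follow would need to be re-run.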
With the suspicious data removed, I can revisit my regression models.
The plot comparisons look much better than before.
# Plot matrix of all variables.
plot(delays %>% select(TaxiOut, TaxiIn, AirTime),
pch=9, col="blue", main="Matrix Scatterplot of TaxiOut, TaxiIn, and AirTime")
Linear regression models of departure delay (DepDelay) as a function of:
Taxi Time and Air Time
#TaxiOut
TaxiOut.lm1 <- lm(DepDelay ~TaxiOut, data = delays)
#TaxiIn
TaxiIn.lm1 <- lm(DepDelay ~TaxiIn, data = delays)
#AirTime
AirTime.lm1 <- lm(DepDelay ~AirTime, data = delays)
Create a data frame to view the results.
Predictor <- c("TaxiOut", "TaxiIn", "AirTime")
rSquared <- c(glance(TaxiOut.lm1)$r.squared,
glance(TaxiIn.lm1)$r.squared,
glance(AirTime.lm1)$r.squared)
pValue <- c(glance(TaxiOut.lm1)$p.value,
glance(TaxiIn.lm1)$p.value,
glance(AirTime.lm1)$p.value)
FitTable <- data.frame(Predictor, rSquared, pValue)
pandoc.table(FitTable, style = "grid")
##
##
## +-------------+------------+-----------+
## | Predictor | rSquared | pValue |
## +=============+============+===========+
## | TaxiOut | 0.003515 | 1.529e-33 |
## +-------------+------------+-----------+
## | TaxiIn | 0.001346 | 8.29e-14 |
## +-------------+------------+-----------+
## | AirTime | 3.318e-05 | 0.2413 |
## +-------------+------------+-----------+
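The R^2 values are all below 0.004, so none of these variables explains a meaningful share of the variation in departure delay. The very small p-values for TaxiOut and TaxiIn reflect the large sample size rather than a useful fit.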
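The same table can be built with a loop instead of one glance() call per model. The sketch below uses base R's summary() on simulated data; `fit_one` is a hypothetical helper, and the simulated delays are generated independently of the predictors, so they say nothing about the real flights. For a model with a single predictor, the slope's p-value equals the overall model p-value.

```r
set.seed(1)
toy <- data.frame(TaxiOut = runif(200, 5, 40),
                  TaxiIn  = runif(200, 2, 20))
toy$DepDelay <- 30 + rexp(200, rate = 1 / 40)   # delays unrelated to the predictors

# Hypothetical helper: fit DepDelay ~ predictor and return one row of fit statistics
fit_one <- function(predictor, data) {
  fit <- lm(reformulate(predictor, response = "DepDelay"), data = data)
  s <- summary(fit)
  data.frame(Predictor = predictor,
             rSquared  = s$r.squared,
             pValue    = s$coefficients[2, "Pr(>|t|)"],
             stringsAsFactors = FALSE)
}

fit_table <- do.call(rbind, lapply(c("TaxiOut", "TaxiIn"), fit_one, data = toy))
```

Adding a predictor then only requires extending the vector of names passed to lapply().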
We can plot the TaxiOut time and clearly see it is not a good predictor of Departure Delays.
delays %>%
ggvis(~TaxiOut, ~DepDelay) %>%
layer_points(fill := "blue", size := 10) %>%
layer_model_predictions(model = "lm", se = TRUE, stroke := "red")
Next, I will model air speed as a function of departure delay.
But first I am going to restrict the data to departure delays under 200 minutes.
delays2 <- delays %>% filter(DepDelay < 200)
Linear Regression Model
#AirSpeed
AirSpeed.lm2 <- lm(AirSpeed ~DepDelay, data = delays2)
Create a data frame to view the results.
Predictor <- c("AirSpeed")
rSquared <- c(glance(AirSpeed.lm2)$r.squared)
pValue <- c(glance(AirSpeed.lm2)$p.value)
FitTable2 <- data.frame(Predictor, rSquared, pValue)
pandoc.table(FitTable2, style = "grid")
##
##
## +-------------+------------+----------+
## | Predictor | rSquared | pValue |
## +=============+============+==========+
## | AirSpeed | 0.001414 | 8.66e-14 |
## +-------------+------------+----------+
The R^2 value again indicates a very poor fit.
Plotting AirSpeed against DepDelay shows that departure delay is not a good predictor of air speed.
delays2 %>%
ggvis(~DepDelay, ~AirSpeed) %>%
layer_points(fill := "blue", size := 10) %>%
layer_model_predictions(model = "lm", se = TRUE, stroke := "red")
Still not a good fit.