Predicting Arrival Delay for Selected US Airports-Q1 2018
Predicting Arrival Delay for Selected US Airports-Q1 2018
- Part-1 Data loading and processing
- More info on dataset
- Missing values
- Converting the date in year-month-date format
- Generating day of week column
- Filtering on ORD and ATL destination airports
- Removing negative arrival and departure delay
- Frequency of origin airports
- Binning origin airports using case statement
- Joining the two datasets to add a new origin frequency column
- Adding month column
- Adding levels to categorical columns
- Selecting final columns
- Part-2 Data summarization and preparation
- Part-3 Unique Carrier delays
- Part-4 Data Imputation
- Part-5 Inferential Regression
- Part-6 Predictive Modeling
Part-1 Data loading and processing
master_data <- read_excel("~/master_data.xlsx")More info on dataset
df_status(master_data)## variable q_zeros p_zeros q_na p_na q_inf p_inf type
## 1 fl_date 0 0.00 0 0.00 0 0 character
## 2 op_unique_carrier 0 0.00 0 0.00 0 0 character
## 3 tail_num 0 0.00 397 0.24 0 0 character
## 4 op_carrier_fl_num 0 0.00 0 0.00 0 0 numeric
## 5 origin 0 0.00 0 0.00 0 0 character
## 6 dest 0 0.00 0 0.00 0 0 character
## 7 dep_time 0 0.00 3597 2.16 0 0 numeric
## 8 dep_delay 7494 4.51 3769 2.27 0 0 numeric
## 9 taxi_out 0 0.00 3662 2.20 0 0 numeric
## 10 taxi_in 0 0.00 3674 2.21 0 0 numeric
## 11 arr_time 0 0.00 3674 2.21 0 0 numeric
## 12 arr_delay 2711 1.63 3943 2.37 0 0 numeric
## 13 cancelled 162527 97.79 0 0.00 0 0 numeric
## 14 actual_elapsed_time 0 0.00 3849 2.32 0 0 numeric
## 15 air_time 0 0.00 3849 2.32 0 0 numeric
## 16 distance 0 0.00 0 0.00 0 0 numeric
## 17 carrier_delay 15073 9.07 140602 84.60 0 0 numeric
## 18 weather_delay 23853 14.35 140602 84.60 0 0 numeric
## 19 nas_delay 10806 6.50 140602 84.60 0 0 numeric
## 20 security_delay 25539 15.37 140602 84.60 0 0 numeric
## 21 late_aircraft_delay 12707 7.65 140602 84.60 0 0 numeric
## unique
## 1 89
## 2 17
## 3 4712
## 4 3818
## 5 195
## 6 3
## 7 1297
## 8 698
## 9 152
## 10 109
## 11 1406
## 12 734
## 13 2
## 14 434
## 15 413
## 16 287
## 17 558
## 18 290
## 19 321
## 20 32
## 21 393
Missing values
plot_missing(master_data)Last 5 variables has over 85% missing values which should be removed as per the plot
Converting the date in year-month-date format
master_data <- master_data%>%mutate(fl_date=ymd(fl_date))Generating day of week column
master_data <- master_data%>%mutate(dow=wday(fl_date,label=T))
ggplot(master_data,aes(x=master_data$dow))+geom_histogram(stat='count')## Warning: Ignoring unknown parameters: binwidth, bins, pad
Filtering on ORD and ATL destination airports
md_pred <- master_data%>%filter(dest %in% c("ORD", "ATL"))We create a new dataframe
Removing negative arrival and departure delay
md_pred <- md_pred%>%filter(arr_delay> 0 & dep_delay>0)
dim(md_pred)## [1] 31820 22
As negative values in both arrival and departure delay means before time arrival and departure, thus removing these values will make more sense. After removal of these values, only 31820 rows are left
Frequency of origin airports
origin <- md_pred%>%group_by(origin)%>%summarise(cnt=n())%>%arrange(desc(cnt))
kable(head(origin,194),'html') %>%
kable_styling() %>%
scroll_box(width = "400px", height = "700px")| origin | cnt |
|---|---|
| LGA | 1195 |
| DFW | 784 |
| MCO | 673 |
| LAX | 637 |
| DTW | 627 |
| MSP | 597 |
| MIA | 581 |
| BOS | 578 |
| FLL | 569 |
| DCA | 560 |
| DEN | 526 |
| EWR | 520 |
| CLT | 516 |
| IAH | 511 |
| TPA | 457 |
| PHL | 453 |
| BWI | 444 |
| RDU | 439 |
| SFO | 434 |
| CLE | 418 |
| ORD | 408 |
| CMH | 407 |
| IND | 405 |
| PHX | 380 |
| STL | 368 |
| MSY | 361 |
| BNA | 360 |
| ATL | 354 |
| CVG | 348 |
| SLC | 335 |
| LAS | 327 |
| PBI | 313 |
| MCI | 296 |
| GRR | 289 |
| AUS | 274 |
| JFK | 271 |
| JAX | 264 |
| MKE | 263 |
| PIT | 258 |
| RIC | 256 |
| RSW | 239 |
| MDW | 236 |
| SEA | 236 |
| MEM | 223 |
| FWA | 219 |
| DAL | 207 |
| HOU | 207 |
| SAT | 202 |
| BHM | 200 |
| SDF | 198 |
| AGS | 192 |
| CID | 192 |
| DSM | 192 |
| IAD | 192 |
| XNA | 190 |
| DAY | 185 |
| ORF | 185 |
| SAN | 181 |
| SBN | 181 |
| ASE | 178 |
| CHA | 176 |
| AVL | 170 |
| GSP | 170 |
| BUF | 165 |
| MGM | 165 |
| MSN | 163 |
| OMA | 159 |
| SGF | 159 |
| SYR | 157 |
| ICT | 154 |
| CHS | 153 |
| EYW | 152 |
| BDL | 147 |
| GNV | 144 |
| OKC | 139 |
| TUL | 139 |
| CHO | 135 |
| MOB | 135 |
| FAR | 134 |
| ALB | 131 |
| CAE | 130 |
| TYS | 130 |
| MLI | 129 |
| LIT | 128 |
| ROA | 128 |
| BMI | 121 |
| BTR | 121 |
| PIA | 121 |
| RST | 121 |
| LEX | 120 |
| SAV | 115 |
| FAY | 112 |
| MDT | 109 |
| ROC | 109 |
| GRB | 105 |
| HSV | 103 |
| HPN | 100 |
| ILM | 100 |
| SNA | 100 |
| SRQ | 99 |
| FSD | 97 |
| CRW | 96 |
| MLU | 96 |
| GSO | 95 |
| SHV | 95 |
| CAK | 94 |
| FNT | 94 |
| DHN | 90 |
| TRI | 89 |
| EVV | 88 |
| TLH | 87 |
| CMI | 83 |
| SJU | 81 |
| TVC | 78 |
| ABE | 76 |
| LSE | 76 |
| PNS | 76 |
| BZN | 72 |
| COS | 69 |
| PDX | 66 |
| ABY | 65 |
| CSG | 65 |
| MBS | 65 |
| SJC | 65 |
| ATW | 64 |
| ELP | 64 |
| PVD | 64 |
| SPI | 64 |
| VLD | 64 |
| EGE | 63 |
| JAN | 63 |
| TUS | 63 |
| VPS | 62 |
| BTV | 60 |
| DAB | 60 |
| ECP | 60 |
| AEX | 59 |
| LFT | 57 |
| DBQ | 53 |
| MLB | 51 |
| PHF | 51 |
| HDN | 50 |
| CWA | 47 |
| SMF | 47 |
| ABQ | 46 |
| BOI | 46 |
| COU | 44 |
| DLH | 44 |
| MYR | 44 |
| GTR | 43 |
| OAJ | 43 |
| TOL | 43 |
| CMX | 40 |
| GPT | 39 |
| UIN | 39 |
| PSP | 38 |
| SUX | 38 |
| PAH | 37 |
| ALO | 35 |
| BQK | 35 |
| JAC | 34 |
| MKG | 33 |
| AZO | 32 |
| EAU | 32 |
| AVP | 30 |
| FSM | 30 |
| LAN | 30 |
| RNO | 29 |
| MEI | 27 |
| EWN | 26 |
| MHK | 26 |
| BIS | 24 |
| MQT | 23 |
| HNL | 21 |
| LNK | 21 |
| MHT | 21 |
| OGG | 21 |
| PWM | 21 |
| CKB | 19 |
| OAK | 18 |
| RAP | 16 |
| CGI | 15 |
| STX | 9 |
| GEG | 8 |
| STT | 7 |
| MTJ | 6 |
| ANC | 5 |
| FCA | 5 |
| GRK | 5 |
| TTN | 4 |
| TWF | 3 |
| ELM | 1 |
| SUN | 1 |
Binning origin airports using case statement
origin <- origin%>%filter(cnt>=10)%>%mutate(origin_freq=case_when(cnt >=175 & cnt<300 ~ 'Low' ,
cnt>=300 & cnt<500 ~ 'Moderate',
cnt>=500 ~ 'High',
TRUE ~ 'Very Low'))We create a new column based on the frequency of the flights from the origin airport
kable(head(origin,25),'html') %>%
kable_styling() %>%
scroll_box(width = "400px", height = "700px")| origin | cnt | origin_freq |
|---|---|---|
| LGA | 1195 | High |
| DFW | 784 | High |
| MCO | 673 | High |
| LAX | 637 | High |
| DTW | 627 | High |
| MSP | 597 | High |
| MIA | 581 | High |
| BOS | 578 | High |
| FLL | 569 | High |
| DCA | 560 | High |
| DEN | 526 | High |
| EWR | 520 | High |
| CLT | 516 | High |
| IAH | 511 | High |
| TPA | 457 | Moderate |
| PHL | 453 | Moderate |
| BWI | 444 | Moderate |
| RDU | 439 | Moderate |
| SFO | 434 | Moderate |
| CLE | 418 | Moderate |
| ORD | 408 | Moderate |
| CMH | 407 | Moderate |
| IND | 405 | Moderate |
| PHX | 380 | Moderate |
| STL | 368 | Moderate |
Joining the two datasets to add a new origin frequency column
md_pred <- left_join(md_pred,origin,by='origin')
md_pred <- md_pred%>%filter(!is.na(origin_freq))We joined the md_pred with origin dataset using left join and on origin column
Adding month column
md_pred <- md_pred%>%mutate(Month=lubridate::month(ymd(fl_date),label=T))Adding levels to categorical columns
md_pred <- md_pred %>%
mutate_at(vars(op_unique_carrier,dest,origin_freq, Month,dow), funs(as.factor))## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
##
## # Before:
## funs(name = f(.)
##
## # After:
## list(name = ~f(.))
## This warning is displayed once per session.
md_pred$Month <- factor(md_pred$Month, levels = c("Jan","Feb","Mar"), labels =c(1,2,3),
ordered = F )
md_pred$dow <- factor(md_pred$dow, levels = c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"),
labels =c(1,2,3,4,5,6,7),ordered = F )
md_pred$origin_freq <- factor(md_pred$origin_freq, levels = c("Very Low","Low","Moderate","High"),
labels =c(1,2,3,4) )
md_pred$dest <- factor(md_pred$dest, levels = c("ORD","ATL"),
labels =c(1,2) )
md_pred$op_unique_carrier <- factor(md_pred$op_unique_carrier,
levels = c("F9","EV","MQ","NK","OH","OO","VX","WN","YV","YX","AA", "AS", "DL", "9E", "B6", "UA"),
labels =c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16) )
df_status(md_pred)## variable q_zeros p_zeros q_na p_na q_inf p_inf type
## 1 fl_date 0 0.00 0 0.00 0 0 Date
## 2 op_unique_carrier 0 0.00 0 0.00 0 0 factor
## 3 tail_num 0 0.00 0 0.00 0 0 character
## 4 op_carrier_fl_num 0 0.00 0 0.00 0 0 numeric
## 5 origin 0 0.00 0 0.00 0 0 character
## 6 dest 0 0.00 0 0.00 0 0 factor
## 7 dep_time 0 0.00 0 0.00 0 0 numeric
## 8 dep_delay 0 0.00 0 0.00 0 0 numeric
## 9 taxi_out 0 0.00 0 0.00 0 0 numeric
## 10 taxi_in 0 0.00 0 0.00 0 0 numeric
## 11 arr_time 0 0.00 0 0.00 0 0 numeric
## 12 arr_delay 0 0.00 0 0.00 0 0 numeric
## 13 cancelled 31766 100.00 0 0.00 0 0 numeric
## 14 actual_elapsed_time 0 0.00 0 0.00 0 0 numeric
## 15 air_time 0 0.00 0 0.00 0 0 numeric
## 16 distance 0 0.00 0 0.00 0 0 numeric
## 17 carrier_delay 10871 34.22 10411 32.77 0 0 numeric
## 18 weather_delay 19618 61.76 10411 32.77 0 0 numeric
## 19 nas_delay 10773 33.91 10411 32.77 0 0 numeric
## 20 security_delay 21303 67.06 10411 32.77 0 0 numeric
## 21 late_aircraft_delay 8496 26.75 10411 32.77 0 0 numeric
## 22 dow 0 0.00 0 0.00 0 0 factor
## 23 cnt 0 0.00 0 0.00 0 0 integer
## 24 origin_freq 0 0.00 0 0.00 0 0 factor
## 25 Month 0 0.00 0 0.00 0 0 factor
## unique
## 1 89
## 2 16
## 3 4230
## 4 3103
## 5 182
## 6 2
## 7 1263
## 8 660
## 9 147
## 10 100
## 11 1331
## 12 668
## 13 1
## 14 331
## 15 308
## 16 275
## 17 557
## 18 289
## 19 321
## 20 32
## 21 393
## 22 7
## 23 128
## 24 4
## 25 3
Converted op_unique_carrier, origin_freq, Month, dow and dest columns to factors
Selecting final columns
md_pred <- md_pred%>%select(2,6,8,12,14,16:22,24,25)Selecting appropriate columns for predicting arrival delay
Part-2 Data summarization and preparation
skimr::skim(md_pred)## Skim summary statistics
## n obs: 31766
## n variables: 14
##
## ── Variable type:factor ───────────────────────────────────────────────────
## variable missing complete n n_unique
## dest 0 31766 31766 2
## dow 0 31766 31766 7
## Month 0 31766 31766 3
## op_unique_carrier 0 31766 31766 16
## origin_freq 0 31766 31766 4
## top_counts ordered
## 2: 16957, 1: 14809, NA: 0 FALSE
## 1: 5878, 3: 5038, 4: 4810, 2: 4567 FALSE
## 1: 12062, 3: 9934, 2: 9770, NA: 0 FALSE
## 13: 9467, 6: 5874, 11: 3297, 3: 2943 FALSE
## 1: 9499, 4: 8874, 3: 7011, 2: 6382 FALSE
##
## ── Variable type:numeric ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75
## actual_elapsed_time 0 31766 31766 128.61 52.69 37 90 117 156
## arr_delay 0 31766 31766 57.21 95.45 1 11 27 65
## carrier_delay 10411 21355 31766 26.41 88.33 0 0 0 16
## dep_delay 0 31766 31766 58.46 95.01 1 13 29 66
## distance 0 31766 31766 660.87 463.9 67 334 584 763
## late_aircraft_delay 10411 21355 31766 33.47 58 0 0 14 42
## nas_delay 10411 21355 31766 16.47 42.98 0 0 0 16
## security_delay 10411 21355 31766 0.059 2.24 0 0 0 0
## weather_delay 10411 21355 31766 5.31 39.04 0 0 0 0
## p100 hist
## 539 ▆▇▃▁▁▁▁▁
## 1449 ▇▁▁▁▁▁▁▁
## 1390 ▇▁▁▁▁▁▁▁
## 1439 ▇▁▁▁▁▁▁▁
## 4502 ▇▃▁▁▁▁▁▁
## 1431 ▇▁▁▁▁▁▁▁
## 1161 ▇▁▁▁▁▁▁▁
## 198 ▇▁▁▁▁▁▁▁
## 1352 ▇▁▁▁▁▁▁▁
We see there are 5 factor columns and 9 numerical columns mentioned with extensive information about both the variable types. Many variables are right skewed along with arrival delay. The skewness needs to be addressed.
Data preparation
recipe_obj <- recipe(arr_delay ~ ., data = md_pred) %>%
step_zv(all_predictors()) %>%
step_YeoJohnson(nas_delay,late_aircraft_delay,security_delay,weather_delay,carrier_delay,arr_delay)%>%
prep()We solved the problem of zero variance columns which has no information and transformed the highly skewed varibles including arrival delay column using Yeo Johnson transformation
md_pred_final dataset
md_pred_final <- bake(recipe_obj, new_data = md_pred)
skimr::skim(md_pred_final)## Skim summary statistics
## n obs: 31766
## n variables: 14
##
## ── Variable type:factor ───────────────────────────────────────────────────
## variable missing complete n n_unique
## dest 0 31766 31766 2
## dow 0 31766 31766 7
## Month 0 31766 31766 3
## op_unique_carrier 0 31766 31766 16
## origin_freq 0 31766 31766 4
## top_counts ordered
## 2: 16957, 1: 14809, NA: 0 FALSE
## 1: 5878, 3: 5038, 4: 4810, 2: 4567 FALSE
## 1: 12062, 3: 9934, 2: 9770, NA: 0 FALSE
## 13: 9467, 6: 5874, 11: 3297, 3: 2943 FALSE
## 1: 9499, 4: 8874, 3: 7011, 2: 6382 FALSE
##
## ── Variable type:numeric ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25
## actual_elapsed_time 0 31766 31766 128.61 52.69 37 90
## arr_delay 0 31766 31766 3.27 1.22 0.69 2.46
## carrier_delay 10411 21355 31766 0.85 0.92 0 0
## dep_delay 0 31766 31766 58.46 95.01 1 13
## distance 0 31766 31766 660.87 463.9 67 334
## late_aircraft_delay 10411 21355 31766 2.06 1.83 0 0
## nas_delay 10411 21355 31766 0.87 0.94 0 0
## security_delay 10411 21355 31766 0.059 2.24 0 0
## weather_delay 10411 21355 31766 0.022 0.073 0 0
## p50 p75 p100 hist
## 117 156 539 ▆▇▃▁▁▁▁▁
## 3.29 4.13 7.1 ▂▅▇▇▆▃▁▁
## 0 1.78 2.58 ▇▁▁▁▂▂▂▁
## 29 66 1439 ▇▁▁▁▁▁▁▁
## 584 763 4502 ▇▃▁▁▁▁▁▁
## 2.64 3.63 6.78 ▇▁▁▅▃▂▁▁
## 0 1.83 2.71 ▇▁▁▁▂▂▂▁
## 0 0 198 ▇▁▁▁▁▁▁▁
## 0 0 0.27 ▇▁▁▁▁▁▁▁
Creating a new dataset md_pred_final for final predictive modeling. We used reciepe obj on md_pred_final dataset. We can observe arrival delay skewness is cured. The missing values of the five delay columns came down from 85% to 33%
Part-3 Unique Carrier delays
Most frequent carriers on monthly basis
ggplot(data = md_pred_final) +
aes(x = op_unique_carrier) +
geom_bar(color = "black", fill = "lightblue") + facet_grid(Month~.)Delta airlines (13) has most flights per month followed by Skywest airlines(6)
Average delay time per month for unique carrier
a <- md_pred_final%>%group_by(Month,op_unique_carrier)%>%summarise(avg=mean(arr_delay))
md_pred_final%>%group_by(Month)%>%summarise(avg=mean(arr_delay))## # A tibble: 3 x 2
## Month avg
## <fct> <dbl>
## 1 1 3.40
## 2 2 3.35
## 3 3 3.03
p <- ggplot(data = a,aes(x = op_unique_carrier,y=avg) )+
geom_bar(stat = 'identity',fill='#807dba') + facet_grid(Month~.)
p + geom_hline(aes(yintercept = avg), data.frame(Month = c(1, 2, 3), avg = c(3.40,3.35, 3.03)))The average delay time for month of January is highest in the first quarter which makes sense as people are on holidays and are often travelling. The delay reduces gradually. The horizontal black line is the reference line for average arrival delay for that month. Delta airlines has average arrival delay time for the month of January and it reduces slightly below average for rest of the months. Skywest airline has more than average delay time in all three months. Southwest airline has significant below average arrival delays for all the three months and has decent frequency.
Carrier frequency on a daily basis
ggplot(data = md_pred_final) +
aes(x = op_unique_carrier) +
geom_bar(color = "black", fill = "lightblue") + facet_grid(dow~.)Average delay time per day for 16 unique carriers
b <- md_pred_final%>%group_by(dow,op_unique_carrier)%>%summarise(avg=mean(arr_delay))
md_pred_final%>%group_by(dow)%>%summarise(avg=mean(arr_delay))## # A tibble: 7 x 2
## dow avg
## <fct> <dbl>
## 1 1 3.30
## 2 2 3.24
## 3 3 3.47
## 4 4 3.14
## 5 5 3.25
## 6 6 3.07
## 7 7 3.31
p <- ggplot(data = b,aes(x = op_unique_carrier,y=avg) )+
geom_bar(stat = 'identity',fill='#807dba') + facet_grid(dow~.)
p + geom_hline(aes(yintercept = avg), data.frame(dow = c(1, 2, 3,4,5,6,7), avg = c(3.30, 3.24, 3.47,3.14, 3.25,3.07,3.31)))Wednesdays (3rd grid) has highest average arrival delays
Part-4 Data Imputation
Imputing missing values using Random Forest
class(md_pred)## [1] "tbl_df" "tbl" "data.frame"
#md_df <- data.frame(md_pred_final)
#md_pred_imp<- missForest(md_df,maxiter = 2)As five columns namely carrier_delay, late_aircraft_delay, nas_delay, security_delay and weather_delay has 33% missing values, we need to impute them. Random forest is used for the imputation
#md_pred_impute <- md_pred_imp$ximp
#kable(head(md_pred_impute,50),'html') %>%
#kable_styling() %>%
#scroll_box(width = "800px", height = "700px")Data summarization of imputed dataset
#skimr::skim(md_pred_impute)Negative values are imputed in security delay and weather_delay. I would thus not like to use this dataset as negative values in delays column don’t make much sense.
Part-5 Inferential Regression
reg<- lm(arr_delay~.,data=md_pred_final,na.action = na.omit)
summary(reg)##
## Call:
## lm(formula = arr_delay ~ ., data = md_pred_final, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3199 -0.3088 0.0219 0.3156 1.0437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.837e+00 3.453e-02 82.164 < 2e-16 ***
## op_unique_carrier2 -5.799e-02 3.327e-02 -1.743 0.081368 .
## op_unique_carrier3 -1.143e-01 3.145e-02 -3.635 0.000279 ***
## op_unique_carrier4 2.585e-02 3.451e-02 0.749 0.453913
## op_unique_carrier5 -6.535e-02 4.688e-02 -1.394 0.163349
## op_unique_carrier6 -4.504e-02 3.021e-02 -1.491 0.136077
## op_unique_carrier7 8.454e-02 8.402e-02 1.006 0.314313
## op_unique_carrier8 -1.847e-01 3.135e-02 -5.893 3.84e-09 ***
## op_unique_carrier9 1.598e-02 6.108e-02 0.262 0.793657
## op_unique_carrier10 -7.684e-02 3.336e-02 -2.303 0.021295 *
## op_unique_carrier11 -9.279e-02 3.015e-02 -3.077 0.002091 **
## op_unique_carrier12 5.850e-02 7.505e-02 0.779 0.435696
## op_unique_carrier13 -5.592e-02 2.938e-02 -1.903 0.057008 .
## op_unique_carrier14 -7.100e-02 3.322e-02 -2.137 0.032613 *
## op_unique_carrier15 -1.134e-01 4.031e-02 -2.814 0.004901 **
## op_unique_carrier16 -2.104e-02 3.070e-02 -0.685 0.493151
## dest2 9.927e-04 9.699e-03 0.102 0.918475
## dep_delay 5.597e-03 2.981e-05 187.757 < 2e-16 ***
## actual_elapsed_time 4.168e-03 1.901e-04 21.921 < 2e-16 ***
## distance -4.791e-04 2.149e-05 -22.293 < 2e-16 ***
## carrier_delay 1.427e-01 4.321e-03 33.018 < 2e-16 ***
## weather_delay 1.040e+00 4.616e-02 22.532 < 2e-16 ***
## nas_delay 1.643e-01 4.784e-03 34.335 < 2e-16 ***
## security_delay 6.181e-03 1.361e-03 4.540 5.64e-06 ***
## late_aircraft_delay 1.163e-01 2.112e-03 55.042 < 2e-16 ***
## dow2 -7.493e-03 1.072e-02 -0.699 0.484555
## dow3 4.629e-02 1.025e-02 4.517 6.29e-06 ***
## dow4 -3.093e-02 1.078e-02 -2.868 0.004129 **
## dow5 -4.480e-03 1.084e-02 -0.413 0.679430
## dow6 -1.205e-02 1.299e-02 -0.927 0.353870
## dow7 1.305e-02 1.100e-02 1.187 0.235247
## origin_freq2 -1.824e-02 9.170e-03 -1.989 0.046736 *
## origin_freq3 -5.297e-02 9.573e-03 -5.533 3.18e-08 ***
## origin_freq4 -6.609e-02 9.876e-03 -6.692 2.26e-11 ***
## Month2 1.581e-02 7.376e-03 2.143 0.032125 *
## Month3 -6.641e-02 7.674e-03 -8.654 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4456 on 21319 degrees of freedom
## (10411 observations deleted due to missingness)
## Multiple R-squared: 0.7064, Adjusted R-squared: 0.7059
## F-statistic: 1465 on 35 and 21319 DF, p-value: < 2.2e-16
Significant variables can be seen from the multiple linear regression model. All the delay variables are highly significant along with wednesday
plot(reg)QQ plot- Residuals are left skewed. From other 3 graphs we see there are some outliers.
Finding outliers in md_pred_final dataset
bp<- ggplot(data = md_pred_final) +
aes(x = arr_delay, y = arr_delay) +
geom_boxplot(fill = '#9e9ac8') +
labs(title = 'Box Plot of Arrival Delay') +
theme_linedraw()
histo <- ggplot(data = md_pred_final) +
aes(x = arr_delay) +
geom_histogram(bins = 12, fill = '#9e9ac8') +
labs(title = 'Histogram of Arrival Delay') +
theme_linedraw()
plot_grid(bp,histo)## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
From both the plots we can see that there are some outliers for more than 6.9 arr_delay value approx
Removing the outliers
md_pred_final <- md_pred_final%>%filter(arr_delay<6.9)We removed arr_delay values of more than 6.9
Part-6 Predictive Modeling
Splitting the data
train <- md_pred_final[sample(1:25411),]
colnames(train)## [1] "op_unique_carrier" "dest" "dep_delay"
## [4] "arr_delay" "actual_elapsed_time" "distance"
## [7] "carrier_delay" "weather_delay" "nas_delay"
## [10] "security_delay" "late_aircraft_delay" "dow"
## [13] "origin_freq" "Month"
test <- md_pred_final[sample(25411:nrow(md_pred_final)),]
colnames(test)## [1] "op_unique_carrier" "dest" "dep_delay"
## [4] "arr_delay" "actual_elapsed_time" "distance"
## [7] "carrier_delay" "weather_delay" "nas_delay"
## [10] "security_delay" "late_aircraft_delay" "dow"
## [13] "origin_freq" "Month"
write.csv(train,file="train.csv")
write.csv(test,file="test.csv")Using H2O AutoML
h2o.init()## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 19 hours 59 minutes
## H2O cluster timezone: America/New_York
## H2O data parsing timezone: UTC
## H2O cluster version: 3.24.0.2
## H2O cluster version age: 18 days
## H2O cluster name: H2O_started_from_R_saurabhyelne_lbz280
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.96 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.2 (2017-09-28)
train <- h2o.importFile("train.csv")##
|
| | 0%
|
|======== | 12%
|
|=================================================================| 100%
test <- h2o.importFile("test.csv")##
|
| | 0%
|
|=================================================================| 100%
split_h2o <- h2o.splitFrame(as.h2o(train), ratios = c(0.85))
train_h2o <- split_h2o[[1]]
valid_h2o <- split_h2o[[2]]
test_h2o <- as.h2o(test)Here I am using H2O which is one of the advanced machine learning library and has several state of the art models. We divided data into three sets for training, validation and testing
Model generation
y <- 'arr_delay'
x <- setdiff(names(train),y)
#automl_models_h2o <- h2o.automl(
# x = x,
#y = y,
# training_frame = train_h2o,
# validation_frame = valid_h2o,
# leaderboard_frame = test_h2o,
#max_runtime_secs = 300,
#nfolds = 5
#)We provided training, validation and test dataset to the automl function. We are using 5 fold cross validation with maximum runtime of 100 seconds.
Best performing models
#automl_models_h2o@leaderboard
#print(paste0("The best performing model is ",automl_models_h2o@leader@model_id))The best model is GBM_1_AutoML_20190504_073500
Description about the best model - GBM_1_AutoML_20190504_073500
#automl_models_h2o@leader
Saving the top performing model to working directory
#h2o.getModel("GBM_1_AutoML_20190504_073500") %>%
# h2o.saveModel(path = "04_Modeling/h2o_models/")
#h2o_gbm <- h2o.loadModel("04_Modeling/h2o_models/GBM_1_AutoML_20190504_073500")Prediction by Gradient Boosting Machine model
#predictions <- h2o.predict(h2o_gbm, newdata = as.h2o(test))
#predictions_tbl <- predictions %>% as.tibble()
#predictions_tblPerformance of the top model
#performance_h2o_gbm <- h2o.performance(h2o_gbm, newdata = as.h2o(test))
#typeof(performance_h2o_gbm)
#performance_h2o_gbm %>% slotNames()
#performance_h2o@metrics
#train_rmse_gbm <- h2o.rmse(gbm_h2o, train = TRUE)
#test_rmse_gbm <- h2o.rmse(gbm_h2o, valid = TRUE)
#hold_rmse_gbm <- h2o.rmse(object = performance_h2o_gbm)
#print(paste0("GBM rmse TRAIN = ", train_rmse_gbm, ", rmse valid = ", test_rmse_gbm, ", rmse HOLDOUT = ",
# hold_rmse_gbm))
#h2o.varimp_plot(gbm_h2o, num_of_features = NULL)[1] “GBM rmse TRAIN = 8.34806208521452, rmse valid = 17.6673196637401, rmse HOLDOUT = 6.83885108336663”
The plot shows important variables for predicting arrival delay. Carrier delay is the most important factor
**Thank You**