1 Introduction: Airport Dash!

Airport are getting more and more crowded! Following the development of aircraft industry and more sophisticated air traffic control system, air transportation has become an ordinary transportation option for many people. This also followed by the increase of flights frequency and passager; a very challanging task for the airport management.

To help the airport management team, let’s try apply our knowledge about data science by forecasting flights frequency based on historical data using time series approach. This kind of forecasting and analysis can give a valuable isight for airport management team to plan its operational procedure and manage their recources better.

2 Basic Concept: Time Series

2.1 Time Series

Time series is a method of analyzing and processing data which the values are affected by time. The action of predicting future values based on its value in the previous period of time is called forecasting.

The data which formatted into a time series (ts) object must have some characteristics:

no missing intervals
no missing values
data should be ordered by time

2.2 Exploratory Data Analysis

A ts object can be decomposed into 3 main components which will be calculated for forecasting. These components are:

trend (T): the movement of mean, globally, throughout an interval;
seasonal (S): the pattern captured on each seasonal interval;
error (E): the pattern/value that cannot be captured by both trend and seasonal.

The relationship between those 3 components of ts object can be either additive or multiplicative, which can be seen on the line plot of the original data:

additive: T + S + E
multiplicative: T * S * E

Most of the analytic tools for time series are available for additive time series data. Even so, we can apply standard additive time series analysis to a multiplicative time series data by previously log-transformed our data. This action will transform our multiplicative time series data into additive time series data. Another way is to state this characteristic when building a time series model. This characteristic can also be stated during decomposition of a ts object during exploratory data analysis.

Besides forecasting, time series object can also be very useful for data analysis. Using seasonality adjustment (removing seasonal of a time series object) we can analyze the trend and detect anomaly events throughout the time series interval.

2.3 Forecast Model

There are few ways of forecasting in time series:

Naive Model

Predict future values based on 1 previous values.
Simple Moving Average (SMA)

Predict the value of a data point by calculating the average of n numbers of data before the data point. This method is more appropriate for a data with no trend and seasonal and used for smoothing error of the data. The drawback of this method is that it assigns equal weight for each data point, though in reality, newer data point should have higher weight than the overpast.
Exponential Smoothing

Predict future values by smoothing on error, trend, and seasonal, where it gives different weights on each data point used for prediction. Exponential Smoothing divided into three:
- simple exponential smoothing (SES): smoothing error
- double exponential smoothing (Holt): smoothing error & trend
- triple exponential smoothing (Holt Winters): smoothing error, trend, & seasonal
ARIMA

ARIMA is an acronyms for Auto Regresive Integreted Moving Averange. ARIMA is great for predicting a stationary time series data. Stationary time series data has values which move around its mean (no trend nor seasonality). It predicts future values by combining Auto Regresive and Moving Averange method, whereas integrated stands for how many times differencing applied to the data (a way to transform non-stationary into stationary time series data). There is also a special approach for stationary data with seasonality that is using Seasonal ARIMA (SARIMA).

2.4 Forecasting & Evaluation

Forecasting can be done by using forecast() function from forecast package. Evaluation can be done by comparing errors of the prediction.

2.5 Assumption Check

There are two assumption for a time series analysis:

Normality: Shapiro.test
- H0 : residuals are normally distributed
- H1 : residuals are not normally distributed
Autocorrelations: Box.test - Ljng-Box
- H0 : No autocorrelations in the forecast errors
- H1 : there is an autocorrelations in the forecast errors

2.6 Multiple-seasonal Time Series

There are some cases that have multiple seasonal on their data and should be handled differently such as using seasonal time series approach (Seasonal ARIMA, etc.)

3 Solution

3.1 Import Library

library(dplyr) # for data wrangling
library(lubridate) # to dea with date
library(padr) # for padding
library(forecast) # time series library
library(tseries) # for adf.test
library(MLmetrics) # calculate error

3.2 Read Data

The data was obtained from nycflights13 which contains airline on-time data for all flights departed from NYC in 2013. The Airports include Newark Liberty International Airport, John F. Kennedy International Airport, and LaGuardia Airport.

flight <- nycflights13::flights
flight

Description for each column are:

year,month,day: Date of departure.
dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM), local tz.
sched_dep_time,sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local tz.
dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
carrier: Two letter carrier (airlines) abbreviation.
flight: Flight number.
tailnum: Plane tail number.
origin,dest: Airports of origin and destination.
air_time: Amount of time spent in the air, in minutes.
distance: Distance between airports, in miles
hour,minute: Time of scheduled departure broken into hour and minutes.
time_hour: Scheduled date and hour of the flight (YYYYMMDD HHMMSS).

Each airport operates individually and therefore the forecasting should also be done separately. If we look at the number of flights departed from each airport, Newark Liberty International Airport (EWR) has the highest number of flight. Therefore we are going to forecast the flight demand of Newark Liberty International Airport.

data.frame(Airport = c("Newark Liberty International Airport", "John F. Kennedy International Airport", "LaGuardia Airport"), Flights = data.frame(table(flight$origin))[,2])

3.3 Data Aggregation

The data were aggregated until we obtain the number of flight departed from Newark Liberty International Airport, per hour.

flight_agg <- flight %>%
  group_by(origin, time_hour) %>% 
  summarise(flight = sum(n())) %>% 
  ungroup() %>% 
  filter(origin == "EWR") %>% 
  select(-origin)

## `summarise()` regrouping output by 'origin' (override with `.groups` argument)

flight_agg

3.4 Data Pre-processing

3.4.1 Impute Missing Records

Time Series forecasting require ordered and complete interval data. Because the data is exceptionally huge (6266 observations), it would be difficult to manually inspect missing hourly records in the data. We can perform padding using pad() function from padr package to add missing hourly records and impute them with zero value.

flight_agg <- flight_agg %>% 
  pad() %>%
  replace(., is.na(.), 0)

## pad applied on the interval: hour

flight_agg

3.4.2 Feature Engineering & Data Aggregation

There are various possibility of seasonality in our dataset. Flights frequency may be based on what hour, day, and month of the flights. In this analysis, I will try to provide daily data of flight frequency and analyze it as a time series object. This will remove the hourly effect on flight frequency, but it is a practical strategy for airports that prepare its recources for a daily basis. Furthermore, the dataset only contain records of 1 year and therefore is not available for yearly analysis.

daily_flight <- flight_agg %>% 
  mutate(year = year(time_hour),
         month = month(time_hour),
         day = day(time_hour),
         date = paste0(year,"-",month,"-",day),
         date = ymd(date)) %>%
  select(-time_hour, -year, -month, -day) %>% 
  group_by(date) %>% 
  summarise(flight = sum(flight))

## `summarise()` ungrouping output (override with `.groups` argument)

daily_flight

range(daily_flight$date)

## [1] "2013-01-01" "2013-12-31"

3.5 TS Object & EDA

# making initial ts object
air_ts <- ts(data = daily_flight$flight,
   start = range(daily_flight$date)[[1]],
   frequency = 7) # weekly seasonality

In this section, I tried to explore whether our timeseries object has trend and seasonal properties (one-seasonal/multiseasonal).

# decompose ts object
air_deco <- decompose(air_ts)
autoplot(air_deco)

Based on the plot, the trend still shows a pattern of ups and downs which indicates a multiseasonal time series object. Therefore, we should try reanalyze this data through multiseasonal time series using quarter of a year.

# making 2nd ts object
daily_flight$flight %>% 
  msts(seasonal.periods = c(7,7*4)) %>% # multiseasonal ts (weekly, monthly)
  mstl() %>% # multiseasonal ts decomposition
  autoplot()

Based on the plot, there is still a pattern in trend and a different patern (expand-shrink in size) in seasonal. Therefore, another seasonal frequency should be used.

# making 3rd ts object
daily_flight$flight %>% 
  msts(seasonal.periods = c(7*4,7*4*3)) %>% # multiseasonal ts (monthly, quarterly)
  mstl() %>% # multiseasonal ts decomposition
  autoplot()

The last ts object showed the best decomposition among the 3 trials and therefore will be used for the time series model building. Additionally, note that the data is an additive time series and . This is a valuable information for the model building later.

# assign final ts object
air_msts <- daily_flight$flight %>% 
  msts(seasonal.periods = c(7*4,7*4*3))

# check for stationary
adf.test(air_msts)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  air_msts
## Dickey-Fuller = -3.5585, Lag order = 7, p-value = 0.03721
## alternative hypothesis: stationary

Based on Augmented Dickey-Fuller Test (adf.test) result, the p-value is < alpha and therefore the data is already stationary. Therefore, for a model building using SARIMA, we do not need to perform differencing on the data first.

3.6 Cross Validation

The cross-validation scheme for time series should not be sampled randomly, but splitted sequentially.

air_train <- air_msts %>% head(length(air_msts) - 7*4)
air_test <- air_msts %>% tail(7*4)

3.7 Model Building

For model building, I will compare between two of the widely used time series modeling in business and industry: ETS Holt-Winters and Seasonal Arima. I use ETS Holt-Winters because my data contain both seasonal and trend. I also want to compare it between seasonal Arima to check whether seasonal ARIMA can give better forecasting performance.

# ets Holt-Winters
air_ets <- stlm(air_train, method = "ets", lambda = 0) # no log transformation for addivite data
# SARIMA
air_arima <- stlm(air_train, method = "arima", lambda = 0)

3.8 Forecast & Evaluation

# forecast
air_ets_f <- forecast(air_ets, h = 28)
air_arima_f <- forecast(air_arima, h = 28)

air_ets_f # the forecasted value + confidence interval from ETS model

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 5.011905       345.0745 323.8011 367.7455 313.0758 380.3437
## 5.023810       346.2402 324.7451 369.1580 313.9119 381.8978
## 5.035714       346.2617 324.6166 369.3500 313.7117 382.1891
## 5.047619       248.6707 233.0202 265.3724 225.1380 274.6632
## 5.059524       317.4232 297.3113 338.8956 287.1858 350.8443
## 5.071429       347.1661 325.0239 370.8168 313.8801 383.9821
## 5.083333       341.8014 319.8589 365.2493 308.8193 378.3061
## 5.095238       346.3592 323.9806 370.2834 312.7255 383.6101
## 5.107143       348.4923 325.8327 372.7277 314.4401 386.2321
## 5.119048       347.7662 325.0118 372.1136 313.5754 385.6849
## 5.130952       249.8687 233.4185 267.4783 225.1534 277.2972
## 5.142857       312.3065 291.6200 334.4605 281.2297 346.8174
## 5.154762       345.1186 322.1206 369.7586 310.5732 383.5066
## 5.166667       334.1191 311.7213 358.1262 300.4790 371.5254
## 5.178571       347.4527 324.0240 372.5756 312.2679 386.6020
## 5.190476       323.7254 301.7695 347.2786 290.7562 360.4329
## 5.202381       326.0999 303.8560 349.9722 292.7017 363.3090
## 5.214286       242.0452 225.4411 259.8723 217.1176 269.8349
## 5.226190       313.3627 291.7456 336.5815 280.9126 349.5613
## 5.238095       348.2560 324.0986 374.2139 311.9964 388.7295
## 5.250000       346.5263 322.3572 372.5075 310.2529 387.0406
## 5.261905       350.1693 325.6138 376.5766 313.3199 391.3526
## 5.273810       335.5474 311.8912 360.9978 300.0512 375.2428
## 5.285714       335.5316 311.7513 361.1259 299.8528 375.4558
## 5.297619       247.1235 229.5173 266.0803 220.7107 276.6972
## 5.309524       320.6750 297.7102 345.4112 286.2268 359.2691
## 5.321429       351.1378 325.8626 378.3734 313.2277 393.6361
## 5.333333       346.1066 321.0673 373.0987 308.5541 388.2294

air_arima_f # the forcasted value + confidence interval from SARIMA model

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 5.011905       336.2589 316.7929 356.9210 306.9486 368.3680
## 5.023810       350.2987 328.3421 373.7235 317.2817 386.7515
## 5.035714       343.1734 321.6415 366.1466 310.7957 378.9240
## 5.047619       249.2586 233.3408 266.2623 225.3301 275.7281
## 5.059524       316.2016 295.9729 337.8129 285.7937 349.8449
## 5.071429       347.0129 324.6295 370.9397 313.3709 384.2665
## 5.083333       341.0105 318.9253 364.6251 307.8191 377.7808
## 5.095238       345.9137 323.3752 370.0232 312.0447 383.4589
## 5.107143       347.8473 325.0735 372.2166 313.6277 385.8006
## 5.119048       347.2304 324.3744 371.6968 312.8907 385.3389
## 5.130952       249.4412 232.9402 267.1112 224.6516 276.9663
## 5.142857       311.8014 291.0694 334.0101 280.6586 346.4000
## 5.154762       344.5427 321.5199 369.2140 309.9619 382.9815
## 5.166667       333.5710 311.1706 357.5839 299.9280 370.9876
## 5.178571       346.8773 323.4695 371.9791 311.7245 385.9943
## 5.190476       323.1920 301.2766 346.7016 290.2834 359.8313
## 5.202381       325.5611 303.3792 349.3649 292.2553 362.6625
## 5.214286       241.6460 225.1033 259.4043 216.8096 269.3274
## 5.226190       312.8453 291.3277 335.9522 280.5427 348.8674
## 5.238095       347.6813 323.6562 373.4897 311.6176 387.9186
## 5.250000       345.9543 321.9383 371.7618 309.9074 386.1939
## 5.261905       349.5914 325.2120 375.7983 313.0023 390.4576
## 5.273810       334.9935 311.5265 360.2283 299.7767 374.3475
## 5.285714       334.9778 311.4067 360.3331 299.6079 374.5233
## 5.297619       246.7156 229.2781 265.4794 220.5518 275.9832
## 5.309524       320.1457 297.4187 344.6094 286.0483 358.3076
## 5.321429       350.5582 325.5638 377.4715 313.0622 392.5452
## 5.333333       345.5354 320.7926 372.1865 308.4201 387.1170

# visualization
library(ggplot2)
library(gridExtra)

a <- autoplot(air_ets_f, series = "ETS", fcol = "red") +
  autolayer(air_msts, series = "Actual", color = "black") + 
  labs(subtitle = "Flight at Newark Liberty International Airport, from Jan - Dec 2013",
       y = "Flight Frequency") +
  theme_minimal()

b <- autoplot(air_arima_f, series = "ARIMA", fcol = "blue") +
  autolayer(air_msts, series = "Actual", color = "black") +
  labs(subtitle = "Flight at Newark Liberty International Airport, from Jan - Dec 2013",
       y = "Flight Frequency") +
  theme_minimal()

grid.arrange(a,b)

# model evaluation: root mean squared error (RMSE)
data.frame(ETS = RMSE(air_ets_f$mean, air_test), ARIMA = RMSE(air_arima_f$mean, air_test))

From the analysis above, we can conclude that we have sucsesfully forecast flight frequency and found ARIMA as the best model with the lowest error (RMSE ~33.86, although not very different from ETS model).

3.9 Assumption Check

1. Normality: Shapiro.test

H0 : residuals are normally distributed
H1 : residuals are not normally distributed

shapiro.test(air_arima_f$residuals) # p-value < 0.05; reject H0; accept H1

## 
##  Shapiro-Wilk normality test
## 
## data:  air_arima_f$residuals
## W = 0.83768, p-value < 0.00000000000000022

hist(air_arima_f$residuals, breaks = 20)

plot(air_arima_f$residuals)

2. Autocorrelation: Box.test - Ljng-Box

H0 : No autocorrelation in the forecast errors
H1 : there is an autocorrelation in the forecast errors

Box.test(air_arima_f$residuals, type = "Ljung-Box") # there is not enough data to reject H0

## 
##  Box-Ljung test
## 
## data:  air_arima_f$residuals
## X-squared = 0.0020123, df = 1, p-value = 0.9642

Based on the assumption check, there is no autocorrelation on our forecast residuals (p-value > 0.05). Still, our forecast’s residuals are not distributed normally, therefore it’s residuals may not be appeared around its mean as seen in the histogram. But, if we inspect the distribution of residuals through a line plot, it is actually resembles the error plot from our time series object decomposition.

In a time series, such errors might emerge from various unpredictable events and is actually quite unavoidable. One strategy to overcome it is to analyze what kinds of unpredictable events that might occur and occurs frequently. This can be done by time series analysis using seasonality adjustment. From that insight, airports can develop an standard operational procedure and smart strategies for dealing with such events.

Airport Dash

Nabiilah Ardini Fauziyyah

7/14/2019