Airport are getting more and more crowded! Following the development of aircraft industry and more sophisticated air traffic control system, air transportation has become an ordinary transportation option for many people. This also followed by the increase of flights frequency and passager; a very challanging task for the airport management.
To help the airport management team, let’s try apply our knowledge about data science by forecasting flights frequency based on historical data using time series approach. This kind of forecasting and analysis can give a valuable isight for airport management team to plan its operational procedure and manage their recources better.
Time series is a method of analyzing and processing data which the values are affected by time. The action of predicting future values based on its value in the previous period of time is called forecasting.
The data which formatted into a time series (ts) object must have some characteristics:
A ts object can be decomposed into 3 main components which will be calculated for forecasting. These components are:
The relationship between those 3 components of ts object can be either additive or multiplicative, which can be seen on the line plot of the original data:
Most of the analytic tools for time series are available for additive time series data. Even so, we can apply standard additive time series analysis to a multiplicative time series data by previously log-transformed our data. This action will transform our multiplicative time series data into additive time series data. Another way is to state this characteristic when building a time series model. This characteristic can also be stated during decomposition of a ts object during exploratory data analysis.
Besides forecasting, time series object can also be very useful for data analysis. Using seasonality adjustment (removing seasonal of a time series object) we can analyze the trend and detect anomaly events throughout the time series interval.
There are few ways of forecasting in time series:
Naive Model
Predict future values based on 1 previous values.
Simple Moving Average (SMA)
Predict the value of a data point by calculating the average of n numbers of data before the data point. This method is more appropriate for a data with no trend and seasonal and used for smoothing error of the data. The drawback of this method is that it assigns equal weight for each data point, though in reality, newer data point should have higher weight than the overpast.
Exponential Smoothing
Predict future values by smoothing on error, trend, and seasonal, where it gives different weights on each data point used for prediction. Exponential Smoothing divided into three:
ARIMA
ARIMA is an acronyms for Auto Regresive Integreted Moving Averange. ARIMA is great for predicting a stationary time series data. Stationary time series data has values which move around its mean (no trend nor seasonality). It predicts future values by combining Auto Regresive and Moving Averange method, whereas integrated stands for how many times differencing applied to the data (a way to transform non-stationary into stationary time series data). There is also a special approach for stationary data with seasonality that is using Seasonal ARIMA (SARIMA).
Forecasting can be done by using forecast() function from forecast package. Evaluation can be done by comparing errors of the prediction.
There are two assumption for a time series analysis:
Normality: Shapiro.test
Autocorrelations: Box.test - Ljng-Box
There are some cases that have multiple seasonal on their data and should be handled differently such as using seasonal time series approach (Seasonal ARIMA, etc.)
The data was obtained from nycflights13 which contains airline on-time data for all flights departed from NYC in 2013. The Airports include Newark Liberty International Airport, John F. Kennedy International Airport, and LaGuardia Airport.
Description for each column are:
year,month,day: Date of departure.dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM), local tz.sched_dep_time,sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local tz.dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.carrier: Two letter carrier (airlines) abbreviation.flight: Flight number.tailnum: Plane tail number.origin,dest: Airports of origin and destination.air_time: Amount of time spent in the air, in minutes.distance: Distance between airports, in mileshour,minute: Time of scheduled departure broken into hour and minutes.time_hour: Scheduled date and hour of the flight (YYYYMMDD HHMMSS).Each airport operates individually and therefore the forecasting should also be done separately. If we look at the number of flights departed from each airport, Newark Liberty International Airport (EWR) has the highest number of flight. Therefore we are going to forecast the flight demand of Newark Liberty International Airport.
data.frame(Airport = c("Newark Liberty International Airport", "John F. Kennedy International Airport", "LaGuardia Airport"), Flights = data.frame(table(flight$origin))[,2])The data were aggregated until we obtain the number of flight departed from Newark Liberty International Airport, per hour.
flight_agg <- flight %>%
group_by(origin, time_hour) %>%
summarise(flight = sum(n())) %>%
ungroup() %>%
filter(origin == "EWR") %>%
select(-origin)## `summarise()` regrouping output by 'origin' (override with `.groups` argument)
Time Series forecasting require ordered and complete interval data. Because the data is exceptionally huge (6266 observations), it would be difficult to manually inspect missing hourly records in the data. We can perform padding using pad() function from padr package to add missing hourly records and impute them with zero value.
## pad applied on the interval: hour
There are various possibility of seasonality in our dataset. Flights frequency may be based on what hour, day, and month of the flights. In this analysis, I will try to provide daily data of flight frequency and analyze it as a time series object. This will remove the hourly effect on flight frequency, but it is a practical strategy for airports that prepare its recources for a daily basis. Furthermore, the dataset only contain records of 1 year and therefore is not available for yearly analysis.
daily_flight <- flight_agg %>%
mutate(year = year(time_hour),
month = month(time_hour),
day = day(time_hour),
date = paste0(year,"-",month,"-",day),
date = ymd(date)) %>%
select(-time_hour, -year, -month, -day) %>%
group_by(date) %>%
summarise(flight = sum(flight))## `summarise()` ungrouping output (override with `.groups` argument)
## [1] "2013-01-01" "2013-12-31"
# making initial ts object
air_ts <- ts(data = daily_flight$flight,
start = range(daily_flight$date)[[1]],
frequency = 7) # weekly seasonalityIn this section, I tried to explore whether our timeseries object has trend and seasonal properties (one-seasonal/multiseasonal).
Based on the plot, the trend still shows a pattern of ups and downs which indicates a multiseasonal time series object. Therefore, we should try reanalyze this data through multiseasonal time series using quarter of a year.
# making 2nd ts object
daily_flight$flight %>%
msts(seasonal.periods = c(7,7*4)) %>% # multiseasonal ts (weekly, monthly)
mstl() %>% # multiseasonal ts decomposition
autoplot() Based on the plot, there is still a pattern in trend and a different patern (expand-shrink in size) in seasonal. Therefore, another seasonal frequency should be used.
# making 3rd ts object
daily_flight$flight %>%
msts(seasonal.periods = c(7*4,7*4*3)) %>% # multiseasonal ts (monthly, quarterly)
mstl() %>% # multiseasonal ts decomposition
autoplot() The last ts object showed the best decomposition among the 3 trials and therefore will be used for the time series model building. Additionally, note that the data is an additive time series and . This is a valuable information for the model building later.
# assign final ts object
air_msts <- daily_flight$flight %>%
msts(seasonal.periods = c(7*4,7*4*3))
# check for stationary
adf.test(air_msts)##
## Augmented Dickey-Fuller Test
##
## data: air_msts
## Dickey-Fuller = -3.5585, Lag order = 7, p-value = 0.03721
## alternative hypothesis: stationary
Based on Augmented Dickey-Fuller Test (adf.test) result, the p-value is < alpha and therefore the data is already stationary. Therefore, for a model building using SARIMA, we do not need to perform differencing on the data first.
The cross-validation scheme for time series should not be sampled randomly, but splitted sequentially.
For model building, I will compare between two of the widely used time series modeling in business and industry: ETS Holt-Winters and Seasonal Arima. I use ETS Holt-Winters because my data contain both seasonal and trend. I also want to compare it between seasonal Arima to check whether seasonal ARIMA can give better forecasting performance.
# forecast
air_ets_f <- forecast(air_ets, h = 28)
air_arima_f <- forecast(air_arima, h = 28)
air_ets_f # the forecasted value + confidence interval from ETS model## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 5.011905 345.0745 323.8011 367.7455 313.0758 380.3437
## 5.023810 346.2402 324.7451 369.1580 313.9119 381.8978
## 5.035714 346.2617 324.6166 369.3500 313.7117 382.1891
## 5.047619 248.6707 233.0202 265.3724 225.1380 274.6632
## 5.059524 317.4232 297.3113 338.8956 287.1858 350.8443
## 5.071429 347.1661 325.0239 370.8168 313.8801 383.9821
## 5.083333 341.8014 319.8589 365.2493 308.8193 378.3061
## 5.095238 346.3592 323.9806 370.2834 312.7255 383.6101
## 5.107143 348.4923 325.8327 372.7277 314.4401 386.2321
## 5.119048 347.7662 325.0118 372.1136 313.5754 385.6849
## 5.130952 249.8687 233.4185 267.4783 225.1534 277.2972
## 5.142857 312.3065 291.6200 334.4605 281.2297 346.8174
## 5.154762 345.1186 322.1206 369.7586 310.5732 383.5066
## 5.166667 334.1191 311.7213 358.1262 300.4790 371.5254
## 5.178571 347.4527 324.0240 372.5756 312.2679 386.6020
## 5.190476 323.7254 301.7695 347.2786 290.7562 360.4329
## 5.202381 326.0999 303.8560 349.9722 292.7017 363.3090
## 5.214286 242.0452 225.4411 259.8723 217.1176 269.8349
## 5.226190 313.3627 291.7456 336.5815 280.9126 349.5613
## 5.238095 348.2560 324.0986 374.2139 311.9964 388.7295
## 5.250000 346.5263 322.3572 372.5075 310.2529 387.0406
## 5.261905 350.1693 325.6138 376.5766 313.3199 391.3526
## 5.273810 335.5474 311.8912 360.9978 300.0512 375.2428
## 5.285714 335.5316 311.7513 361.1259 299.8528 375.4558
## 5.297619 247.1235 229.5173 266.0803 220.7107 276.6972
## 5.309524 320.6750 297.7102 345.4112 286.2268 359.2691
## 5.321429 351.1378 325.8626 378.3734 313.2277 393.6361
## 5.333333 346.1066 321.0673 373.0987 308.5541 388.2294
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 5.011905 336.2589 316.7929 356.9210 306.9486 368.3680
## 5.023810 350.2987 328.3421 373.7235 317.2817 386.7515
## 5.035714 343.1734 321.6415 366.1466 310.7957 378.9240
## 5.047619 249.2586 233.3408 266.2623 225.3301 275.7281
## 5.059524 316.2016 295.9729 337.8129 285.7937 349.8449
## 5.071429 347.0129 324.6295 370.9397 313.3709 384.2665
## 5.083333 341.0105 318.9253 364.6251 307.8191 377.7808
## 5.095238 345.9137 323.3752 370.0232 312.0447 383.4589
## 5.107143 347.8473 325.0735 372.2166 313.6277 385.8006
## 5.119048 347.2304 324.3744 371.6968 312.8907 385.3389
## 5.130952 249.4412 232.9402 267.1112 224.6516 276.9663
## 5.142857 311.8014 291.0694 334.0101 280.6586 346.4000
## 5.154762 344.5427 321.5199 369.2140 309.9619 382.9815
## 5.166667 333.5710 311.1706 357.5839 299.9280 370.9876
## 5.178571 346.8773 323.4695 371.9791 311.7245 385.9943
## 5.190476 323.1920 301.2766 346.7016 290.2834 359.8313
## 5.202381 325.5611 303.3792 349.3649 292.2553 362.6625
## 5.214286 241.6460 225.1033 259.4043 216.8096 269.3274
## 5.226190 312.8453 291.3277 335.9522 280.5427 348.8674
## 5.238095 347.6813 323.6562 373.4897 311.6176 387.9186
## 5.250000 345.9543 321.9383 371.7618 309.9074 386.1939
## 5.261905 349.5914 325.2120 375.7983 313.0023 390.4576
## 5.273810 334.9935 311.5265 360.2283 299.7767 374.3475
## 5.285714 334.9778 311.4067 360.3331 299.6079 374.5233
## 5.297619 246.7156 229.2781 265.4794 220.5518 275.9832
## 5.309524 320.1457 297.4187 344.6094 286.0483 358.3076
## 5.321429 350.5582 325.5638 377.4715 313.0622 392.5452
## 5.333333 345.5354 320.7926 372.1865 308.4201 387.1170
# visualization
library(ggplot2)
library(gridExtra)
a <- autoplot(air_ets_f, series = "ETS", fcol = "red") +
autolayer(air_msts, series = "Actual", color = "black") +
labs(subtitle = "Flight at Newark Liberty International Airport, from Jan - Dec 2013",
y = "Flight Frequency") +
theme_minimal()
b <- autoplot(air_arima_f, series = "ARIMA", fcol = "blue") +
autolayer(air_msts, series = "Actual", color = "black") +
labs(subtitle = "Flight at Newark Liberty International Airport, from Jan - Dec 2013",
y = "Flight Frequency") +
theme_minimal()
grid.arrange(a,b)# model evaluation: root mean squared error (RMSE)
data.frame(ETS = RMSE(air_ets_f$mean, air_test), ARIMA = RMSE(air_arima_f$mean, air_test))From the analysis above, we can conclude that we have sucsesfully forecast flight frequency and found ARIMA as the best model with the lowest error (RMSE ~33.86, although not very different from ETS model).
1. Normality: Shapiro.test
##
## Shapiro-Wilk normality test
##
## data: air_arima_f$residuals
## W = 0.83768, p-value < 0.00000000000000022
2. Autocorrelation: Box.test - Ljng-Box
##
## Box-Ljung test
##
## data: air_arima_f$residuals
## X-squared = 0.0020123, df = 1, p-value = 0.9642
Based on the assumption check, there is no autocorrelation on our forecast residuals (p-value > 0.05). Still, our forecast’s residuals are not distributed normally, therefore it’s residuals may not be appeared around its mean as seen in the histogram. But, if we inspect the distribution of residuals through a line plot, it is actually resembles the error plot from our time series object decomposition.
In a time series, such errors might emerge from various unpredictable events and is actually quite unavoidable. One strategy to overcome it is to analyze what kinds of unpredictable events that might occur and occurs frequently. This can be done by time series analysis using seasonality adjustment. From that insight, airports can develop an standard operational procedure and smart strategies for dealing with such events.