This paper aims to show what was learned in this course and provide a potential solution to some of the problems many restaurants face in forecasting visitor traffic. The exercise relies on Kaggle’s competition under “Recruit Restaurant Visitor Forecasting,” which takes data from two different sources, similar to Open Table or Resy in the United States. The specific goal is to submit a forecast with the data provided after exploring and transforming the necessary data. To submit a forecast, one has to perform several time-series analyses using different models, such as autoregressive integrated moving averages (ARIMA), exponential smoothing (ETS), and neural networks (NNA).
Empirically, many people know that predicting potential visitor traffic to a restaurant is probably impossible. However, by analyzing the available data, one can at least estimate potential traffic. Working in the food industry for the past 16 years as a price analyst, my experience with many in the foodservice (restaurant) sector is that it is not uncommon for restaurants to provide faulty demand forecasts to their suppliers. As a result, restaurants could have too much or too little food at any given time, and thus potentially increase costs. The same can happen to other resources, like personnel.
The approach taken in this paper focuses on univariate analyses and does not consider explanatory variables such as weather, incomes, etc.
In some of our approaches, we focused on the seasonal patterns and the results used by some metrics, such as RMSE (G. Boomija, A. Anandaraj, S. Nandhini, and S. Lavanya, 2016). On the seasonality component, when performing an ARIMA, we also set the frequency at 7 (Mingming Hu, Richard T.R. Qiu, Doris Chenguang Wu, Haiyan Song, 2021). Many papers suggest taking an ARMA-based approach initially and then testing whether they could be integrated through an ARIMA or SARIMAX model (Fong-Lin Chu, 2009). However, many references utilize univariate time series using other models, such as ETS and simple NNK. However, other papers incorporate exogenous variables, such as income (Reynold, Rahman, Balinbinc, 2013). ETS models performed better in univariate, single models than Naïve or SARIMA (Qiu, Wu, Dropsy, Petit, Pratt, Ohe, 2021). There is an insurmountable amount of literature about forecast modeling in the tourism, restaurants, and hospitality industries. For now, we focused on what was covered throughout the semester.
air_visit_data.csv: historical data on visitors per restaurant from the AirREGI dataset. We learned that this will be our training data throughout the data manipulation since it is the only dataset with visitor information.
air_reserve.csv: historical data on reservations from the AirREGI dataset.
hpg_reserve.csv: historical data on reservations for the HPG (Hot Pepper Gourmet) dataset.
air_store_info.csv: Information about the genre (type of cuisines) and location of each store/restaurant in the AirREGI dataset.
hpg_store_info.csv: Information about the genre (type of cuisines) and location of each store/restaurant in the HPG dataset. store_id_relation.csv connects the codes for each restaurant for both the AirREGI and HPG datasets.
date_info.csv Provides information on holiday dates in Japan. sample_submission.csv This is the submission file. It helps structure the deliverable for final submission, matching store ids and visitor forecasts.
There were significant transformations to the datasets to set up the data frames to the required format. Dates were set for relatively easy manipulation, but using “ts” or “tsibbles” between fable and the forecast packages became a massive challenge throughout the semester. However, I was able to borrow from multiple notebooks, and in the end, I was partially successful in manipulating the data in some of the desired ways.
Load the datasets.
# Air
air_res_tib <- as.tibble(fread('data/air_reserve.csv'))
air_store_tib <- as.tibble(fread('data/air_store_info.csv'))
air_visits_tib <- as.tibble(fread('data/air_visit_data.csv'))
# HPG
hpg_res_tib <- as.tibble(fread('data/hpg_reserve.csv'))
hpg_store_tib <- as.tibble(fread('data/hpg_store_info.csv'))
# other to melt
dates_tib <- as.tibble(fread('data/date_info.csv'))
store_id_rel_tib <- as.tibble(fread('data/store_id_relation.csv'))
# Submission -
test_tib <- as.tibble(fread('data/sample_submission.csv'))
From the initial discoveries, we notice that the earliest day was January 1st, 2016, and it ended on April 22nd, 2017.
The number of visitors varies widely, from 1 to 877 as the max, but the median is around 17 visitors at any given day per store/location.
We also found 829 “unique” stores or codes per restaurant.
Reservations from the AirREGI dataset have a maximum of 100 per day per store.
# Air visits
knitr::kable(summary(air_visits_tib), "html",options(knitr.kable.NA = "**")) %>%
kable_styling("striped", position = "left")
| air_store_id | visit_date | visitors | |
|---|---|---|---|
| Length:252108 | Min. :2016-01-01 | Min. : 1.00 | |
| Class :character | 1st Qu.:2016-07-23 | 1st Qu.: 9.00 | |
| Mode :character | Median :2016-10-23 | Median : 17.00 | |
| ** | Mean :2016-10-12 | Mean : 20.97 | |
| ** | 3rd Qu.:2017-01-24 | 3rd Qu.: 29.00 | |
| ** | Max. :2017-04-22 | Max. :877.00 |
glimpse(air_visits_tib)
## Rows: 252,108
## Columns: 3
## $ air_store_id <chr> "air_ba937bf13d40fb24", "air_ba937bf13d40fb24", "air_ba93~
## $ visit_date <date> 2016-01-13, 2016-01-14, 2016-01-15, 2016-01-16, 2016-01-~
## $ visitors <int> 25, 32, 29, 22, 6, 9, 31, 21, 18, 26, 21, 11, 24, 21, 26,~
nrows_air_v <- air_visits_tib %>%
distinct(air_store_id) %>%
nrow()
knitr::kable(paste0("the number of distinct stores is ", nrows_air_v), "html")%>%
kable_styling("striped", position = "left")
| x |
|---|
| the number of distinct stores is 829 |
#-----------------------------------------
# Air reservations
knitr::kable(summary(air_res_tib), "html",options(knitr.kable.NA = "**")) %>%
kable_styling("striped", position = "left")
| air_store_id | visit_datetime | reserve_datetime | reserve_visitors | |
|---|---|---|---|---|
| Length:92378 | Min. :2016-01-01 19:00:00 | Min. :2016-01-01 01:00:00 | Min. : 1.000 | |
| Class :character | 1st Qu.:2016-11-15 19:00:00 | 1st Qu.:2016-11-07 17:00:00 | 1st Qu.: 2.000 | |
| Mode :character | Median :2017-01-05 18:00:00 | Median :2016-12-27 22:00:00 | Median : 3.000 | |
| ** | Mean :2016-12-05 08:18:58 | Mean :2016-11-27 01:13:07 | Mean : 4.482 | |
| ** | 3rd Qu.:2017-03-03 19:00:00 | 3rd Qu.:2017-02-26 18:00:00 | 3rd Qu.: 5.000 | |
| ** | Max. :2017-05-31 21:00:00 | Max. :2017-04-22 23:00:00 | Max. :100.000 |
glimpse(air_res_tib)
## Rows: 92,378
## Columns: 4
## $ air_store_id <chr> "air_877f79706adbfb06", "air_db4b38ebe7a7ceff", "air_~
## $ visit_datetime <dttm> 2016-01-01 19:00:00, 2016-01-01 19:00:00, 2016-01-01~
## $ reserve_datetime <dttm> 2016-01-01 16:00:00, 2016-01-01 19:00:00, 2016-01-01~
## $ reserve_visitors <int> 1, 3, 6, 2, 5, 2, 4, 2, 2, 2, 3, 3, 2, 6, 7, 41, 13, ~
nrows_air_res <- air_res_tib %>%
distinct(air_store_id) %>%
nrow()
knitr::kable(paste0("the number of rows is ", nrows_air_res), "html")%>%
kable_styling("striped", position = "left")
| x |
|---|
| the number of rows is 314 |
#-----------------------------------------
# Air Stores
knitr::kable(summary(air_store_tib), "html",options(knitr.kable.NA = "**")) %>%
kable_styling("striped", position = "left")
| air_store_id | air_genre_name | air_area_name | latitude | longitude | |
|---|---|---|---|---|---|
| Length:829 | Length:829 | Length:829 | Min. :33.21 | Min. :130.2 | |
| Class :character | Class :character | Class :character | 1st Qu.:34.70 | 1st Qu.:135.3 | |
| Mode :character | Mode :character | Mode :character | Median :35.66 | Median :139.7 | |
| ** | ** | ** | Mean :35.65 | Mean :137.4 | |
| ** | ** | ** | 3rd Qu.:35.69 | 3rd Qu.:139.8 | |
| ** | ** | ** | Max. :44.02 | Max. :144.3 |
glimpse(air_store_tib)
## Rows: 829
## Columns: 5
## $ air_store_id <chr> "air_0f0cdeee6c9bf3d7", "air_7cc17a324ae5c7dc", "air_fe~
## $ air_genre_name <chr> "Italian/French", "Italian/French", "Italian/French", "~
## $ air_area_name <chr> "HyÅ\215go-ken KÅ\215be-shi KumoidÅ\215ri", "HyÅ\215go-~
## $ latitude <dbl> 34.69512, 34.69512, 34.69512, 34.69512, 35.65807, 35.65~
## $ longitude <dbl> 135.1979, 135.1979, 135.1979, 135.1979, 139.7516, 139.7~
nrows_air_st <- air_store_tib %>%
distinct(air_store_id) %>%
nrow()
knitr::kable(paste0("the number of rows is ", nrows_air_st), "html")%>%
kable_styling("striped", position = "left")
| x |
|---|
| the number of rows is 829 |
#-----------------------------------------
# HPG reservations
knitr::kable(summary(hpg_res_tib), "html",options(knitr.kable.NA = "**")) %>%
kable_styling("striped", position = "left")
| hpg_store_id | visit_datetime | reserve_datetime | reserve_visitors | |
|---|---|---|---|---|
| Length:2000320 | Min. :2016-01-01 11:00:00 | Min. :2016-01-01 00:00:00 | Min. : 1.000 | |
| Class :character | 1st Qu.:2016-06-26 19:00:00 | 1st Qu.:2016-06-21 12:00:00 | 1st Qu.: 2.000 | |
| Mode :character | Median :2016-11-19 20:00:00 | Median :2016-11-10 20:00:00 | Median : 3.000 | |
| ** | Mean :2016-10-15 06:55:20 | Mean :2016-10-07 19:57:59 | Mean : 5.074 | |
| ** | 3rd Qu.:2017-02-03 19:00:00 | 3rd Qu.:2017-01-27 13:00:00 | 3rd Qu.: 6.000 | |
| ** | Max. :2017-05-31 23:00:00 | Max. :2017-04-22 23:00:00 | Max. :100.000 |
glimpse(hpg_res_tib)
## Rows: 2,000,320
## Columns: 4
## $ hpg_store_id <chr> "hpg_c63f6f42e088e50f", "hpg_dac72789163a3f47", "hpg_~
## $ visit_datetime <dttm> 2016-01-01 11:00:00, 2016-01-01 13:00:00, 2016-01-01~
## $ reserve_datetime <dttm> 2016-01-01 09:00:00, 2016-01-01 06:00:00, 2016-01-01~
## $ reserve_visitors <int> 1, 3, 2, 5, 13, 2, 2, 2, 2, 6, 2, 2, 2, 2, 5, 4, 2, 4~
nrows_hpg_res <- air_store_tib %>%
distinct(air_store_id) %>%
nrow()
knitr::kable(paste0("the number of rows is ", nrows_hpg_res), "html")%>%
kable_styling("striped", position = "left")
| x |
|---|
| the number of rows is 829 |
#-----------------------------------------
This is the code for formatting the dates for almost all purposes.
# Re-format the dates.
# Although the tibbles are already reading as tables
# we need to re-format further
# from be my guest
# Air dataset
air_visits_tib <- air_visits_tib %>%
mutate(visit_date = ymd(visit_date))
air_res_tib <- air_res_tib %>%
mutate(visit_datetime = ymd_hms(visit_datetime),
reserve_datetime = ymd_hms(reserve_datetime))
air_store_tib <- air_store_tib %>%
mutate(air_genre_name = as.factor(air_genre_name),
air_area_name = as.factor(air_area_name))
# Dates and Holidays
dates_tib <- dates_tib %>%
mutate(holiday_flg = as.logical(holiday_flg),
date = ymd(calendar_date))
# Holidays and air visits (since it will be the target variable)
holidays <- dates_tib %>% filter(holiday_flg == 1)
air_visits_tib$visit_date <- as.Date(air_visits_tib$visit_date)
air_visits_tib$is_holiday <- factor(ifelse(air_visits_tib$visit_date %in% holidays$calendar_date, 1, 0))
holiday <-
# HPG dataset
hpg_res_tib <- hpg_res_tib %>%
mutate(visit_datetime = ymd_hms(visit_datetime),
reserve_datetime = ymd_hms(reserve_datetime))
hpg_store_tib <- hpg_store_tib %>%
mutate(hpg_genre_name = as.factor(hpg_genre_name),
hpg_area_name = as.factor(hpg_area_name))
We had to create an aggregate dataset of all restaurants to assess modeling. Because there are 829 different stores, it would be nearly impossible to assess store by store on their traffic throughout the period analyzed. Instead, we approached this data from the sum of all stores by day. From just a visual point of view, we found a vital seasonal component, naturally, from a daily perspective.
As seen from the figure above, there appears to be a structural change in July 2016. We would have to control for those changes when modeling. We then looked at the overall distribution of visitors for daily visits. While they are somewhat normally distributed, we logged the data to scale it and get a more apparent normal distribution.
We then looked at the average visitors by day—not the median. The purpose of this exercise is to try to identify any patterns. Not surprisingly, we found that weekends are usually the highest traffic peaking on Saturday.
We also looked at visitors monthly, and unsurprisingly, December tends to be the month with the highest visitor traffic.
Finally, we noticed many more stores registered under Hot Pepper Gourmet than AirREGI in terms of genre. In other words, there are more categories and more stores reporting information to the HPG database.
# air visits - Line Graph (structural change)
air_visits_tib %>%
group_by(visit_date) %>%
summarise(all_visitors = sum(visitors), .groups = 'drop') %>%
ggplot(aes(visit_date, all_visitors)) +
geom_line(col = "dodgerblue3", size = 0.7
) +
geom_smooth(method = "loess", col = "orangered", size = 0.5, span =1/7, formula = y ~ x) +
labs(x = "All visitors"
, y = "Date"
, title = "Visitors by day"
, subtitle = "AirReg") +
theme_ipsum_es()
# Identify Holidays
d1 <- air_visits_tib %>%
group_by(visit_date, is_holiday) %>%
summarise(all_visitors = sum(visitors), .groups = 'drop')
d2 <- d1 %>% filter(is_holiday == 1)
ggplot() +
geom_point(aes(x = visit_date, y = all_visitors), data = d2, color = 'orangered', size = 2) +
geom_line(aes(x = visit_date, y = all_visitors), data = d1,color = 'dodgerblue3', alpha = 0.6) +
labs(y = 'All visitors', x = 'Date',
title = 'Total Visits by date'
, subtitle = "AirReg") +
theme_ipsum_es() +
scale_x_date(date_breaks = '1 month') +
theme(axis.text.x = element_text(angle=90, hjust = 1))
# Distribution of visits not as tibble, but as vector
#air_visits_nt <- fread('data/air_visit_data.csv') #nt = not tibble
air_visits_tib %>%
group_by(visit_date) %>%
summarise(total_visits = sum(visitors), .groups = 'drop') %>%
ggplot(aes(x = total_visits)) +
geom_histogram(fill = "orangered", col=I("orangered"), bins = 20, alpha = 0.5, size = 1) +
theme_ipsum_es() +
labs(x = 'Daily visits'
,y = "Number of Visitors"
,title = 'Distribution of daily visits, not logged'
,subtitle = "AirReg")
# Distribution of visits, logged
# Needed to do the log transformation to get a normal distribution
air_visits_tib %>%
ggplot(aes(visitors)) +
geom_vline(xintercept = 20, color = "dodgerblue", size = 2) +
geom_histogram(fill = "orangered", col=I("orangered"), bins = 20, alpha = 0.5, size = 1) +
scale_x_log10() +
labs(x = 'Daily visits'
,title = 'Distribution of daily visits, logged'
,subtitle = "AirReg") +
theme_ipsum_es()
# by day of the week
air_visits_tib %>%
mutate(wday = weekdays(visit_date)) %>%
group_by(wday) %>%
summarise(visits = mean(visitors), .groups = 'drop') %>%
ggplot(aes(wday, visits, fill = wday)) +
theme_ipsum_es() +
geom_col(show.legend = FALSE, fill = "orangered", col=I("orangered"), bins = 20, alpha = 0.5, size = 1) +
theme(legend.position = "none", axis.text.x = element_text(angle=90, hjust=1, vjust=1)) +
labs(x = "Day of the week"
, y = "Average visitors"
, title = "Average Visitors by Weekday"
, subtitle = "AirReg")
# by month
air_visits_tib %>%
mutate(month = months(visit_date)) %>%
group_by(month) %>%
summarise(visits = mean(visitors), .groups = 'drop') %>%
ggplot(aes(month, visits, fill = month)) +
theme_ipsum_es() +
geom_col(show.legend = FALSE, fill = "orangered", col=I("orangered"), bins = 20, alpha = 0.5, size = 1) +
scale_x_discrete(limits = month.name) +
theme(legend.position = "none", axis.text.x = element_text(angle=90, hjust=1, vjust=1))+
labs(x = "Month Number"
, y = "Average Visitors"
, title = "Monthly Average Restaurant Visitors"
, subtitle = "AirReg")
# Target Variable
air_visits_tib %>%
filter(visit_date > ymd("2016-04-15") & visit_date < ymd("2016-06-15")) %>%
group_by(visit_date) %>%
summarise(all_visitors = sum(visitors), .groups = 'drop') %>%
ggplot(aes(visit_date,all_visitors)) +
geom_line(color = 'dodgerblue3', size = 1) +
geom_smooth(method = "loess", color = "orangered", size = 0.5, span = 1/7,formula = y ~ x) + #1/7 for the 7 days for weekdays
labs(x = "All visitors"
, y = "Date"
, title = " All visitors"
, subtitle = "AirReg") +
theme_ipsum_es()
# HPG vs AirReg
t1 <- air_store_tib %>%
group_by(air_genre_name) %>%
summarise(N = n(), .groups = 'drop') %>%
ggplot(aes(x = fct_reorder(air_genre_name, N), y = N)) +
geom_bar(stat= 'identity'
, fill = 'orangered'
, col=I("orangered")
, size = 0.5
, alpha = 0.5) +
labs(x = ''
, y = 'Number of stores'
, title = 'AirREGI') +
coord_flip() +
theme_bw()
t2 <- hpg_store_tib %>%
group_by(hpg_genre_name) %>%
summarise(N = n(), .groups = 'drop') %>%
ggplot(aes(x = fct_reorder(hpg_genre_name, N), y = N)) +
geom_bar(stat= 'identity'
, fill = 'dodgerblue3'
, col=I("dodgerblue3")
, size = 0.5
, alpha = 0.5) +
labs(x = ''
, y = 'Number of stores'
, title = 'HP Gourmet', size = 5) +
coord_flip() +
theme_bw()
grid.arrange(t2, t1, ncol = 2)
# map
leaflet(air_store_tib) %>%
addTiles() %>%
addProviderTiles("CartoDB.Positron") %>%
addMarkers(~longitude, ~latitude,
popup = ~air_store_id, label = ~air_genre_name,
clusterOptions = markerClusterOptions())
When decomposing the aggregated data mentioned above, we noticed that our ACF declined gradually, and our PACF dropped suddenly after the 8th lag. We ran a KPSS test to assess stationarity and see the differences needed.
Once differenced, our KPSS test confirmed that one difference would suffice. Here our ACF and PACF showed strong patterns, which can help us assess some of the parameters in our modeling later on. The ACF showed statistically significant correlations not only every seven days but every three too.
### Stationarity, PACF and ACF
# set up aggregate
air_agg_ts <- air_visits_tib %>%
group_by(visit_date, is_holiday) %>%
summarise(all_visitors = sum(visitors), .groups="drop") %>%
as_tsibble(index = visit_date)
# ACF and PACF
air_visit_ts <- ts(subset(air_agg_ts, select = "all_visitors", frequency =7))
ggtsdisplay(air_visit_ts, main = "AirReg Visits")
# Differences
air_agg_ts %>%
gg_tsdisplay(difference(all_visitors), plot_type = 'partial') +
labs(title = "AirReg visitors Differenced (1)")
# Check for stationarity
air_agg_ts$all_visitors %>%
unitroot_kpss() %>%
kbl()%>%
kable_material(c("striped","hover", full_width = T))
| x | |
|---|---|
| kpss_stat | 6.380794 |
| kpss_pvalue | 0.010000 |
# One Difference
air_agg_ts$all_visitors %>%
diff() %>%
unitroot_kpss()%>%
kbl()%>%
kable_material(c("striped","hover", full_width = T))
| x | |
|---|---|
| kpss_stat | 0.0175277 |
| kpss_pvalue | 0.1000000 |
We then started running the models. We started with automated results from the “forecast” package. But before doing that, we arbitrarily decided to run the model after the structural change seen in July 2016. In other words, our train dataset cut date began after July 2016.
# Borrowed from: https://www.kaggle.com/code/mymy610/forecasting-models-arima-ets
# set dates for train and test
cut1 <- air_agg_ts %>%
filter(visit_date >='2016-07-01')
par(mfrow=c(1,1))
plot(cut1$all_visitors, type = 'l', main = "Train data post July 2016")
cut1.train <- cut1 %>%
filter(visit_date < '2017-03-01')
cut1.test <- cut1 %>%
filter(visit_date >= '2017-03-01')
#-----------------------------------------#
# Check ACF and PACF, again
par(mfrow=c(1,2))
acf(cut1$all_visitors, main = "ACF post July '16")
pacf(cut1$all_visitors,main = "PACF post July '16")
The results from the auto ARIMA showed our p, d, q components at 2, 0, 3, respectively. This result is consistent with the visual and empirical assessment made prior using the ACF and PACF. Our residuals were also normally distributed with only a few outliers. When plotting the forecast, it appeared, in many ways, reasonable. Our manual models use diffrent p, d, q parameters with diffrent results.
#-----------------------------------------#
# AUTO ARIMA
cut1.arima <- auto.arima(ts(cut1.train$all_visitors,frequency = 7))
summary(cut1.arima)
## Series: ts(cut1.train$all_visitors, frequency = 7)
## ARIMA(2,0,3)(2,1,0)[7]
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 sar1 sar2
## -0.3419 0.6301 1.1406 -0.0159 -0.1910 -0.4912 -0.3604
## s.e. 0.1669 0.1583 0.1885 0.3109 0.1417 0.0626 0.0610
##
## sigma^2 estimated as 2877590: log likelihood=-2088.19
## AIC=4192.38 AICc=4193.01 BIC=4220.09
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 8.187886 1646.755 1099.071 -1.35863 8.691962 0.708948 0.004408588
checkresiduals(cut1.arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,0,3)(2,1,0)[7]
## Q* = 13.654, df = 7, p-value = 0.05769
##
## Model df: 7. Total lags used: 14
cut1.arima.fc <- forecast(cut1.arima, h=53)
acc.cut1.arima <- accuracy(cut1.arima.fc, cut1.test$all_visitors)
# Plot AUTO ARIMA
par(mfrow=c(1,1))
plot(forecast(cut1.arima, h=53))
#-----------------------------------------#
# ARIMA, (7,1,7)
cut1.arima_1 <- arima(ts(cut1.train$all_visitors, frequency = 7),order = c(7,1,7))
summary(cut1.arima_1)
##
## Call:
## arima(x = ts(cut1.train$all_visitors, frequency = 7), order = c(7, 1, 7))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ar7 ma1
## -0.3337 -0.3400 -0.3978 -0.2863 -0.4303 -0.2817 0.5956 0.0104
## s.e. 0.0576 0.0568 0.0580 0.0578 0.0595 0.0587 0.0585 0.0357
## ma2 ma3 ma4 ma5 ma6 ma7
## -0.1182 0.1196 -0.1077 0.1099 -0.0441 -0.9698
## s.e. NaN 0.0352 NaN NaN 0.0288 NaN
##
## sigma^2 estimated as 2235114: log likelihood = -2124.1, aic = 4278.21
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -25.98186 1491.951 968.3998 -1.754433 7.61437 0.3465998 0.06467093
checkresiduals(cut1.arima_1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(7,1,7)
## Q* = 19.388, df = 3, p-value = 0.0002273
##
## Model df: 14. Total lags used: 17
cut1.arima_1.fc <- forecast(cut1.arima_1, h=53)
acc.cut1.arima_1 <- accuracy(cut1.arima_1.fc, cut1.test$all_visitors)
# Plot ARIMA (7,1,7)
par(mfrow=c(1,1))
plot(forecast(cut1.arima_1, h=53))
Similarly, we ran an ETS model to adjust for seasonality. Manually, we would see that the error component could be either additive or multiplicative; the trend component would likely be absent, while the seasonal component would be additive. We obtained an optimal (M,N,A) when running the automated ETS model.
# AUTO ETS
cut1.ets <- ets(ts(cut1.train$all_visitors,frequency = 7), model = 'ZZZ')
summary(cut1.ets)
## ETS(M,N,A)
##
## Call:
## ets(y = ts(cut1.train$all_visitors, frequency = 7), model = "ZZZ")
##
## Smoothing parameters:
## alpha = 0.3479
## gamma = 1e-04
##
## Initial states:
## l = 14451.5607
## s = -543.4404 -727.7131 -2170.374 -3835.75 -494.0252 4882.413
## 2888.889
##
## sigma: 0.1124
##
## AIC AICc BIC
## 4912.763 4913.711 4947.694
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 9.097243 1701.823 982.1447 -1.869905 8.162042 0.6335257 0.3596927
checkresiduals(cut1.ets)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,A)
## Q* = 52.576, df = 5, p-value = 4.11e-10
##
## Model df: 9. Total lags used: 14
cut1.ets.fc <- forecast(cut1.ets, h=53)
acc.cut1.ets <- accuracy(cut1.ets.fc,cut1.test$all_visitors)
# Plot Auto ETS
par(mfrow=c(1,1))
plot(forecast(cut1.ets, h=53))
#-----------------------------------------#
# ETS Manual
cut1.ets_1 <- ets(ts(cut1.train$all_visitors,frequency = 7), model = 'MAA')
summary(cut1.ets_1)
## ETS(M,Ad,A)
##
## Call:
## ets(y = ts(cut1.train$all_visitors, frequency = 7), model = "MAA")
##
## Smoothing parameters:
## alpha = 0.3467
## beta = 1e-04
## gamma = 1e-04
## phi = 0.9799
##
## Initial states:
## l = 14811.0925
## b = -47.1556
## s = -543.7205 -727.7204 -2118.895 -3835.223 -435.9855 4883.788
## 2777.756
##
## sigma: 0.1135
##
## AIC AICc BIC
## 4919.701 4921.290 4965.111
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 31.51349 1706.079 991.7598 -1.729896 8.225768 0.6397279 0.359037
checkresiduals(cut1.ets_1)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,Ad,A)
## Q* = 54.935, df = 3, p-value = 7.09e-12
##
## Model df: 12. Total lags used: 15
cut1.ets_1.fc <- forecast(cut1.ets_1, h=53)
acc.cut1.ets_1 <- accuracy(cut1.ets_1.fc,cut1.test$all_visitors)
# Plot ETS Manual
par(mfrow=c(1,1))
plot(forecast(cut1.ets_1, h=53))
We also ran this model in the automated form. Here we obtained different results when forecasting. First, our ACF found no statistical significance with any lags. Though the residuals were generally normally distributed, there were more outliers than the other two models. It still captured the S (frequency) at seven days.
#-----------------------------------------#
# AUTO-NNET
cut1.nnet <- nnetar(ts(cut1.train$all_visitors,frequency = 7))
summary(cut1.nnet)
## Length Class Mode
## x 243 ts numeric
## m 1 -none- numeric
## p 1 -none- numeric
## P 1 -none- numeric
## scalex 2 -none- list
## size 1 -none- numeric
## subset 243 -none- numeric
## model 20 nnetarmodels list
## nnetargs 0 -none- list
## fitted 243 ts numeric
## residuals 243 ts numeric
## lags 16 -none- numeric
## series 1 -none- character
## method 1 -none- character
## call 2 -none- call
checkresiduals(cut1.nnet)
cut1.nnet.fc <- forecast(cut1.nnet, h=53)
acc.cut1.nnet <- accuracy(cut1.nnet.fc,cut1.test$all_visitors)
# Plot Auto ETS
par(mfrow=c(1,1))
plot(forecast(cut1.nnet, h=53))
#-----------------------------------------#
#######
# KEEP ADDING MANUAL MODELS AND INCORPORATE THEM INTO THE COMPARISON TABLE
We then performed manual alterations to the models based on the ACF and PACF previously assessed. The results of the train and test data are below. As we can see, the Neural Net forecast performed best in the training data but quite bad in the test set. We chose the ETS (M,N,A) model because of its low RMSE in the test set compared to the other models.
#-----------------------------------------#
# Compare Models, Auto and Manual
results.cut1 <- matrix(c(acc.cut1.arima[1,2:3],acc.cut1.arima[2,2:3],
acc.cut1.ets[1,2:3],acc.cut1.ets[2,2:3],
acc.cut1.nnet[1,2:3],acc.cut1.nnet[2,2:3],
acc.cut1.arima_1[1,2:3],acc.cut1.arima_1[2,2:3],
acc.cut1.ets_1[1,2:3],acc.cut1.ets_1[2,2:3])
, nrow=5, ncol=4, byrow = T)
colnames(results.cut1)=c("RMSE-train","MAE-train","RMSE-test","MAE-test")
rownames(results.cut1)=c("Auto ARIMA (2,0,3)[7]"
,"Auto ETS (M,N,A)"
,"Neural Net"
,"Manual ARIMA (7,1,7)"
,"Auto ETS (A,Ad,A) Damped")
#as.table(results.cut1)
results.cut1 %>%
kbl()%>%
kable_material(c("striped","hover", full_width = T))
| RMSE-train | MAE-train | RMSE-test | MAE-test | |
|---|---|---|---|---|
| Auto ARIMA (2,0,3)[7] | 1646.7554 | 1099.0706 | 1404.265 | 1073.129 |
| Auto ETS (M,N,A) | 1701.8227 | 982.1447 | 1356.920 | 1052.206 |
| Neural Net | 345.8965 | 231.2837 | 3908.441 | 2962.862 |
| Manual ARIMA (7,1,7) | 1491.9510 | 968.3998 | 2033.491 | 1665.223 |
| Auto ETS (A,Ad,A) Damped | 1706.0789 | 991.7598 | 1372.771 | 1062.731 |
# Model Chosen
cut2 <- d1 %>%
filter(visit_date >= '2016-07-01')
model <- ets(ts(cut2$all_visitors,frequency = 7), model = 'ZZZ')
summary(model)
## ETS(M,N,A)
##
## Call:
## ets(y = ts(cut2$all_visitors, frequency = 7), model = "ZZZ")
##
## Smoothing parameters:
## alpha = 0.3451
## gamma = 2e-04
##
## Initial states:
## l = 14498.7956
## s = -674.6785 -915.6947 -2200.504 -3803.51 -479.7646 5053.785
## 3020.367
##
## sigma: 0.1066
##
## AIC AICc BIC
## 6020.765 6021.537 6057.669
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 13.01053 1623.717 954.0923 -1.589591 7.667284 0.6302864 0.3486499
model.fc <- forecast(model, h=53)
autoplot(model.fc) +
labs(y = "Aggregate Visitors"
,title = "Forecast from ETS (M,N,A)"
,subtitle = "Aggregate, from all stores")+
theme_ipsum_es()
Because of my inability to produce a forecast by store using the methods above, I researched how to get several iterations of specific models by store id. I learned about parallel processing and the packages needed to perform such a task. I was able to run auto-ARIMA, auto-ETS, auto-SES, Holts-Winter, and Neural Nets on a look using parallel processing. While I borrowed this from https://www.kaggle.com/code/dongxu027/time-series-ets-nn-arima-ses-lb-0-567, the concept was largely understood.
air_visit$visit_date <- as.Date(parse_date_time(air_visit$visit_date,'%y-%m-%d'))
train_wide <- dcast( #pivots table to apply iterations by store
air_visit, air_store_id ~ visit_date, value.var = "visitors", fill = 0)
train_ts <- ts(train_wide[, 2:dim(train_wide)[2]], frequency = 7)
fcst_intv = 39
fcst_matrix <- matrix(NA,nrow=4*nrow(train_ts),ncol=fcst_intv)
### Forecasting --> about 30 min to run
# registerDoMC(detectCores()-1)
# fcst_matrix <- foreach(i=1:nrow(train_ts),.combine=rbind, .packages=c("forecast")) %do% {
# fcst_ets <- forecast(ets(train_ts[i,]),h=fcst_intv)$mean
# fcst_nnet <- forecast(nnetar(train_ts[i,]),h=fcst_intv)$mean
# fcst_arima <- forecast(auto.arima(train_ts[i,]),h=fcst_intv)$mean
# fcst_ses <- forecast(HoltWinters(train_ts[i,], beta=FALSE, gamma=FALSE),h=fcst_intv)$mean
# fcst_matrix <- rbind(fcst_ets, fcst_nnet, fcst_arima, fcst_ses)
# }
## Prepare for Submission
### post-processing the forecast table
fcst_matrix_mix <- aggregate(fcst_matrix,list(rep(1:(nrow(fcst_matrix)/4),each=4)),mean)[-1]
fcst_matrix_mix[fcst_matrix_mix < 0] <- 0
colnames(fcst_matrix_mix) <- as.character(
seq(from = as.Date("2017-04-23"), to = as.Date("2017-05-31"), by = 'day'))
fcst_df <- as.data.frame(cbind(train_wide[, 1], fcst_matrix_mix))
colnames(fcst_df)[1] <- "air_store_id"
### melt the forecast data frame from wide to long format for final submission
fcst_df_long <- melt(
fcst_df, id = 'air_store_id', variable.name = "fcst_date", value.name = 'visitors')
fcst_df_long$air_store_id <- as.character(fcst_df_long$air_store_id)
fcst_df_long$fcst_date <- as.Date(parse_date_time(fcst_df_long$fcst_date,'%y-%m-%d'))
fcst_df_long$visitors <- as.numeric(fcst_df_long$visitors)
### get & process the sample submission file
sample_sub <- fread("data/sample_submission.csv")
sample_sub$visitors <- NULL
sample_sub$store_id <- substr(sample_sub$id, 1, 20)
sample_sub$visit_date <- substr(sample_sub$id, 22, 31)
sample_sub$visit_date <- as.Date(parse_date_time(sample_sub$visit_date,'%y-%m-%d'))
### generate the final submission file
submission <- left_join(
sample_sub, fcst_df_long, c("store_id" = "air_store_id", 'visit_date' = 'fcst_date'))
submission$visitors[is.na(submission$visitors)] <- 0
final_sub <- select(submission, c('id', 'visitors'))
write.csv(final_sub, "submission_final.csv", row.names = FALSE)
The scores were relatively decent. The public score was 0.56147, and the private score was 0.59167. I suspect that these possibly high scores directly resulted from the model’s several iterations.
There is extensive work to be done on this topic from a personal and technical perspective. First, the learned techniques have only scratched the surface of the potential use in practice. While the concepts were generally understood, more practice on these models needs to be performed. This paper only focused on univariate modeling, but we could have performed multivariate modeling using the techniques reviewed throughout the semester. These limitations bring about less than optimal results in the modeling, though the general idea is not necessarily incorrect but simply insufficient. These models will be utilized constantly and will be published in rpubs.
G. Boomija, A. Anandaraj, S. Nandhini, and S. Lavanya Restaurant Visitor Time Series Forecasting Using Autoregressive Integrated Moving Average Journal of Computational and Theoretical Nanoscience Vol. 15, 1590–1593, 2018.
Mingming Hu, Richard T.R. Qiu, Doris Chenguang Wu, Haiyan Song, Hierarchical pattern recognition for tourism demand forecasting, Tourism Management, Volume 84, 2021, 104263,ISSN 0261-5177, https://doi.org/10.1016/j.tourman.2020.104263.
Fong-Lin Chu Forecasting tourism demand with ARMA-based methods Tourism Management, 30 (5) (2009), pp. 740-751.
Dennis Reynolds, Imran Rahman, William Balinbin, Econometric modeling of the U.S. restaurant industry, International Journal of Hospitality Management, Volume 34, 2013, Pages 317-323, ISSN 0278-4319 https://doi.org/10.1016/j.ijhm.2013.04.003.
Richard T.R. Qiu, Doris Chenguang Wu, Vincent Dropsy, Sylvain Petit, Stephen Pratt, Yasuo Ohe, Visitor arrivals forecast amid COVID-19: A perspective from the Asia and Pacific team, Annals of Tourism Research, Volume 88, 2021, 103155, ISSN 0160-7383. https://doi.org/10.1016/j.annals.2021.103155.