Introduction

This paper aims to show what was learned in this course and provide a potential solution to some of the problems many restaurants face in forecasting visitor traffic. The exercise relies on Kaggle’s competition under “Recruit Restaurant Visitor Forecasting,” which takes data from two different sources, similar to Open Table or Resy in the United States. The specific goal is to submit a forecast with the data provided after exploring and transforming the necessary data. To submit a forecast, one has to perform several time-series analyses using different models, such as autoregressive integrated moving averages (ARIMA), exponential smoothing (ETS), and neural networks (NNA).

Problem and Significance

Empirically, many people know that predicting potential visitor traffic to a restaurant is probably impossible. However, by analyzing the available data, one can at least estimate potential traffic. Working in the food industry for the past 16 years as a price analyst, my experience with many in the foodservice (restaurant) sector is that it is not uncommon for restaurants to provide faulty demand forecasts to their suppliers. As a result, restaurants could have too much or too little food at any given time, and thus potentially increase costs. The same can happen to other resources, like personnel.

The approach taken in this paper focuses on univariate analyses and does not consider explanatory variables such as weather, incomes, etc.

Reviews

In some of our approaches, we focused on the seasonal patterns and the results used by some metrics, such as RMSE (G. Boomija, A. Anandaraj, S. Nandhini, and S. Lavanya, 2016). On the seasonality component, when performing an ARIMA, we also set the frequency at 7 (Mingming Hu, Richard T.R. Qiu, Doris Chenguang Wu, Haiyan Song, 2021). Many papers suggest taking an ARMA-based approach initially and then testing whether they could be integrated through an ARIMA or SARIMAX model (Fong-Lin Chu, 2009). However, many references utilize univariate time series using other models, such as ETS and simple NNK. However, other papers incorporate exogenous variables, such as income (Reynold, Rahman, Balinbinc, 2013). ETS models performed better in univariate, single models than Naïve or SARIMA (Qiu, Wu, Dropsy, Petit, Pratt, Ohe, 2021). There is an insurmountable amount of literature about forecast modeling in the tourism, restaurants, and hospitality industries. For now, we focused on what was covered throughout the semester.

The Data

Describing the data

air_visit_data.csv: historical data on visitors per restaurant from the AirREGI dataset. We learned that this will be our training data throughout the data manipulation since it is the only dataset with visitor information.

air_reserve.csv: historical data on reservations from the AirREGI dataset.

hpg_reserve.csv: historical data on reservations for the HPG (Hot Pepper Gourmet) dataset.

air_store_info.csv: Information about the genre (type of cuisines) and location of each store/restaurant in the AirREGI dataset.

hpg_store_info.csv: Information about the genre (type of cuisines) and location of each store/restaurant in the HPG dataset. store_id_relation.csv  connects the codes for each restaurant for both the AirREGI and HPG datasets.

date_info.csv Provides information on holiday dates in Japan. sample_submission.csv  This is the submission file. It helps structure the deliverable for final submission, matching store ids and visitor forecasts.

There were significant transformations to the datasets to set up the data frames to the required format. Dates were set for relatively easy manipulation, but using “ts” or “tsibbles” between fable and the forecast packages became a massive challenge throughout the semester. However, I was able to borrow from multiple notebooks, and in the end, I was partially successful in manipulating the data in some of the desired ways.

Bring in the data

Load the datasets.

# Air
air_res_tib <- as.tibble(fread('data/air_reserve.csv'))
air_store_tib <- as.tibble(fread('data/air_store_info.csv'))
air_visits_tib <- as.tibble(fread('data/air_visit_data.csv'))

# HPG
hpg_res_tib <- as.tibble(fread('data/hpg_reserve.csv'))
hpg_store_tib <- as.tibble(fread('data/hpg_store_info.csv'))

# other to melt
dates_tib <- as.tibble(fread('data/date_info.csv'))
store_id_rel_tib <- as.tibble(fread('data/store_id_relation.csv'))

# Submission - 
test_tib <- as.tibble(fread('data/sample_submission.csv'))

Data Structure

From the initial discoveries, we notice that the earliest day was January 1st, 2016, and it ended on April 22nd, 2017.

The number of visitors varies widely, from 1 to 877 as the max, but the median is around 17 visitors at any given day per store/location.

We also found 829 “unique” stores or codes per restaurant.

Reservations from the AirREGI dataset have a maximum of 100 per day per store.

# Air visits

knitr::kable(summary(air_visits_tib), "html",options(knitr.kable.NA = "**")) %>%
kable_styling("striped", position = "left")

air_store_id	visit_date	visitors
Length:252108	Min. :2016-01-01	Min. : 1.00
Class :character	1st Qu.:2016-07-23	1st Qu.: 9.00
Mode :character	Median :2016-10-23	Median : 17.00
**	Mean :2016-10-12	Mean : 20.97
**	3rd Qu.:2017-01-24	3rd Qu.: 29.00
**	Max. :2017-04-22	Max. :877.00

glimpse(air_visits_tib)

## Rows: 252,108
## Columns: 3
## $ air_store_id <chr> "air_ba937bf13d40fb24", "air_ba937bf13d40fb24", "air_ba93~
## $ visit_date   <date> 2016-01-13, 2016-01-14, 2016-01-15, 2016-01-16, 2016-01-~
## $ visitors     <int> 25, 32, 29, 22, 6, 9, 31, 21, 18, 26, 21, 11, 24, 21, 26,~

nrows_air_v <- air_visits_tib %>%
  distinct(air_store_id) %>%
  nrow()

knitr::kable(paste0("the number of distinct stores is ", nrows_air_v), "html")%>%
kable_styling("striped", position = "left")

x
the number of distinct stores is 829

#-----------------------------------------
# Air reservations

knitr::kable(summary(air_res_tib), "html",options(knitr.kable.NA = "**")) %>%
kable_styling("striped", position = "left")

air_store_id	visit_datetime	reserve_datetime	reserve_visitors
Length:92378	Min. :2016-01-01 19:00:00	Min. :2016-01-01 01:00:00	Min. : 1.000
Class :character	1st Qu.:2016-11-15 19:00:00	1st Qu.:2016-11-07 17:00:00	1st Qu.: 2.000
Mode :character	Median :2017-01-05 18:00:00	Median :2016-12-27 22:00:00	Median : 3.000
**	Mean :2016-12-05 08:18:58	Mean :2016-11-27 01:13:07	Mean : 4.482
**	3rd Qu.:2017-03-03 19:00:00	3rd Qu.:2017-02-26 18:00:00	3rd Qu.: 5.000
**	Max. :2017-05-31 21:00:00	Max. :2017-04-22 23:00:00	Max. :100.000

glimpse(air_res_tib)

## Rows: 92,378
## Columns: 4
## $ air_store_id     <chr> "air_877f79706adbfb06", "air_db4b38ebe7a7ceff", "air_~
## $ visit_datetime   <dttm> 2016-01-01 19:00:00, 2016-01-01 19:00:00, 2016-01-01~
## $ reserve_datetime <dttm> 2016-01-01 16:00:00, 2016-01-01 19:00:00, 2016-01-01~
## $ reserve_visitors <int> 1, 3, 6, 2, 5, 2, 4, 2, 2, 2, 3, 3, 2, 6, 7, 41, 13, ~

nrows_air_res <- air_res_tib %>%
  distinct(air_store_id) %>%
  nrow()

knitr::kable(paste0("the number of rows is ", nrows_air_res), "html")%>%
kable_styling("striped", position = "left")

x
the number of rows is 314

#-----------------------------------------
# Air Stores

knitr::kable(summary(air_store_tib), "html",options(knitr.kable.NA = "**")) %>%
kable_styling("striped", position = "left")

air_store_id	air_genre_name	air_area_name	latitude	longitude
Length:829	Length:829	Length:829	Min. :33.21	Min. :130.2
Class :character	Class :character	Class :character	1st Qu.:34.70	1st Qu.:135.3
Mode :character	Mode :character	Mode :character	Median :35.66	Median :139.7
**	**	**	Mean :35.65	Mean :137.4
**	**	**	3rd Qu.:35.69	3rd Qu.:139.8
**	**	**	Max. :44.02	Max. :144.3

glimpse(air_store_tib)

## Rows: 829
## Columns: 5
## $ air_store_id   <chr> "air_0f0cdeee6c9bf3d7", "air_7cc17a324ae5c7dc", "air_fe~
## $ air_genre_name <chr> "Italian/French", "Italian/French", "Italian/French", "~
## $ air_area_name  <chr> "HyÅ\215go-ken KÅ\215be-shi KumoidÅ\215ri", "HyÅ\215go-~
## $ latitude       <dbl> 34.69512, 34.69512, 34.69512, 34.69512, 35.65807, 35.65~
## $ longitude      <dbl> 135.1979, 135.1979, 135.1979, 135.1979, 139.7516, 139.7~

nrows_air_st <- air_store_tib %>%
  distinct(air_store_id) %>%
  nrow()

knitr::kable(paste0("the number of rows is ", nrows_air_st), "html")%>%
kable_styling("striped", position = "left")

x
the number of rows is 829

#-----------------------------------------
# HPG reservations

knitr::kable(summary(hpg_res_tib), "html",options(knitr.kable.NA = "**")) %>%
kable_styling("striped", position = "left")

hpg_store_id	visit_datetime	reserve_datetime	reserve_visitors
Length:2000320	Min. :2016-01-01 11:00:00	Min. :2016-01-01 00:00:00	Min. : 1.000
Class :character	1st Qu.:2016-06-26 19:00:00	1st Qu.:2016-06-21 12:00:00	1st Qu.: 2.000
Mode :character	Median :2016-11-19 20:00:00	Median :2016-11-10 20:00:00	Median : 3.000
**	Mean :2016-10-15 06:55:20	Mean :2016-10-07 19:57:59	Mean : 5.074
**	3rd Qu.:2017-02-03 19:00:00	3rd Qu.:2017-01-27 13:00:00	3rd Qu.: 6.000
**	Max. :2017-05-31 23:00:00	Max. :2017-04-22 23:00:00	Max. :100.000

glimpse(hpg_res_tib)

## Rows: 2,000,320
## Columns: 4
## $ hpg_store_id     <chr> "hpg_c63f6f42e088e50f", "hpg_dac72789163a3f47", "hpg_~
## $ visit_datetime   <dttm> 2016-01-01 11:00:00, 2016-01-01 13:00:00, 2016-01-01~
## $ reserve_datetime <dttm> 2016-01-01 09:00:00, 2016-01-01 06:00:00, 2016-01-01~
## $ reserve_visitors <int> 1, 3, 2, 5, 13, 2, 2, 2, 2, 6, 2, 2, 2, 2, 5, 4, 2, 4~

nrows_hpg_res <- air_store_tib %>%
  distinct(air_store_id) %>%
  nrow()

knitr::kable(paste0("the number of rows is ", nrows_hpg_res), "html")%>%
kable_styling("striped", position = "left")

x
the number of rows is 829

#-----------------------------------------

Reformat Dates

This is the code for formatting the dates for almost all purposes.

# Re-format the dates. 
# Although the tibbles are already reading as tables
# we need to re-format further
# from be my guest 

# Air dataset

air_visits_tib <- air_visits_tib %>%
  mutate(visit_date = ymd(visit_date))

air_res_tib <- air_res_tib %>%
  mutate(visit_datetime = ymd_hms(visit_datetime),
         reserve_datetime = ymd_hms(reserve_datetime))

air_store_tib <- air_store_tib %>%
  mutate(air_genre_name = as.factor(air_genre_name),
         air_area_name = as.factor(air_area_name))

# Dates and Holidays
dates_tib <- dates_tib %>%
  mutate(holiday_flg = as.logical(holiday_flg),
         date = ymd(calendar_date))

# Holidays and air visits (since it will be the target variable)
holidays <- dates_tib %>% filter(holiday_flg == 1)
air_visits_tib$visit_date <- as.Date(air_visits_tib$visit_date)
air_visits_tib$is_holiday <- factor(ifelse(air_visits_tib$visit_date %in% holidays$calendar_date, 1, 0))
holiday <-

# HPG dataset

hpg_res_tib <- hpg_res_tib %>%
  mutate(visit_datetime = ymd_hms(visit_datetime),
         reserve_datetime = ymd_hms(reserve_datetime))

hpg_store_tib <- hpg_store_tib %>%
  mutate(hpg_genre_name = as.factor(hpg_genre_name),
         hpg_area_name = as.factor(hpg_area_name))

Visualization

We had to create an aggregate dataset of all restaurants to assess modeling. Because there are 829 different stores, it would be nearly impossible to assess store by store on their traffic throughout the period analyzed. Instead, we approached this data from the sum of all stores by day. From just a visual point of view, we found a vital seasonal component, naturally, from a daily perspective.

As seen from the figure above, there appears to be a structural change in July 2016. We would have to control for those changes when modeling. We then looked at the overall distribution of visitors for daily visits. While they are somewhat normally distributed, we logged the data to scale it and get a more apparent normal distribution.

We then looked at the average visitors by day—not the median. The purpose of this exercise is to try to identify any patterns. Not surprisingly, we found that weekends are usually the highest traffic peaking on Saturday.

We also looked at visitors monthly, and unsurprisingly, December tends to be the month with the highest visitor traffic.

Finally, we noticed many more stores registered under Hot Pepper Gourmet than AirREGI in terms of genre. In other words, there are more categories and more stores reporting information to the HPG database.

# air visits - Line Graph (structural change)

air_visits_tib %>%
  group_by(visit_date) %>%
  summarise(all_visitors = sum(visitors), .groups = 'drop') %>%
  ggplot(aes(visit_date, all_visitors)) +
  geom_line(col = "dodgerblue3", size = 0.7
            ) +
  geom_smooth(method = "loess", col = "orangered", size = 0.5, span =1/7, formula = y ~ x) +
  labs(x = "All visitors"
       , y = "Date"
       , title = "Visitors by day"
       , subtitle = "AirReg") +
  theme_ipsum_es()

# Identify Holidays

d1 <- air_visits_tib %>% 
  group_by(visit_date, is_holiday) %>% 
  summarise(all_visitors = sum(visitors), .groups = 'drop')

d2 <- d1 %>% filter(is_holiday == 1)
ggplot() +
  geom_point(aes(x = visit_date, y = all_visitors), data = d2, color = 'orangered', size = 2) +
  geom_line(aes(x = visit_date, y = all_visitors), data = d1,color = 'dodgerblue3', alpha = 0.6) +
  labs(y = 'All visitors', x = 'Date',
       title = 'Total Visits by date'
       , subtitle = "AirReg") +
  theme_ipsum_es() +
  scale_x_date(date_breaks = '1 month') +
  theme(axis.text.x = element_text(angle=90, hjust = 1))

# Distribution of visits not as tibble, but as vector
#air_visits_nt <- fread('data/air_visit_data.csv') #nt = not tibble

air_visits_tib %>%                      
  group_by(visit_date) %>%
  summarise(total_visits = sum(visitors), .groups = 'drop') %>%
  ggplot(aes(x = total_visits)) +
  geom_histogram(fill = "orangered", col=I("orangered"), bins = 20, alpha = 0.5, size = 1) +
  theme_ipsum_es() +
  labs(x = 'Daily visits'
       ,y = "Number of Visitors"
       ,title = 'Distribution of daily visits, not logged'
       ,subtitle = "AirReg")

# Distribution of visits, logged
# Needed to do the log transformation to get a normal distribution
air_visits_tib %>%
  ggplot(aes(visitors)) +
  geom_vline(xintercept = 20, color = "dodgerblue", size = 2) +
  geom_histogram(fill = "orangered", col=I("orangered"), bins = 20, alpha = 0.5, size = 1) +
  scale_x_log10() +
  labs(x = 'Daily visits'
       ,title = 'Distribution of daily visits, logged'
       ,subtitle = "AirReg") +
  theme_ipsum_es()

# by day of the week
air_visits_tib %>%
  mutate(wday = weekdays(visit_date)) %>%
  group_by(wday) %>%
  summarise(visits = mean(visitors), .groups = 'drop') %>%
  ggplot(aes(wday, visits, fill = wday)) +
  theme_ipsum_es() +
  geom_col(show.legend = FALSE, fill = "orangered", col=I("orangered"), bins = 20, alpha = 0.5, size = 1) +
  theme(legend.position = "none", axis.text.x  = element_text(angle=90, hjust=1, vjust=1)) +
  labs(x = "Day of the week"
       , y = "Average visitors"
       , title = "Average Visitors by Weekday"
       , subtitle = "AirReg")

# by month

air_visits_tib %>%
  mutate(month = months(visit_date)) %>%
  group_by(month) %>%
  summarise(visits = mean(visitors), .groups = 'drop') %>%
  ggplot(aes(month, visits, fill = month)) +
  theme_ipsum_es() +
  geom_col(show.legend = FALSE, fill = "orangered", col=I("orangered"), bins = 20, alpha = 0.5, size = 1) + 
  scale_x_discrete(limits = month.name) +

  theme(legend.position = "none", axis.text.x  = element_text(angle=90, hjust=1, vjust=1))+
  labs(x = "Month Number"
       , y = "Average Visitors"
       , title = "Monthly Average Restaurant Visitors"
       , subtitle = "AirReg")

# Target Variable

air_visits_tib %>%
  filter(visit_date > ymd("2016-04-15") & visit_date < ymd("2016-06-15")) %>%
  group_by(visit_date) %>%
  summarise(all_visitors = sum(visitors), .groups = 'drop') %>%
  ggplot(aes(visit_date,all_visitors)) +
  geom_line(color = 'dodgerblue3', size = 1) +
  geom_smooth(method = "loess", color = "orangered", size = 0.5, span = 1/7,formula = y ~ x) + #1/7 for the 7 days for weekdays
  labs(x = "All visitors"
       , y = "Date"
       , title = " All visitors"
       , subtitle = "AirReg") +
  theme_ipsum_es()

# HPG vs AirReg

t1 <- air_store_tib %>% 
  group_by(air_genre_name) %>% 
  summarise(N = n(), .groups = 'drop') %>% 
  ggplot(aes(x = fct_reorder(air_genre_name, N), y = N)) +
  geom_bar(stat= 'identity'
           , fill = 'orangered'
           , col=I("orangered")
           , size = 0.5
           , alpha = 0.5) +
  labs(x = ''
       , y = 'Number of stores'
       , title = 'AirREGI') + 
  coord_flip() +
  theme_bw()

t2 <- hpg_store_tib %>% 
  group_by(hpg_genre_name) %>% 
  summarise(N = n(), .groups = 'drop') %>% 
  ggplot(aes(x = fct_reorder(hpg_genre_name, N), y = N)) +
  geom_bar(stat= 'identity'
           , fill = 'dodgerblue3'
           , col=I("dodgerblue3")
           , size = 0.5
           , alpha = 0.5) +
  labs(x = ''
       , y = 'Number of stores'
       , title = 'HP Gourmet', size = 5) + 
  coord_flip() +
  theme_bw()

grid.arrange(t2, t1, ncol = 2)

# map

leaflet(air_store_tib) %>%
  addTiles() %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(~longitude, ~latitude,
             popup = ~air_store_id, label = ~air_genre_name,
             clusterOptions = markerClusterOptions())

Technicals

When decomposing the aggregated data mentioned above, we noticed that our ACF declined gradually, and our PACF dropped suddenly after the 8th lag. We ran a KPSS test to assess stationarity and see the differences needed.

Once differenced, our KPSS test confirmed that one difference would suffice. Here our ACF and PACF showed strong patterns, which can help us assess some of the parameters in our modeling later on. The ACF showed statistically significant correlations not only every seven days but every three too.

### Stationarity, PACF and ACF

# set up aggregate

air_agg_ts <- air_visits_tib %>%
  group_by(visit_date, is_holiday) %>% 
  summarise(all_visitors = sum(visitors), .groups="drop") %>%
  as_tsibble(index = visit_date)

# ACF and PACF

air_visit_ts <- ts(subset(air_agg_ts, select = "all_visitors", frequency =7))
ggtsdisplay(air_visit_ts, main = "AirReg Visits")

# Differences

air_agg_ts %>%
  gg_tsdisplay(difference(all_visitors), plot_type = 'partial') +
  labs(title = "AirReg visitors Differenced (1)")

# Check for stationarity

air_agg_ts$all_visitors %>%
  unitroot_kpss() %>%
  kbl()%>%
  kable_material(c("striped","hover", full_width = T))

	x
kpss_stat	6.380794
kpss_pvalue	0.010000

# One Difference
air_agg_ts$all_visitors %>%
  diff() %>%
  unitroot_kpss()%>%
  kbl()%>%
  kable_material(c("striped","hover", full_width = T))

	x
kpss_stat	0.0175277
kpss_pvalue	0.1000000

Models

We then started running the models. We started with automated results from the “forecast” package. But before doing that, we arbitrarily decided to run the model after the structural change seen in July 2016. In other words, our train dataset cut date began after July 2016.

# Borrowed from: https://www.kaggle.com/code/mymy610/forecasting-models-arima-ets
# set dates for train and test

cut1 <- air_agg_ts %>% 
  filter(visit_date >='2016-07-01')

par(mfrow=c(1,1))
plot(cut1$all_visitors, type = 'l', main = "Train data post July 2016")

cut1.train <- cut1 %>%
  filter(visit_date < '2017-03-01')
cut1.test <- cut1 %>% 
  filter(visit_date >= '2017-03-01')

#-----------------------------------------#
# Check ACF and PACF, again

par(mfrow=c(1,2))
acf(cut1$all_visitors, main = "ACF post July '16")
pacf(cut1$all_visitors,main = "PACF post July '16")

Auto ARIMA and Manual

The results from the auto ARIMA showed our p, d, q components at 2, 0, 3, respectively. This result is consistent with the visual and empirical assessment made prior using the ACF and PACF. Our residuals were also normally distributed with only a few outliers. When plotting the forecast, it appeared, in many ways, reasonable. Our manual models use diffrent p, d, q parameters with diffrent results.

#-----------------------------------------#
# AUTO ARIMA
cut1.arima <- auto.arima(ts(cut1.train$all_visitors,frequency = 7))
summary(cut1.arima)

## Series: ts(cut1.train$all_visitors, frequency = 7) 
## ARIMA(2,0,3)(2,1,0)[7] 
## 
## Coefficients:
##           ar1     ar2     ma1      ma2      ma3     sar1     sar2
##       -0.3419  0.6301  1.1406  -0.0159  -0.1910  -0.4912  -0.3604
## s.e.   0.1669  0.1583  0.1885   0.3109   0.1417   0.0626   0.0610
## 
## sigma^2 estimated as 2877590:  log likelihood=-2088.19
## AIC=4192.38   AICc=4193.01   BIC=4220.09
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE     MAPE     MASE        ACF1
## Training set 8.187886 1646.755 1099.071 -1.35863 8.691962 0.708948 0.004408588

checkresiduals(cut1.arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,3)(2,1,0)[7]
## Q* = 13.654, df = 7, p-value = 0.05769
## 
## Model df: 7.   Total lags used: 14

cut1.arima.fc <- forecast(cut1.arima, h=53)
acc.cut1.arima <- accuracy(cut1.arima.fc, cut1.test$all_visitors)

# Plot AUTO ARIMA
par(mfrow=c(1,1))
plot(forecast(cut1.arima, h=53))

#-----------------------------------------#
# ARIMA, (7,1,7)

cut1.arima_1 <- arima(ts(cut1.train$all_visitors, frequency = 7),order = c(7,1,7))

summary(cut1.arima_1)

## 
## Call:
## arima(x = ts(cut1.train$all_visitors, frequency = 7), order = c(7, 1, 7))
## 
## Coefficients:
##           ar1      ar2      ar3      ar4      ar5      ar6     ar7     ma1
##       -0.3337  -0.3400  -0.3978  -0.2863  -0.4303  -0.2817  0.5956  0.0104
## s.e.   0.0576   0.0568   0.0580   0.0578   0.0595   0.0587  0.0585  0.0357
##           ma2     ma3      ma4     ma5      ma6      ma7
##       -0.1182  0.1196  -0.1077  0.1099  -0.0441  -0.9698
## s.e.      NaN  0.0352      NaN     NaN   0.0288      NaN
## 
## sigma^2 estimated as 2235114:  log likelihood = -2124.1,  aic = 4278.21
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE    MAPE      MASE       ACF1
## Training set -25.98186 1491.951 968.3998 -1.754433 7.61437 0.3465998 0.06467093

checkresiduals(cut1.arima_1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(7,1,7)
## Q* = 19.388, df = 3, p-value = 0.0002273
## 
## Model df: 14.   Total lags used: 17

cut1.arima_1.fc <- forecast(cut1.arima_1, h=53)
acc.cut1.arima_1 <- accuracy(cut1.arima_1.fc, cut1.test$all_visitors)

# Plot ARIMA (7,1,7)
par(mfrow=c(1,1))
plot(forecast(cut1.arima_1, h=53))

Auto ETS and Manual

Similarly, we ran an ETS model to adjust for seasonality. Manually, we would see that the error component could be either additive or multiplicative; the trend component would likely be absent, while the seasonal component would be additive. We obtained an optimal (M,N,A) when running the automated ETS model.

# AUTO ETS
cut1.ets <- ets(ts(cut1.train$all_visitors,frequency = 7), model = 'ZZZ')
summary(cut1.ets)

## ETS(M,N,A) 
## 
## Call:
##  ets(y = ts(cut1.train$all_visitors, frequency = 7), model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 0.3479 
##     gamma = 1e-04 
## 
##   Initial states:
##     l = 14451.5607 
##     s = -543.4404 -727.7131 -2170.374 -3835.75 -494.0252 4882.413
##            2888.889
## 
##   sigma:  0.1124
## 
##      AIC     AICc      BIC 
## 4912.763 4913.711 4947.694 
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set 9.097243 1701.823 982.1447 -1.869905 8.162042 0.6335257 0.3596927

checkresiduals(cut1.ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,A)
## Q* = 52.576, df = 5, p-value = 4.11e-10
## 
## Model df: 9.   Total lags used: 14

cut1.ets.fc <- forecast(cut1.ets, h=53)
acc.cut1.ets <- accuracy(cut1.ets.fc,cut1.test$all_visitors)

# Plot Auto ETS
par(mfrow=c(1,1))
plot(forecast(cut1.ets, h=53))

#-----------------------------------------#
# ETS Manual
cut1.ets_1 <- ets(ts(cut1.train$all_visitors,frequency = 7), model = 'MAA')
summary(cut1.ets_1)

## ETS(M,Ad,A) 
## 
## Call:
##  ets(y = ts(cut1.train$all_visitors, frequency = 7), model = "MAA") 
## 
##   Smoothing parameters:
##     alpha = 0.3467 
##     beta  = 1e-04 
##     gamma = 1e-04 
##     phi   = 0.9799 
## 
##   Initial states:
##     l = 14811.0925 
##     b = -47.1556 
##     s = -543.7205 -727.7204 -2118.895 -3835.223 -435.9855 4883.788
##            2777.756
## 
##   sigma:  0.1135
## 
##      AIC     AICc      BIC 
## 4919.701 4921.290 4965.111 
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE     ACF1
## Training set 31.51349 1706.079 991.7598 -1.729896 8.225768 0.6397279 0.359037

checkresiduals(cut1.ets_1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,Ad,A)
## Q* = 54.935, df = 3, p-value = 7.09e-12
## 
## Model df: 12.   Total lags used: 15

cut1.ets_1.fc <- forecast(cut1.ets_1, h=53)
acc.cut1.ets_1 <- accuracy(cut1.ets_1.fc,cut1.test$all_visitors)

# Plot ETS Manual
par(mfrow=c(1,1))
plot(forecast(cut1.ets_1, h=53))

Neural Network

We also ran this model in the automated form. Here we obtained different results when forecasting. First, our ACF found no statistical significance with any lags. Though the residuals were generally normally distributed, there were more outliers than the other two models. It still captured the S (frequency) at seven days.

#-----------------------------------------#
# AUTO-NNET
cut1.nnet <- nnetar(ts(cut1.train$all_visitors,frequency = 7))
summary(cut1.nnet)

##           Length Class        Mode     
## x         243    ts           numeric  
## m           1    -none-       numeric  
## p           1    -none-       numeric  
## P           1    -none-       numeric  
## scalex      2    -none-       list     
## size        1    -none-       numeric  
## subset    243    -none-       numeric  
## model      20    nnetarmodels list     
## nnetargs    0    -none-       list     
## fitted    243    ts           numeric  
## residuals 243    ts           numeric  
## lags       16    -none-       numeric  
## series      1    -none-       character
## method      1    -none-       character
## call        2    -none-       call

checkresiduals(cut1.nnet)

cut1.nnet.fc <- forecast(cut1.nnet, h=53)
acc.cut1.nnet <- accuracy(cut1.nnet.fc,cut1.test$all_visitors)

# Plot Auto ETS
par(mfrow=c(1,1))
plot(forecast(cut1.nnet, h=53))

#-----------------------------------------#
#######
# KEEP ADDING MANUAL MODELS AND INCORPORATE THEM INTO THE COMPARISON TABLE

The Results

We then performed manual alterations to the models based on the ACF and PACF previously assessed. The results of the train and test data are below. As we can see, the Neural Net forecast performed best in the training data but quite bad in the test set. We chose the ETS (M,N,A) model because of its low RMSE in the test set compared to the other models.

#-----------------------------------------#
# Compare Models, Auto and Manual
results.cut1 <- matrix(c(acc.cut1.arima[1,2:3],acc.cut1.arima[2,2:3],
                         acc.cut1.ets[1,2:3],acc.cut1.ets[2,2:3],
                         acc.cut1.nnet[1,2:3],acc.cut1.nnet[2,2:3],
                         acc.cut1.arima_1[1,2:3],acc.cut1.arima_1[2,2:3],
                         acc.cut1.ets_1[1,2:3],acc.cut1.ets_1[2,2:3])
                       , nrow=5, ncol=4, byrow = T)

colnames(results.cut1)=c("RMSE-train","MAE-train","RMSE-test","MAE-test")

rownames(results.cut1)=c("Auto ARIMA (2,0,3)[7]"
                         ,"Auto ETS (M,N,A)"
                         ,"Neural Net"
                         ,"Manual ARIMA (7,1,7)"
                         ,"Auto ETS (A,Ad,A) Damped")
#as.table(results.cut1)

results.cut1 %>%
  kbl()%>%
  kable_material(c("striped","hover", full_width = T))

	RMSE-train	MAE-train	RMSE-test	MAE-test
Auto ARIMA (2,0,3)[7]	1646.7554	1099.0706	1404.265	1073.129
Auto ETS (M,N,A)	1701.8227	982.1447	1356.920	1052.206
Neural Net	345.8965	231.2837	3908.441	2962.862
Manual ARIMA (7,1,7)	1491.9510	968.3998	2033.491	1665.223
Auto ETS (A,Ad,A) Damped	1706.0789	991.7598	1372.771	1062.731

Chosen Model

# Model Chosen

cut2 <- d1 %>% 
  filter(visit_date >= '2016-07-01')

model <- ets(ts(cut2$all_visitors,frequency = 7), model = 'ZZZ')

summary(model)

## ETS(M,N,A) 
## 
## Call:
##  ets(y = ts(cut2$all_visitors, frequency = 7), model = "ZZZ") 
## 
##   Smoothing parameters:
##     alpha = 0.3451 
##     gamma = 2e-04 
## 
##   Initial states:
##     l = 14498.7956 
##     s = -674.6785 -915.6947 -2200.504 -3803.51 -479.7646 5053.785
##            3020.367
## 
##   sigma:  0.1066
## 
##      AIC     AICc      BIC 
## 6020.765 6021.537 6057.669 
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set 13.01053 1623.717 954.0923 -1.589591 7.667284 0.6302864 0.3486499

model.fc <- forecast(model, h=53)
autoplot(model.fc) +
  labs(y = "Aggregate Visitors"
       ,title = "Forecast from ETS (M,N,A)"
       ,subtitle = "Aggregate, from all stores")+
  theme_ipsum_es()

Parallel Processing (PP)

Because of my inability to produce a forecast by store using the methods above, I researched how to get several iterations of specific models by store id. I learned about parallel processing and the packages needed to perform such a task. I was able to run auto-ARIMA, auto-ETS, auto-SES, Holts-Winter, and Neural Nets on a look using parallel processing. While I borrowed this from https://www.kaggle.com/code/dongxu027/time-series-ets-nn-arima-ses-lb-0-567, the concept was largely understood.

Data Wrangling

air_visit$visit_date <- as.Date(parse_date_time(air_visit$visit_date,'%y-%m-%d'))
train_wide <- dcast( #pivots table to apply iterations by store
  air_visit, air_store_id ~ visit_date, value.var = "visitors", fill = 0)
train_ts <- ts(train_wide[, 2:dim(train_wide)[2]], frequency = 7) 
fcst_intv = 39  
fcst_matrix <- matrix(NA,nrow=4*nrow(train_ts),ncol=fcst_intv)

Models PP

### Forecasting --> about 30 min to run
# registerDoMC(detectCores()-1)
# fcst_matrix <- foreach(i=1:nrow(train_ts),.combine=rbind, .packages=c("forecast")) %do% { 
#   fcst_ets <- forecast(ets(train_ts[i,]),h=fcst_intv)$mean
#   fcst_nnet <- forecast(nnetar(train_ts[i,]),h=fcst_intv)$mean
#   fcst_arima <- forecast(auto.arima(train_ts[i,]),h=fcst_intv)$mean
#   fcst_ses <- forecast(HoltWinters(train_ts[i,], beta=FALSE, gamma=FALSE),h=fcst_intv)$mean
#   fcst_matrix <- rbind(fcst_ets, fcst_nnet, fcst_arima, fcst_ses)
# }

Prepare for Submission

## Prepare for Submission

### post-processing the forecast table
fcst_matrix_mix <- aggregate(fcst_matrix,list(rep(1:(nrow(fcst_matrix)/4),each=4)),mean)[-1]
fcst_matrix_mix[fcst_matrix_mix < 0] <- 0
colnames(fcst_matrix_mix) <- as.character(
  seq(from = as.Date("2017-04-23"), to = as.Date("2017-05-31"), by = 'day'))
fcst_df <- as.data.frame(cbind(train_wide[, 1], fcst_matrix_mix)) 
colnames(fcst_df)[1] <- "air_store_id"

### melt the forecast data frame from wide to long format for final submission
fcst_df_long <- melt(
  fcst_df, id = 'air_store_id', variable.name = "fcst_date", value.name = 'visitors')
fcst_df_long$air_store_id <- as.character(fcst_df_long$air_store_id)
fcst_df_long$fcst_date <- as.Date(parse_date_time(fcst_df_long$fcst_date,'%y-%m-%d'))
fcst_df_long$visitors <- as.numeric(fcst_df_long$visitors)

### get & process the sample submission file
sample_sub <- fread("data/sample_submission.csv")
sample_sub$visitors <- NULL
sample_sub$store_id <- substr(sample_sub$id, 1, 20)
sample_sub$visit_date <- substr(sample_sub$id, 22, 31)
sample_sub$visit_date <- as.Date(parse_date_time(sample_sub$visit_date,'%y-%m-%d'))

### generate the final submission file
submission <- left_join(
  sample_sub, fcst_df_long, c("store_id" = "air_store_id", 'visit_date' = 'fcst_date'))
submission$visitors[is.na(submission$visitors)] <- 0
final_sub <- select(submission, c('id', 'visitors'))
write.csv(final_sub, "submission_final.csv", row.names = FALSE)

Score & Discuss

The scores were relatively decent. The public score was 0.56147, and the private score was 0.59167. I suspect that these possibly high scores directly resulted from the model’s several iterations.

Final Comments

Future Work, Limitations, and Learned Experiences

There is extensive work to be done on this topic from a personal and technical perspective. First, the learned techniques have only scratched the surface of the potential use in practice. While the concepts were generally understood, more practice on these models needs to be performed. This paper only focused on univariate modeling, but we could have performed multivariate modeling using the techniques reviewed throughout the semester. These limitations bring about less than optimal results in the modeling, though the general idea is not necessarily incorrect but simply insufficient. These models will be utilized constantly and will be published in rpubs.

References

G. Boomija, A. Anandaraj, S. Nandhini, and S. Lavanya Restaurant Visitor Time Series Forecasting Using Autoregressive Integrated Moving Average Journal of Computational and Theoretical Nanoscience Vol. 15, 1590–1593, 2018.

Mingming Hu, Richard T.R. Qiu, Doris Chenguang Wu, Haiyan Song, Hierarchical pattern recognition for tourism demand forecasting, Tourism Management, Volume 84, 2021, 104263,ISSN 0261-5177, https://doi.org/10.1016/j.tourman.2020.104263.

Fong-Lin Chu Forecasting tourism demand with ARMA-based methods Tourism Management, 30 (5) (2009), pp. 740-751.

Dennis Reynolds, Imran Rahman, William Balinbin, Econometric modeling of the U.S. restaurant industry, International Journal of Hospitality Management, Volume 34, 2013, Pages 317-323, ISSN 0278-4319 https://doi.org/10.1016/j.ijhm.2013.04.003.

Richard T.R. Qiu, Doris Chenguang Wu, Vincent Dropsy, Sylvain Petit, Stephen Pratt, Yasuo Ohe, Visitor arrivals forecast amid COVID-19: A perspective from the Asia and Pacific team, Annals of Tourism Research, Volume 88, 2021, 103155, ISSN 0160-7383. https://doi.org/10.1016/j.annals.2021.103155.

Restaurant Visitor Forecast

Angel Rubio

05/08/2022