Midterm project: Recruit Holdings: Restaurant Visitor Forecasting
Problem
Restaurants need to know how many customers to expect each day to effectively purchase ingredients and schedule staff members. There are many unpredictable factors that affect restaurant attendance, like weather and local competition.
In this competition, we use reservation and visitation data to predict the total number of visitors to each restaurant on future dates.
Significance
This prediction is significant as it will make planning easier and reduce wastage. This information will help restaurants be much more efficient and allow them to focus on creating an enjoyable dining experience for their customers.
Data
The data for this problem comes from two separate sites: AirREGI (prefix air), a reservation control and cash register system, and Hot Pepper Gourmet (prefix hpg), a restaurant search and reservation service.
Let's load all the files using read.csv():
air_reserve<-read.csv("air_reserve.csv")
hpg_reserve<-read.csv("hpg_reserve.csv")
air_visits<-read.csv("air_visit_data.csv")
date_info<-read.csv("date_info.csv")
air_store_info<-read.csv("air_store_info.csv")
hpg_store_info<-read.csv("hpg_store_info.csv")
# store_id_relation<-read.csv("store_id_relation.csv")
These sites provide reservation, visit, and store information for restaurants in Japan. The training data covers January 2016 through April 2017. The test set covers the last week of April and May 2017, and intentionally spans Golden Week, a holiday week in Japan.
head(air_reserve) ; head(hpg_reserve)
## air_store_id visit_datetime reserve_datetime
## 1 air_877f79706adbfb06 2016-01-01 19:00:00 2016-01-01 16:00:00
## 2 air_db4b38ebe7a7ceff 2016-01-01 19:00:00 2016-01-01 19:00:00
## 3 air_db4b38ebe7a7ceff 2016-01-01 19:00:00 2016-01-01 19:00:00
## 4 air_877f79706adbfb06 2016-01-01 20:00:00 2016-01-01 16:00:00
## 5 air_db80363d35f10926 2016-01-01 20:00:00 2016-01-01 01:00:00
## 6 air_db80363d35f10926 2016-01-02 01:00:00 2016-01-01 16:00:00
## reserve_visitors
## 1 1
## 2 3
## 3 6
## 4 2
## 5 5
## 6 2
## hpg_store_id visit_datetime reserve_datetime
## 1 hpg_c63f6f42e088e50f 2016-01-01 11:00:00 2016-01-01 09:00:00
## 2 hpg_dac72789163a3f47 2016-01-01 13:00:00 2016-01-01 06:00:00
## 3 hpg_c8e24dcf51ca1eb5 2016-01-01 16:00:00 2016-01-01 14:00:00
## 4 hpg_24bb207e5fd49d4a 2016-01-01 17:00:00 2016-01-01 11:00:00
## 5 hpg_25291c542ebb3bc2 2016-01-01 17:00:00 2016-01-01 03:00:00
## 6 hpg_28bdf7a336ec6a7b 2016-01-01 17:00:00 2016-01-01 15:00:00
## reserve_visitors
## 1 1
## 2 3
## 3 2
## 4 5
## 5 13
## 6 2
head(air_visits); head(hpg_store_info)
## air_store_id visit_date visitors
## 1 air_ba937bf13d40fb24 2016-01-13 25
## 2 air_ba937bf13d40fb24 2016-01-14 32
## 3 air_ba937bf13d40fb24 2016-01-15 29
## 4 air_ba937bf13d40fb24 2016-01-16 22
## 5 air_ba937bf13d40fb24 2016-01-18 6
## 6 air_ba937bf13d40fb24 2016-01-19 9
## hpg_store_id hpg_genre_name hpg_area_name
## 1 hpg_6622b62385aec8bf Japanese style Tōkyō-to Setagaya-ku Taishidō
## 2 hpg_e9e068dd49c5fa00 Japanese style Tōkyō-to Setagaya-ku Taishidō
## 3 hpg_2976f7acb4b3a3bc Japanese style Tōkyō-to Setagaya-ku Taishidō
## 4 hpg_e51a522e098f024c Japanese style Tōkyō-to Setagaya-ku Taishidō
## 5 hpg_e3d0e1519894f275 Japanese style Tōkyō-to Setagaya-ku Taishidō
## 6 hpg_530cd91db13b938e Japanese style Tōkyō-to Setagaya-ku Taishidō
## latitude longitude
## 1 35.64367 139.6682
## 2 35.64367 139.6682
## 3 35.64367 139.6682
## 4 35.64367 139.6682
## 5 35.64367 139.6682
## 6 35.64367 139.6682
library(psych)
des<-rbind(describe(air_reserve),describe(hpg_reserve),describe(date_info),
describe(air_visits),describe(air_store_info),describe(hpg_store_info))
des
## vars n mean sd median trimmed mad
## air_store_id* 1 92378 151.34 92.55 147.00 150.44 121.57
## visit_datetime* 2 92378 2958.96 1188.51 3079.00 3054.18 1214.25
## reserve_datetime* 3 92378 4806.34 1942.51 5075.00 5004.60 1752.43
## reserve_visitors 4 92378 4.48 4.92 3.00 3.39 1.48
## hpg_store_id* 1 2000320 6680.24 3854.37 6683.00 6684.65 4965.23
## visit_datetime*1 2 2000320 5635.59 2681.73 6377.00 5810.69 2830.28
## reserve_datetime*1 3 2000320 6724.42 3228.43 7537.00 6912.04 3629.40
## reserve_visitors1 4 2000320 5.07 5.42 3.00 3.92 1.48
## calendar_date* 1 517 259.00 149.39 259.00 259.00 191.26
## day_of_week* 2 517 4.00 2.00 4.00 4.00 2.97
## holiday_flg 3 517 0.07 0.25 0.00 0.00 0.00
## air_store_id*1 1 252108 413.70 239.70 407.00 413.21 309.86
## visit_date* 2 252108 286.53 123.15 297.00 294.11 137.88
## visitors 3 252108 20.97 16.76 17.00 18.82 13.34
## air_store_id*2 1 829 415.00 239.46 415.00 415.00 306.90
## air_genre_name* 2 829 6.26 3.13 7.00 6.02 2.97
## air_area_name* 3 829 54.10 31.14 58.00 54.90 43.00
## latitude 4 829 35.65 2.08 35.66 35.26 0.10
## longitude 5 829 137.42 3.65 139.69 137.82 0.20
## hpg_store_id*1 1 4690 2345.50 1354.03 2345.50 2345.50 1738.35
## hpg_genre_name* 2 4690 14.85 5.08 16.00 14.65 5.93
## hpg_area_name* 3 4690 66.05 35.10 70.00 67.16 45.96
## latitude1 4 4690 35.81 2.14 35.66 35.38 1.01
## longitude1 5 4690 137.68 3.20 139.50 138.09 0.78
## min max range skew kurtosis se
## air_store_id* 1.00 314.00 313.00 0.10 -1.30 0.30
## visit_datetime* 1.00 4975.00 4974.00 -0.60 -0.39 3.91
## reserve_datetime* 1.00 7513.00 7512.00 -0.80 -0.19 6.39
## reserve_visitors 1.00 100.00 99.00 4.65 33.52 0.02
## hpg_store_id* 1.00 13325.00 13324.00 -0.01 -1.21 2.73
## visit_datetime*1 1.00 9847.00 9846.00 -0.50 -0.94 1.90
## reserve_datetime*1 1.00 11450.00 11449.00 -0.44 -0.99 2.28
## reserve_visitors1 1.00 100.00 99.00 4.60 36.10 0.00
## calendar_date* 1.00 517.00 516.00 0.00 -1.21 6.57
## day_of_week* 1.00 7.00 6.00 0.00 -1.26 0.09
## holiday_flg 0.00 1.00 1.00 3.43 9.79 0.01
## air_store_id*1 1.00 829.00 828.00 0.02 -1.21 0.48
## visit_date* 1.00 478.00 477.00 -0.41 -0.69 0.25
## visitors 1.00 877.00 876.00 3.31 74.26 0.03
## air_store_id*2 1.00 829.00 828.00 0.00 -1.20 8.32
## air_genre_name* 1.00 14.00 13.00 0.45 -0.41 0.11
## air_area_name* 1.00 103.00 102.00 -0.17 -1.34 1.08
## latitude 33.21 44.02 10.81 2.65 7.44 0.07
## longitude 130.20 144.27 14.08 -0.93 -0.55 0.13
## hpg_store_id*1 1.00 4690.00 4689.00 0.00 -1.20 19.77
## hpg_genre_name* 1.00 34.00 33.00 0.41 0.47 0.07
## hpg_area_name* 1.00 119.00 118.00 -0.19 -1.21 0.51
## latitude1 33.31 43.77 10.46 2.50 6.04 0.03
## longitude1 130.34 143.71 13.38 -0.98 -0.14 0.05
We find that the air_visits file contains the visitor counts for each visit_date and air_store_id. The date feature should be transformed into a time-series format. There are 829 unique participating stores, and the dataset holds about 252K records (~829 x 304) across 3 columns.
We find that air_reserve includes the date and time of each reservation as well as of the visit itself. We have reservation numbers for 314 air stores: about 92K records (~314 x 294) across 4 columns. hpg_reserve has the same structure and holds reservation numbers for 13,325 hpg stores: about 2M records (~13,325 x 150) across 4 columns.
The air_store_info file has geographical and cuisine-related information for each air store, with 829 records. hpg_store_info follows the same structure; however, it contains only 4,690 distinct hpg_store_ids, compared with the 13,325 seen in the reservation data above.
The date_info file essentially contains the holiday calendar: holiday_flg is a binary flag that is 1 on holidays and 0 otherwise.
Using describe(), we observed that there are no missing values in these files.
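As a quick sanity check of the describe() output, here is a minimal added sketch that counts missing values per file (the report itself relies on describe()):
sapply(list(air_reserve = air_reserve, hpg_reserve = hpg_reserve,
            air_visits = air_visits, date_info = date_info,
            air_store_info = air_store_info, hpg_store_info = hpg_store_info),
       function(d) sum(is.na(d))) # expect all zeros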
We convert the date/time features to proper date classes and recode a few features as logical and factor variables for exploration purposes.
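A minimal sketch of these conversions, assuming the column names shown in the head() and describe() output above (the exact recoding in the original analysis may differ):
air_visits$visit_date         <- as.Date(air_visits$visit_date)
air_reserve$visit_datetime    <- as.POSIXct(air_reserve$visit_datetime)   # "YYYY-MM-DD HH:MM:SS" parses directly
air_reserve$reserve_datetime  <- as.POSIXct(air_reserve$reserve_datetime)
hpg_reserve$visit_datetime    <- as.POSIXct(hpg_reserve$visit_datetime)
hpg_reserve$reserve_datetime  <- as.POSIXct(hpg_reserve$reserve_datetime)
date_info$calendar_date       <- as.Date(date_info$calendar_date)
date_info$holiday_flg         <- as.logical(date_info$holiday_flg)        # 1/0 -> TRUE/FALSE
date_info$day_of_week         <- as.factor(date_info$day_of_week)
air_store_info$air_genre_name <- as.factor(air_store_info$air_genre_name)
hpg_store_info$hpg_genre_name <- as.factor(hpg_store_info$hpg_genre_name)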
Visualizing Data
library(ggplot2); library(tidyr); library(purrr)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Warning: package 'tidyr' was built under R version 3.4.2
## Warning: package 'purrr' was built under R version 3.4.2
# air_visits[, .(total_visits = sum(visitors)), by = list(visit_date)] %>%
# ggplot(aes(x = total_visits)) +
# geom_histogram(fill = 'steelblue') +
# labs(x = 'Daily visits', title = 'Distribution of daily visits')
air.date<-aggregate(air_visits$visitors,by=list(visit_date = air_visits$visit_date), FUN=sum)
names(air.date)[2]<-c("daily.total")
hist(as.numeric(air.date$daily.total),xlab = 'Daily visits', main = 'Distribution of daily visits',breaks=29)
air.store<-aggregate(air_visits$visitors,by=list(air_store_id = air_visits$air_store_id) ,FUN=sum)
names(air.store)[2]<-c("store.total")
air.store.in<-merge(air_visits,air_store_info,by="air_store_id")
# install.packages("ggmap")
# library(ggmap)
# map_center = c(mean(air.store.in$longitude), mean(air.store.in$latitude))
# map = ggmap(get_googlemap(center=map_center, scale=2, zoom=4), extent="normal")
# map +
# geom_point(aes(x=air.store.in$longitude, y=air.store.in$lattitude), data=air.store.in, col="orange", alpha=0.4, size=air.store.in$store.total) +
# scale_size_continuous(range=range(air.store.in$store.total))
air.genre<-aggregate(air.store.in$visitors,by=list(genre = air.store.in$air_genre_name), FUN=sum)
barplot(names.arg = air.genre$genre,horiz = FALSE, height = air.genre$x,col=heat.colors(length(unique(air.genre$genre))), border = "dark blue")
title("Total customers by genre name")
The top three cuisine genres are Izakaya, Cafe/Sweets, and Italian/French.
Further visualizations that I created are available on my Tableau Public profile:
We thus learned in this step:
* There is a clear weekly cycle, with a prominent weekend vs. weekday effect (see the day-of-week sketch after this list).
* The average number of visitors is consistently highest in Asian cuisine restaurants in both years; either the cuisine is more liked or more Asian residents live in those areas.
* There are holidays in every month except June 2016, and September 2016 is the only month in which the average number of visits is lower on holidays than on non-holidays.
* Tōkyō-to, followed by Fukuoka-ken, offers the greatest variety of cuisines/genres.
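A minimal sketch behind the weekly-cycle observation (an added illustration, not code from the Tableau dashboards; air.dow and dow.mean are names introduced here), joining the visits to the calendar and averaging by day of week:
air.dow<-merge(air_visits, date_info, by.x = "visit_date", by.y = "calendar_date")
dow.mean<-aggregate(air.dow$visitors, by = list(day_of_week = air.dow$day_of_week), FUN = mean)
barplot(dow.mean$x, names.arg = dow.mean$day_of_week, ylab = "Mean visitors")
title("Average visitors by day of week")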
Literature
Several very useful studies on short-term demand planning for the restaurant industry, detailed below, have helped me look at this problem in a more holistic way.
The restaurant industry experiences high growth (a boom) every five years on average. The troughs of its growth cycles, in contrast to the peaks, coincide with those of the restaurant industry's business cycles. Choi et al. (1999) developed a hotel industry cycle model that represents the cyclical characteristics of the hotel industry over 28 years (1966 to 1993).
Choi, Jeong-Gil of Virginia Tech authored a study supporting the view that cyclical fluctuations in the growth of the restaurant industry can be projected by measuring and analyzing series of economic indicators; each economic indicator has specific characteristics in terms of time lags and can thus be classified as leading, coincident, or lagging. The study formed a set of composite indices, with twelve indicators classified as leading, six as coincident, and twenty as lagging [1].
Very little prior research has assessed econometric models for the restaurant industry. Cranage and Andrew (1992) as well as Morgan and Chintagunta (1997) created econometric models for restaurants; however, these studies focused on data from a single operation and forecasted only weeks or months into the future. In addition, these studies lacked aggregated data that is practical at the unit level; moreover, there was no comparison of model fit across industry segments.
Data Checks / Sample Run
library(forecast)
## Warning: package 'forecast' was built under R version 3.4.3
sample.air.id = "air_ba937bf13d40fb24"
sample.data<- subset(air_visits, air_store_id == sample.air.id)
sample.train<- subset(sample.data,as.Date(visit_date) < "2017-04-08")
sample.valid<- subset(sample.data,as.Date(visit_date) >= "2017-04-08")
myseries<-ts(sample.train$visitors, frequency = 7)
plot(myseries)
plot(tsclean(myseries))
plot(decompose(myseries))
ses.fit1<-ses(myseries, alpha=0.2, initial="simple", h=14)
summary(ses.fit1)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = myseries, h = 14, initial = "simple", alpha = 0.2)
##
## Smoothing parameters:
## alpha = 0.2
##
## Initial states:
## l = 25
##
## sigma: 12.0782
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.02180043 12.07819 9.514467 -62.49375 88.55174 0.8102078
## ACF1
## Training set -0.05489637
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 54.85714 23.35625 7.877423 38.83507 -0.3165709 47.02907
## 55.00000 23.35625 7.570882 39.14161 -0.7853851 47.49788
## 55.14286 23.35625 7.270182 39.44231 -1.2452670 47.95776
## 55.28571 23.35625 6.975000 39.73749 -1.6967086 48.40920
## 55.42857 23.35625 6.685044 40.02745 -2.1401582 48.85265
## 55.57143 23.35625 6.400045 40.31245 -2.5760258 49.28852
## 55.71429 23.35625 6.119759 40.59274 -3.0046874 49.71718
## 55.85714 23.35625 5.843957 40.86854 -3.4264891 50.13898
## 56.00000 23.35625 5.572433 41.14006 -3.8417501 50.55424
## 56.14286 23.35625 5.304992 41.40750 -4.2507655 50.96326
## 56.28571 23.35625 5.041456 41.67104 -4.6538089 51.36630
## 56.42857 23.35625 4.781659 41.93084 -5.0511345 51.76363
## 56.57143 23.35625 4.525445 42.18705 -5.4429790 52.15547
## 56.71429 23.35625 4.272672 42.43982 -5.8295631 52.54206
ses.fit2<-ses(myseries, alpha=0.8, initial="simple", h=14)
plot(ses.fit1, ylab="Daily Vistors", xlab="Days", main="", fcol="white", type="o")
lines(fitted(ses.fit1), col="blue", type="o")
lines(fitted(ses.fit2), col="red", type="o")
lines(ses.fit1$mean, col="blue", type="o")
lines(ses.fit2$mean, col="red", type="o")
legend("topleft",lty=1, col=c(1,"blue","red"), c("data",expression(alpha == 0.2),expression(alpha == 0.8)),pch=1)
# Holt Winters Linear Trend
holt.fit1 <- holt(myseries, alpha=0.8, beta=0.2, initial="simple", h=14)
summary(holt.fit1)
##
## Forecast method: Holt's method
##
## Model Information:
## Holt's method
##
## Call:
## holt(y = myseries, h = 14, initial = "simple", alpha = 0.8, beta = 0.2)
##
## Smoothing parameters:
## alpha = 0.8
## beta = 0.2
##
## Initial states:
## l = 25
## b = 7
##
## sigma: 16.0293
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.04519378 16.02928 12.51845 -72.98921 114.3338 1.066013
## ACF1
## Training set -0.2833701
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 54.85714 43.63378 23.091430 64.17614 12.216965 75.05060
## 55.00000 47.90770 18.856420 76.95897 3.477603 92.33779
## 55.14286 52.18161 14.081216 90.28200 -6.087913 110.45113
## 55.28571 56.45552 8.719401 104.19164 -16.550573 129.46161
## 55.42857 60.72943 2.772317 118.68654 -27.908326 149.36718
## 55.57143 65.00334 -3.744525 133.75121 -40.137447 170.14413
## 55.71429 69.27725 -10.811588 149.36609 -53.208058 191.76256
## 55.85714 73.55116 -18.408858 165.51119 -67.089554 214.19188
## 56.00000 77.82508 -26.517142 182.16729 -81.752576 237.40273
## 56.14286 82.09899 -35.118501 199.31647 -97.169690 261.36766
## 56.28571 86.37290 -44.196347 216.94214 -113.315530 286.06133
## 56.42857 90.64681 -53.735400 235.02902 -130.166725 311.46034
## 56.57143 94.92072 -63.721590 253.56303 -147.701756 337.54320
## 56.71429 99.19463 -74.141941 272.53121 -165.900780 364.29004
# Exponential Trend
holt.fit2<-holt(myseries,alpha=0.8,beta=0.2,initial="simple",exponential=TRUE, h=14)
# Damped Trend
holt.fit3 <- holt(myseries, alpha=0.8, beta=0.2, damped=TRUE, initial="simple", h=14)
## Warning in holt(myseries, alpha = 0.8, beta = 0.2, damped = TRUE, initial =
## "simple", : Damped Holt's method requires optimal initialization
# Results for first model:
# holt.fit1$model$state
plot(ses.fit1, ylab="Daily Visitors", xlab="Days", main="", fcol="white", type="o") # reuse the data plot; its forecast is drawn in white so the Holt fits below stand out
lines(fitted(holt.fit1), col="blue", type="o")
lines(fitted(holt.fit2), col="red", type="o")
lines(fitted(holt.fit3), col="green", type="o")
lines(holt.fit1$mean, col="blue", type="o")
lines(holt.fit2$mean, col="red", type="o")
lines(holt.fit3$mean, col="green", type="o")
legend("bottomleft",lty=1, col=c(1,"blue","red","green"), c("data","Holt1", "Holt2-Exp", "Holt3-damp"), pch=1)
tsdisplay(sample.data$visitors)
tsdisplay(diff(sample.data$visitors,1))
arima.fit <- auto.arima(myseries,stepwise = FALSE, approximation = FALSE)
summary(arima.fit)
## Series: myseries
## ARIMA(2,1,3)
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3
## 1.0580 -0.9203 -2.0242 1.8315 -0.7728
## s.e. 0.0383 0.0356 0.0643 0.0914 0.0478
##
## sigma^2 estimated as 118.3: log likelihood=-1429.97
## AIC=2871.94 AICc=2872.17 BIC=2895.52
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1814535 10.78761 8.647176 -52.97218 76.66563 0.7363533
## ACF1
## Training set -0.07758166
arima_visits <- forecast(arima.fit,h = 14)
# accuracy(arima_visits$mean, sample.valid$visitors)
plot(arima.fit$x, ylab="Daily Visitors", xlab="Days", main="",type="o")
lines(fitted(arima.fit), col="blue", type="o")
lines(arima_visits$mean, col="blue", type="o") # overlay the 14-day ARIMA forecast
Similarly, I will run these models and produce forecasts for each restaurant across the forecast period.
Types of Models
In this forecasting project, I experimented with several classes of models: exponential smoothing models, ARIMA models, and more advanced techniques such as Random Forests and XGBoost.
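The Random Forest and XGBoost experiments are not reproduced in this report; as one hedged illustration of how a gradient-boosted model could be set up on these files (all feature choices here are my assumptions, not features prescribed by the analysis):
library(xgboost)
feats<-merge(air_visits, date_info, by.x = "visit_date", by.y = "calendar_date")
feats<-merge(feats, air_store_info, by = "air_store_id")
# one-hot encode day of week and genre; holiday_flg enters as 0/1
X<-model.matrix(~ day_of_week + holiday_flg + air_genre_name - 1, data = feats)
dtrain<-xgb.DMatrix(X, label = log1p(feats$visitors)) # model log(1 + visitors), matching the RMSLE metric
bst<-xgb.train(params = list(objective = "reg:squarederror", eta = 0.1, max_depth = 6),
               data = dtrain, nrounds = 100)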
Formulation
Econometrics-based mathematical formulations of the models discussed above are given below:
* Simple Exponential Smoothing: \(F(t+1) = \alpha \times D(t) + (1-\alpha) \times F(t)\)
* Holt Linear Trend: \(F(t+h) = \ell(t) + h \times b(t)\), with level \(\ell(t) = \alpha \times D(t) + (1-\alpha)(\ell(t-1) + b(t-1))\) and trend \(b(t) = \beta(\ell(t) - \ell(t-1)) + (1-\beta) \times b(t-1)\)
Moving average: The number of periods used determines how quickly the average reacts to changes in actual trends and how sensitive it is to random variations. Including more periods makes the calculation more stable against random variations, but it also makes it react more slowly to changes resulting from real trends.
\(F(t+1) = a \times M + (1-a) \times L\)
Exponential smoothing: This forecast formula weights the latest base demand value with the smoothing constant \(\alpha\), while the previous base forecast value is weighted with \(1-\alpha\):
\(F(t+1) = \alpha \times D(t) + (1-\alpha) \times F(t)\)
The value of the smoothing constant determines how quickly the forecast reacts to changes in actual trends and how sensitive it is to random variations. The lower the value, the more stable the calculation is against random variations, but the more slowly it reacts to changes resulting from real trends. The smoothing constant must lie between 0 and 1.
Adaptive smoothing: This variant uses \(\alpha(t)\) values that are recalculated each period and adjusted for the current systematic forecast error; a larger mean forecast error results in a higher \(\alpha(t)\), which corrects the forecast more quickly toward actual demand:
\(F(t+1) = \alpha(t) \times D(t) + (1-\alpha(t)) \times F(t)\)
Key:
D(t): base demand at time t.
F(t): base forecast at time t.
L: level (mean) of demand over all observations.
M: level (mean) of demand over the latest 25% of observations.
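To make the SES recursion concrete, here is a minimal hand-rolled implementation of the F(t+1) formula above (an added sketch; the actual fits in this report use forecast::ses, and ses_manual is a name introduced here):
ses_manual <- function(y, alpha, h = 14) {
  n <- length(y)
  f <- numeric(n + 1)
  f[1] <- y[1]                                    # simple initialization, as in initial = "simple"
  for (t in 1:n) {
    f[t + 1] <- alpha * y[t] + (1 - alpha) * f[t] # F(t+1) = alpha x D(t) + (1-alpha) x F(t)
  }
  rep(f[n + 1], h)                                # SES forecasts are flat beyond the data
}
# ses_manual(sample.train$visitors, alpha = 0.2)[1] should match ses.fit1's point forecast (23.35625)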
Final Modeling
# Plan: define the models, store the results in one place, and pick the best model;
# loop over stores, subset each store's data, split off a validation window,
# and forecast over the validation period with the best model.
storelist<-unique(air_visits$air_store_id)
results<-NULL
#names(results)<-c("modelName","p","d","q","P","D","Q","S")
#for(i in 1:length(storelist)) {
#  # Data prep: hold out the last days of each store's history as validation
#  store_i<-subset(air_visits, air_store_id == storelist[i])
#  store_i$visit_date<-as.Date(store_i$visit_date)
#  split_date<-max(store_i$visit_date) - 13
#  store_i.train<-subset(store_i, visit_date <= split_date)
#  store_i.valid<-subset(store_i, visit_date > split_date)
#  myts<-ts(store_i.train$visitors, frequency = 7)
#  # Build models
#  ses.fit<-ses(myts, h = 13)
#  results<-rbind(results, c("SES", ses.fit$model$par[1], 0, 0, 0, 0, 0, 0))
#  arima.fit<-auto.arima(myts, approximation = FALSE)
#  arima_visits<-forecast(arima.fit, h = 13)
#  results<-rbind(results, c("ARIMA", arimaorder(arima.fit))) # p, d, q (plus P, D, Q, S if seasonal)
#}
# Save results
# Add other variables in a regression if time permits.
Performance Accuracy
For within-model selection I pick the lowest AICc; for between-model comparison I use RMSLE on the test dataset.
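Since RMSLE is the competition metric, here is a minimal helper for the between-model comparison (an added sketch, assuming equal-length numeric vectors of actual and predicted visitors):
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}
# e.g., rmsle(sample.valid$visitors, head(arima_visits$mean, nrow(sample.valid)))
AICc for within-model selection is read directly off the fitted objects, e.g. arima.fit$aicc.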
library(data.table)
## Warning: package 'data.table' was built under R version 3.4.2
##
## Attaching package: 'data.table'
## The following object is masked from 'package:purrr':
##
## transpose
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following object is masked from 'package:base':
##
## date
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#library(date)
library(doParallel)
## Warning: package 'doParallel' was built under R version 3.4.4
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.4.2
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loading required package: iterators
## Loading required package: parallel
library(snow)
## Warning: package 'snow' was built under R version 3.4.4
##
## Attaching package: 'snow'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, clusterSplit, makeCluster,
## parApply, parCapply, parLapply, parRapply, parSapply,
## splitIndices, stopCluster
# data load and pre-processing
train_wide <- dcast(
air_visits, air_store_id ~ visit_date, value.var = "visitors", fill = 0)
train_ts <- ts(train_wide[, 2:dim(train_wide)[2]], frequency = 7)
fcst_intv = 39 ### 39 days of forecast horizon
fcst_matrix <- matrix(NA,nrow=4*nrow(train_ts),ncol=fcst_intv)
### forecasting - register cores for parallel processing
workers=makeCluster(4,type="SOCK")
registerDoParallel(workers)
foreach(i=1:4) %dopar% Sys.getpid()
## [[1]]
## [1] 17972
##
## [[2]]
## [1] 20344
##
## [[3]]
## [1] 4676
##
## [[4]]
## [1] 19992
registerDoParallel(detectCores()-1)
fcst_matrix <- foreach(i=1:nrow(train_ts),.combine=rbind, .packages=c("forecast")) %dopar% {
fcst_ets <- forecast(ets(train_ts[i,]),h=fcst_intv)$mean
fcst_nnet <- forecast(nnetar(train_ts[i,]),h=fcst_intv)$mean
fcst_arima <- forecast(auto.arima(train_ts[i,]),h=fcst_intv)$mean
fcst_ses <- forecast(HoltWinters(train_ts[i,], beta=FALSE, gamma=FALSE),h=fcst_intv)$mean
fcst_matrix <- rbind(fcst_ets, fcst_nnet, fcst_arima, fcst_ses)
}
### post-processing the forecast table
fcst_matrix_mix <- aggregate(fcst_matrix,list(rep(1:(nrow(fcst_matrix)/4),each=4)),mean)[-1]
fcst_matrix_mix[fcst_matrix_mix < 0] <- 0
colnames(fcst_matrix_mix) <- as.character(
seq(from = as.Date("2017-04-23"), to = as.Date("2017-05-31"), by = 'day'))
fcst_df <- as.data.frame(cbind(train_wide[, 1], fcst_matrix_mix))
colnames(fcst_df)[1] <- "air_store_id"
### melt the forecast data frame from wide to long format for final submission
fcst_df_long <- melt(
fcst_df, id = 'air_store_id', variable.name = "fcst_date", value.name = 'visitors')
fcst_df_long$air_store_id <- as.character(fcst_df_long$air_store_id)
fcst_df_long$fcst_date <- as.Date(parse_date_time(fcst_df_long$fcst_date,'%y-%m-%d'))
fcst_df_long$visitors <- as.numeric(fcst_df_long$visitors)
### get & process the sample submission file
sample_sub <- fread("sample_submission.csv")
sample_sub$visitors <- NULL
sample_sub$store_id <- substr(sample_sub$id, 1, 20)
sample_sub$visit_date <- substr(sample_sub$id, 22, 31)
sample_sub$visit_date <- as.Date(parse_date_time(sample_sub$visit_date,'%y-%m-%d'))
### generate the final submission file
submission <- left_join(
sample_sub, fcst_df_long, c("store_id" = "air_store_id", 'visit_date' = 'fcst_date'))
submission$visitors[is.na(submission$visitors)] <- 0
final_sub <- select(submission, c('id', 'visitors'))
write.csv(final_sub, "sub_ts_mix.csv", row.names = FALSE)
Limitations
The main limitation of this project is the limited availability of data on factors that may affect the flow of restaurant visitors. As more variables related to visitor counts become available, I would expect the causal forecasting results to improve.
Future Work
As next steps, I would like to focus on store (restaurant) level visits and group stores by parameters such as store size, proximity to tourist attractions, or some rank or tier of the city/town, and then build models based on that segmentation.
Learnings
I honed my forecasting skills during this project. I also learned that, alongside forecasting itself, it is equally important to develop an understanding of the available data. Asking the right questions and developing the right hypotheses are significant parts of any modeling effort.