Midterm project: Recruit Holdings: Restaurant Visitor Forecasting
Problem
Restaurants need to know how many customers to expect each day to effectively purchase ingredients and schedule staff members. There are many unpredictable factors that affect restaurant attendance, like weather and local competition.
In this competition, we use reservation and visitation data to predict the total number of visitors to each restaurant on future dates.
Significance
This prediction is significant as it will make planning easier and reduce wastage. This information will help restaurants be much more efficient and allow them to focus on creating an enjoyable dining experience for their customers.
Data
The data for this problem comes from two separate sites: AirREGI (prefix air), a reservation control and cash register system, and Hot Pepper Gourmet (prefix hpg), a restaurant search and reservation service.
Let's load all the files using read.csv():
air_reserve<-read.csv("air_reserve.csv")
hpg_reserve<-read.csv("hpg_reserve.csv")
air_visits<-read.csv("air_visit_data.csv")
date_info<-read.csv("date_info.csv")
air_store_info<-read.csv("air_store_info.csv")
hpg_store_info<-read.csv("hpg_store_info.csv")
# store_id_relation<-read.csv("store_id_relation.csv")
These sites provide reservation, visit, and store information for restaurants in Japan. The training data covers January 2016 through April 2017. The test set covers the last week of April and May 2017, and intentionally spans Golden Week, a holiday week in Japan.
head(air_reserve) ; head(hpg_reserve)
## air_store_id visit_datetime reserve_datetime
## 1 air_877f79706adbfb06 2016-01-01 19:00:00 2016-01-01 16:00:00
## 2 air_db4b38ebe7a7ceff 2016-01-01 19:00:00 2016-01-01 19:00:00
## 3 air_db4b38ebe7a7ceff 2016-01-01 19:00:00 2016-01-01 19:00:00
## 4 air_877f79706adbfb06 2016-01-01 20:00:00 2016-01-01 16:00:00
## 5 air_db80363d35f10926 2016-01-01 20:00:00 2016-01-01 01:00:00
## 6 air_db80363d35f10926 2016-01-02 01:00:00 2016-01-01 16:00:00
## reserve_visitors
## 1 1
## 2 3
## 3 6
## 4 2
## 5 5
## 6 2
## hpg_store_id visit_datetime reserve_datetime
## 1 hpg_c63f6f42e088e50f 2016-01-01 11:00:00 2016-01-01 09:00:00
## 2 hpg_dac72789163a3f47 2016-01-01 13:00:00 2016-01-01 06:00:00
## 3 hpg_c8e24dcf51ca1eb5 2016-01-01 16:00:00 2016-01-01 14:00:00
## 4 hpg_24bb207e5fd49d4a 2016-01-01 17:00:00 2016-01-01 11:00:00
## 5 hpg_25291c542ebb3bc2 2016-01-01 17:00:00 2016-01-01 03:00:00
## 6 hpg_28bdf7a336ec6a7b 2016-01-01 17:00:00 2016-01-01 15:00:00
## reserve_visitors
## 1 1
## 2 3
## 3 2
## 4 5
## 5 13
## 6 2
head(air_visits); head(hpg_store_info)
## air_store_id visit_date visitors
## 1 air_ba937bf13d40fb24 2016-01-13 25
## 2 air_ba937bf13d40fb24 2016-01-14 32
## 3 air_ba937bf13d40fb24 2016-01-15 29
## 4 air_ba937bf13d40fb24 2016-01-16 22
## 5 air_ba937bf13d40fb24 2016-01-18 6
## 6 air_ba937bf13d40fb24 2016-01-19 9
## hpg_store_id hpg_genre_name hpg_area_name
## 1 hpg_6622b62385aec8bf Japanese style Tōkyō-to Setagaya-ku Taishidō
## 2 hpg_e9e068dd49c5fa00 Japanese style Tōkyō-to Setagaya-ku Taishidō
## 3 hpg_2976f7acb4b3a3bc Japanese style Tōkyō-to Setagaya-ku Taishidō
## 4 hpg_e51a522e098f024c Japanese style Tōkyō-to Setagaya-ku Taishidō
## 5 hpg_e3d0e1519894f275 Japanese style Tōkyō-to Setagaya-ku Taishidō
## 6 hpg_530cd91db13b938e Japanese style Tōkyō-to Setagaya-ku Taishidō
## latitude longitude
## 1 35.64367 139.6682
## 2 35.64367 139.6682
## 3 35.64367 139.6682
## 4 35.64367 139.6682
## 5 35.64367 139.6682
## 6 35.64367 139.6682
library(psych)
des<-rbind(describe(air_reserve),describe(hpg_reserve),describe(date_info),
describe(air_visits),describe(air_store_info),describe(hpg_store_info))
des
## vars n mean sd median trimmed mad
## air_store_id* 1 92378 151.34 92.55 147.00 150.44 121.57
## visit_datetime* 2 92378 2958.96 1188.51 3079.00 3054.18 1214.25
## reserve_datetime* 3 92378 4806.34 1942.51 5075.00 5004.60 1752.43
## reserve_visitors 4 92378 4.48 4.92 3.00 3.39 1.48
## hpg_store_id* 1 2000320 6680.24 3854.37 6683.00 6684.65 4965.23
## visit_datetime*1 2 2000320 5635.59 2681.73 6377.00 5810.69 2830.28
## reserve_datetime*1 3 2000320 6724.42 3228.43 7537.00 6912.04 3629.40
## reserve_visitors1 4 2000320 5.07 5.42 3.00 3.92 1.48
## calendar_date* 1 517 259.00 149.39 259.00 259.00 191.26
## day_of_week* 2 517 4.00 2.00 4.00 4.00 2.97
## holiday_flg 3 517 0.07 0.25 0.00 0.00 0.00
## air_store_id*1 1 252108 413.70 239.70 407.00 413.21 309.86
## visit_date* 2 252108 286.53 123.15 297.00 294.11 137.88
## visitors 3 252108 20.97 16.76 17.00 18.82 13.34
## air_store_id*2 1 829 415.00 239.46 415.00 415.00 306.90
## air_genre_name* 2 829 6.26 3.13 7.00 6.02 2.97
## air_area_name* 3 829 54.10 31.14 58.00 54.90 43.00
## latitude 4 829 35.65 2.08 35.66 35.26 0.10
## longitude 5 829 137.42 3.65 139.69 137.82 0.20
## hpg_store_id*1 1 4690 2345.50 1354.03 2345.50 2345.50 1738.35
## hpg_genre_name* 2 4690 14.85 5.08 16.00 14.65 5.93
## hpg_area_name* 3 4690 66.05 35.10 70.00 67.16 45.96
## latitude1 4 4690 35.81 2.14 35.66 35.38 1.01
## longitude1 5 4690 137.68 3.20 139.50 138.09 0.78
## min max range skew kurtosis se
## air_store_id* 1.00 314.00 313.00 0.10 -1.30 0.30
## visit_datetime* 1.00 4975.00 4974.00 -0.60 -0.39 3.91
## reserve_datetime* 1.00 7513.00 7512.00 -0.80 -0.19 6.39
## reserve_visitors 1.00 100.00 99.00 4.65 33.52 0.02
## hpg_store_id* 1.00 13325.00 13324.00 -0.01 -1.21 2.73
## visit_datetime*1 1.00 9847.00 9846.00 -0.50 -0.94 1.90
## reserve_datetime*1 1.00 11450.00 11449.00 -0.44 -0.99 2.28
## reserve_visitors1 1.00 100.00 99.00 4.60 36.10 0.00
## calendar_date* 1.00 517.00 516.00 0.00 -1.21 6.57
## day_of_week* 1.00 7.00 6.00 0.00 -1.26 0.09
## holiday_flg 0.00 1.00 1.00 3.43 9.79 0.01
## air_store_id*1 1.00 829.00 828.00 0.02 -1.21 0.48
## visit_date* 1.00 478.00 477.00 -0.41 -0.69 0.25
## visitors 1.00 877.00 876.00 3.31 74.26 0.03
## air_store_id*2 1.00 829.00 828.00 0.00 -1.20 8.32
## air_genre_name* 1.00 14.00 13.00 0.45 -0.41 0.11
## air_area_name* 1.00 103.00 102.00 -0.17 -1.34 1.08
## latitude 33.21 44.02 10.81 2.65 7.44 0.07
## longitude 130.20 144.27 14.08 -0.93 -0.55 0.13
## hpg_store_id*1 1.00 4690.00 4689.00 0.00 -1.20 19.77
## hpg_genre_name* 1.00 34.00 33.00 0.41 0.47 0.07
## hpg_area_name* 1.00 119.00 118.00 -0.19 -1.21 0.51
## latitude1 33.31 43.77 10.46 2.50 6.04 0.03
## longitude1 130.34 143.71 13.38 -0.98 -0.14 0.05
We find that the air_visits file contains the visitor counts for each visit_date and air_store_id. The date feature should be transformed into a time-series format. There are 829 unique participating stores, and the dataset holds about 252K records (~829 x 304) across 3 columns.
We find that air_reserve includes the date and time of each reservation as well as of the visit itself. We have reservation numbers for 314 air stores: about 92K records (~314 x 294) across 4 columns. hpg_reserve has the same structure and holds reservation numbers for 13,325 hpg stores: about 2M records (~13,325 x 150) across 4 columns.
The air_store_info file has geographical and cuisine-related information for each air store, with 829 records. hpg_store_info follows the same structure; however, it contains only 4,690 distinct hpg_store_ids, compared with the 13,325 seen in the reservation data above.
The date_info file essentially contains the holiday calendar: holiday_flg is a binary flag that is 1 on holidays and 0 otherwise.
Using describe(), we observed that there are no missing values in these files.
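As a quick sanity check of the describe() output, here is a minimal added sketch that counts missing values per file (the report itself relies on describe()):
sapply(list(air_reserve = air_reserve, hpg_reserve = hpg_reserve,
            air_visits = air_visits, date_info = date_info,
            air_store_info = air_store_info, hpg_store_info = hpg_store_info),
       function(d) sum(is.na(d))) # expect all zeros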
We convert the date/time features to proper date classes and recode a few features as logical and factor variables for exploration purposes.
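A minimal sketch of these conversions, assuming the column names shown in the head() and describe() output above (the exact recoding in the original analysis may differ):
air_visits$visit_date         <- as.Date(air_visits$visit_date)
air_reserve$visit_datetime    <- as.POSIXct(air_reserve$visit_datetime)   # "YYYY-MM-DD HH:MM:SS" parses directly
air_reserve$reserve_datetime  <- as.POSIXct(air_reserve$reserve_datetime)
hpg_reserve$visit_datetime    <- as.POSIXct(hpg_reserve$visit_datetime)
hpg_reserve$reserve_datetime  <- as.POSIXct(hpg_reserve$reserve_datetime)
date_info$calendar_date       <- as.Date(date_info$calendar_date)
date_info$holiday_flg         <- as.logical(date_info$holiday_flg)        # 1/0 -> TRUE/FALSE
date_info$day_of_week         <- as.factor(date_info$day_of_week)
air_store_info$air_genre_name <- as.factor(air_store_info$air_genre_name)
hpg_store_info$hpg_genre_name <- as.factor(hpg_store_info$hpg_genre_name)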
Visualizing Data
library(ggplot2); library(tidyr); library(purrr)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Warning: package 'tidyr' was built under R version 3.4.2
## Warning: package 'purrr' was built under R version 3.4.2
# air_visits[, .(total_visits = sum(visitors)), by = list(visit_date)] %>%
# ggplot(aes(x = total_visits)) +
# geom_histogram(fill = 'steelblue') +
# labs(x = 'Daily visits', title = 'Distribution of daily visits')
air.date<-aggregate(air_visits$visitors,by=list(visit_date = air_visits$visit_date), FUN=sum)
names(air.date)[2]<-c("daily.total")
hist(as.numeric(air.date$daily.total),xlab = 'Daily visits', main = 'Distribution of daily visits',breaks=29)
air.store<-aggregate(air_visits$visitors,by=list(air_store_id = air_visits$air_store_id) ,FUN=sum)
names(air.store)[2]<-c("store.total")
air.store.in<-merge(air_visits,air_store_info,by="air_store_id")
# install.packages("ggmap")
# library(ggmap)
# map_center = c(mean(air.store.in$longitude), mean(air.store.in$latitude))
# map = ggmap(get_googlemap(center=map_center, scale=2, zoom=4), extent="normal")
# map +
# geom_point(aes(x=air.store.in$longitude, y=air.store.in$lattitude), data=air.store.in, col="orange", alpha=0.4, size=air.store.in$store.total) +
# scale_size_continuous(range=range(air.store.in$store.total))
air.genre<-aggregate(air.store.in$visitors,by=list(genre = air.store.in$air_genre_name), FUN=sum)
barplot(names.arg = air.genre$genre,horiz = FALSE, height = air.genre$x,col=heat.colors(length(unique(air.genre$genre))), border = "dark blue")
title("Total customers by genre name")
The top three cuisine genres are Izakaya, Cafe/Sweets, and Italian/French.
Further visualizations that I created are available on my Tableau Public profile:
We thus learned in this step:
* There is a clear weekly cycle, with a prominent weekend vs. weekday effect (see the day-of-week sketch after this list).
* The average number of visitors is consistently highest in Asian cuisine restaurants in both years; either the cuisine is more liked or more Asian residents live in those areas.
* There are holidays in every month except June 2016, and September 2016 is the only month in which the average number of visits is lower on holidays than on non-holidays.
* Tōkyō-to, followed by Fukuoka-ken, offers the greatest variety of cuisines/genres.
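A minimal sketch behind the weekly-cycle observation (an added illustration, not code from the Tableau dashboards; air.dow and dow.mean are names introduced here), joining the visits to the calendar and averaging by day of week:
air.dow<-merge(air_visits, date_info, by.x = "visit_date", by.y = "calendar_date")
dow.mean<-aggregate(air.dow$visitors, by = list(day_of_week = air.dow$day_of_week), FUN = mean)
barplot(dow.mean$x, names.arg = dow.mean$day_of_week, ylab = "Mean visitors")
title("Average visitors by day of week")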
Literature
Several very useful studies on short-term demand planning for the restaurant industry, detailed below, have helped me look at this problem in a more holistic way.
The restaurant industry experiences high growth (a boom) every five years on average. The troughs of its growth cycles, in contrast to the peaks, coincide with those of the restaurant industry's business cycles. Choi et al. (1999) developed a hotel industry cycle model that represents the cyclical characteristics of the hotel industry over 28 years (1966 to 1993).
Choi, Jeong-Gil of Virginia Tech authored a study supporting the view that cyclical fluctuations in the growth of the restaurant industry can be projected by measuring and analyzing series of economic indicators; each economic indicator has specific characteristics in terms of time lags and can thus be classified as leading, coincident, or lagging. The study formed a set of composite indices, with twelve indicators classified as leading, six as coincident, and twenty as lagging [1].
Very little prior research has assessed econometric models for the restaurant industry. Cranage and Andrew (1992) as well as Morgan and Chintagunta (1997) created econometric models for restaurants; however, these studies focused on data from a single operation and forecasted only weeks or months into the future. In addition, these studies lacked aggregated data that is practical at the unit level; moreover, there was no comparison of model fit across industry segments.
Data Checks / Sample Run
library(forecast)
## Warning: package 'forecast' was built under R version 3.4.3
sample.air.id = "air_ba937bf13d40fb24"
sample.data<- subset(air_visits, air_store_id == sample.air.id)
sample.train<- subset(sample.data,as.Date(visit_date) < "2017-04-08")
sample.valid<- subset(sample.data,as.Date(visit_date) >= "2017-04-08")
myseries<-ts(sample.train$visitors, frequency = 7)
plot(myseries)
plot(tsclean(myseries))
plot(decompose(myseries))
ses.fit1<-ses(myseries, alpha=0.2, initial="simple", h=14)
summary(ses.fit1)
##
## Forecast method: Simple exponential smoothing
##
## Model Information:
## Simple exponential smoothing
##
## Call:
## ses(y = myseries, h = 14, initial = "simple", alpha = 0.2)
##
## Smoothing parameters:
## alpha = 0.2
##
## Initial states:
## l = 25
##
## sigma: 12.0782
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.02180043 12.07819 9.514467 -62.49375 88.55174 0.8102078
## ACF1
## Training set -0.05489637
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 54.85714 23.35625 7.877423 38.83507 -0.3165709 47.02907
## 55.00000 23.35625 7.570882 39.14161 -0.7853851 47.49788
## 55.14286 23.35625 7.270182 39.44231 -1.2452670 47.95776
## 55.28571 23.35625 6.975000 39.73749 -1.6967086 48.40920
## 55.42857 23.35625 6.685044 40.02745 -2.1401582 48.85265
## 55.57143 23.35625 6.400045 40.31245 -2.5760258 49.28852
## 55.71429 23.35625 6.119759 40.59274 -3.0046874 49.71718
## 55.85714 23.35625 5.843957 40.86854 -3.4264891 50.13898
## 56.00000 23.35625 5.572433 41.14006 -3.8417501 50.55424
## 56.14286 23.35625 5.304992 41.40750 -4.2507655 50.96326
## 56.28571 23.35625 5.041456 41.67104 -4.6538089 51.36630
## 56.42857 23.35625 4.781659 41.93084 -5.0511345 51.76363
## 56.57143 23.35625 4.525445 42.18705 -5.4429790 52.15547
## 56.71429 23.35625 4.272672 42.43982 -5.8295631 52.54206
ses.fit2<-ses(myseries, alpha=0.8, initial="simple", h=14)
plot(ses.fit1, ylab="Daily Vistors", xlab="Days", main="", fcol="white", type="o")
lines(fitted(ses.fit1), col="blue", type="o")
lines(fitted(ses.fit2), col="red", type="o")
lines(ses.fit1$mean, col="blue", type="o")
lines(ses.fit2$mean, col="red", type="o")
legend("topleft",lty=1, col=c(1,"blue","red"), c("data",expression(alpha == 0.2),expression(alpha == 0.8)),pch=1)
# Holt Winters Linear Trend
holt.fit1 <- holt(myseries, alpha=0.8, beta=0.2, initial="simple", h=14)
summary(holt.fit1)
##
## Forecast method: Holt's method
##
## Model Information:
## Holt's method
##
## Call:
## holt(y = myseries, h = 14, initial = "simple", alpha = 0.8, beta = 0.2)
##
## Smoothing parameters:
## alpha = 0.8
## beta = 0.2
##
## Initial states:
## l = 25
## b = 7
##
## sigma: 16.0293
## Error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.04519378 16.02928 12.51845 -72.98921 114.3338 1.066013
## ACF1
## Training set -0.2833701
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 54.85714 43.63378 23.091430 64.17614 12.216965 75.05060
## 55.00000 47.90770 18.856420 76.95897 3.477603 92.33779
## 55.14286 52.18161 14.081216 90.28200 -6.087913 110.45113
## 55.28571 56.45552 8.719401 104.19164 -16.550573 129.46161
## 55.42857 60.72943 2.772317 118.68654 -27.908326 149.36718
## 55.57143 65.00334 -3.744525 133.75121 -40.137447 170.14413
## 55.71429 69.27725 -10.811588 149.36609 -53.208058 191.76256
## 55.85714 73.55116 -18.408858 165.51119 -67.089554 214.19188
## 56.00000 77.82508 -26.517142 182.16729 -81.752576 237.40273
## 56.14286 82.09899 -35.118501 199.31647 -97.169690 261.36766
## 56.28571 86.37290 -44.196347 216.94214 -113.315530 286.06133
## 56.42857 90.64681 -53.735400 235.02902 -130.166725 311.46034
## 56.57143 94.92072 -63.721590 253.56303 -147.701756 337.54320
## 56.71429 99.19463 -74.141941 272.53121 -165.900780 364.29004
# Exponential Trend
holt.fit2<-holt(myseries,alpha=0.8,beta=0.2,initial="simple",exponential=TRUE, h=14)
# Damped Trend
holt.fit3 <- holt(myseries, alpha=0.8, beta=0.2, damped=TRUE, initial="simple", h=14)
## Warning in holt(myseries, alpha = 0.8, beta = 0.2, damped = TRUE, initial =
## "simple", : Damped Holt's method requires optimal initialization
# Results for first model:
# holt.fit1$model$state
plot(ses.fit1, ylab="Daily Visitors", xlab="Days", main="", fcol="white", type="o") # reuse the data plot; its forecast is drawn in white so the Holt fits below stand out
lines(fitted(holt.fit1), col="blue", type="o")
lines(fitted(holt.fit2), col="red", type="o")
lines(fitted(holt.fit3), col="green", type="o")
lines(holt.fit1$mean, col="blue", type="o")
lines(holt.fit2$mean, col="red", type="o")
lines(holt.fit3$mean, col="green", type="o")
legend("bottomleft",lty=1, col=c(1,"blue","red","green"), c("data","Holt1", "Holt2-Exp", "Holt3-damp"), pch=1)
tsdisplay(sample.data$visitors)
tsdisplay(diff(sample.data$visitors,1))
arima.fit <- auto.arima(myseries,stepwise = FALSE, approximation = FALSE)
summary(arima.fit)
## Series: myseries
## ARIMA(2,1,3)
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3
## 1.0580 -0.9203 -2.0242 1.8315 -0.7728
## s.e. 0.0383 0.0356 0.0643 0.0914 0.0478
##
## sigma^2 estimated as 118.3: log likelihood=-1429.97
## AIC=2871.94 AICc=2872.17 BIC=2895.52
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1814535 10.78761 8.647176 -52.97218 76.66563 0.7363533
## ACF1
## Training set -0.07758166
arima_visits <- forecast(arima.fit,h = 14)
# accuracy(arima_visits$mean, sample.valid$visitors)
plot(arima.fit$x, ylab="Daily Visitors", xlab="Days", main="",type="o")
lines(fitted(arima.fit), col="blue", type="o")
lines(arima_visits$mean, col="blue", type="o") # overlay the 14-day ARIMA forecast
Similarly, I will run these models and produce forecasts for each restaurant across the forecast period.
Types of Models
In this forecasting project, I experimented with several classes of models: exponential smoothing models, ARIMA models, and more advanced techniques such as Random Forests and XGBoost.
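The Random Forest and XGBoost experiments are not reproduced in this report; as one hedged illustration of how a gradient-boosted model could be set up on these files (all feature choices here are my assumptions, not features prescribed by the analysis):
library(xgboost)
feats<-merge(air_visits, date_info, by.x = "visit_date", by.y = "calendar_date")
feats<-merge(feats, air_store_info, by = "air_store_id")
# one-hot encode day of week and genre; holiday_flg enters as 0/1
X<-model.matrix(~ day_of_week + holiday_flg + air_genre_name - 1, data = feats)
dtrain<-xgb.DMatrix(X, label = log1p(feats$visitors)) # model log(1 + visitors), matching the RMSLE metric
bst<-xgb.train(params = list(objective = "reg:squarederror", eta = 0.1, max_depth = 6),
               data = dtrain, nrounds = 100)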
Formulation
Econometrics-based mathematical formulations of the models discussed above are given below:
* Simple Exponential Smoothing: \(F(t+1) = \alpha \times D(t) + (1-\alpha) \times F(t)\)
* Holt Linear Trend: \(F(t+h) = \ell(t) + h \times b(t)\), with level \(\ell(t) = \alpha \times D(t) + (1-\alpha)(\ell(t-1) + b(t-1))\) and trend \(b(t) = \beta(\ell(t) - \ell(t-1)) + (1-\beta) \times b(t-1)\)
Moving average: The number of periods used determines how quickly the average reacts to changes in actual trends and how sensitive it is to random variations. Including more periods makes the calculation more stable against random variations, but it also makes it react more slowly to changes resulting from real trends.
\(F(t+1) = a \times M + (1-a) \times L\)
Exponential smoothing: This forecast formula weights the latest base demand value with the smoothing constant \(\alpha\), while the previous base forecast value is weighted with \(1-\alpha\):
\(F(t+1) = \alpha \times D(t) + (1-\alpha) \times F(t)\)
The value of the smoothing constant determines how quickly the forecast reacts to changes in actual trends and how sensitive it is to random variations. The lower the value, the more stable the calculation is against random variations, but the more slowly it reacts to changes resulting from real trends. The smoothing constant must lie between 0 and 1.
Adaptive smoothing: This variant uses \(\alpha(t)\) values that are recalculated each period and adjusted for the current systematic forecast error; a larger mean forecast error results in a higher \(\alpha(t)\), which corrects the forecast more quickly toward actual demand:
\(F(t+1) = \alpha(t) \times D(t) + (1-\alpha(t)) \times F(t)\)
Key:
D(t): base demand at time t.
F(t): base forecast at time t.
L: level (mean) of demand over all observations.
M: level (mean) of demand over the latest 25% of observations.
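To make the SES recursion concrete, here is a minimal hand-rolled implementation of the F(t+1) formula above (an added sketch; the actual fits in this report use forecast::ses, and ses_manual is a name introduced here):
ses_manual <- function(y, alpha, h = 14) {
  n <- length(y)
  f <- numeric(n + 1)
  f[1] <- y[1]                                    # simple initialization, as in initial = "simple"
  for (t in 1:n) {
    f[t + 1] <- alpha * y[t] + (1 - alpha) * f[t] # F(t+1) = alpha x D(t) + (1-alpha) x F(t)
  }
  rep(f[n + 1], h)                                # SES forecasts are flat beyond the data
}
# ses_manual(sample.train$visitors, alpha = 0.2)[1] should match ses.fit1's point forecast (23.35625)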
Final Modeling
# Plan: define the models, store the results in one place, and pick the best model;
# loop over stores, subset each store's data, split off a validation window,
# and forecast over the validation period with the best model.
storelist<-unique(air_visits$air_store_id)
results<-NULL
#names(results)<-c("modelName","p","d","q","P","D","Q","S")
#for(i in 1:length(storelist)) {
#  # Data prep: hold out the last days of each store's history as validation
#  store_i<-subset(air_visits, air_store_id == storelist[i])
#  store_i$visit_date<-as.Date(store_i$visit_date)
#  split_date<-max(store_i$visit_date) - 13
#  store_i.train<-subset(store_i, visit_date <= split_date)
#  store_i.valid<-subset(store_i, visit_date > split_date)
#  myts<-ts(store_i.train$visitors, frequency = 7)
#  # Build models
#  ses.fit<-ses(myts, h = 13)
#  results<-rbind(results, c("SES", ses.fit$model$par[1], 0, 0, 0, 0, 0, 0))
#  arima.fit<-auto.arima(myts, approximation = FALSE)
#  arima_visits<-forecast(arima.fit, h = 13)
#  results<-rbind(results, c("ARIMA", arimaorder(arima.fit))) # p, d, q (plus P, D, Q, S if seasonal)
#}
# Save results
# Add other variables in a regression if time permits.
Performance Accuracy
For within-model selection I pick the lowest AICc; for between-model comparison I use RMSLE on the test dataset.
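Since RMSLE is the competition metric, here is a minimal helper for the between-model comparison (an added sketch, assuming equal-length numeric vectors of actual and predicted visitors):
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}
# e.g., rmsle(sample.valid$visitors, head(arima_visits$mean, nrow(sample.valid)))
AICc for within-model selection is read directly off the fitted objects, e.g. arima.fit$aicc.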
library(data.table)
## Warning: package 'data.table' was built under R version 3.4.2
##
## Attaching package: 'data.table'
## The following object is masked from 'package:purrr':
##
## transpose
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following object is masked from 'package:base':
##
## date
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#library(date)
library(doParallel)
## Warning: package 'doParallel' was built under R version 3.4.4
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.4.2
##
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
##
## accumulate, when
## Loading required package: iterators
## Loading required package: parallel
library(snow)
## Warning: package 'snow' was built under R version 3.4.4
##
## Attaching package: 'snow'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, clusterSplit, makeCluster,
## parApply, parCapply, parLapply, parRapply, parSapply,
## splitIndices, stopCluster
# data load and pre-processing
train_wide <- dcast(
air_visits, air_store_id ~ visit_date, value.var = "visitors", fill = 0)
train_ts <- ts(train_wide[, 2:dim(train_wide)[2]], frequency = 7)
fcst_intv = 39 ### 39 days of forecast horizon
fcst_matrix <- matrix(NA,nrow=4*nrow(train_ts),ncol=fcst_intv)
### forecasting - register cores for parallel processing
workers=makeCluster(4,type="SOCK")
registerDoParallel(workers)
foreach(i=1:4) %dopar% Sys.getpid()
## [[1]]
## [1] 17972
##
## [[2]]
## [1] 20344
##
## [[3]]
## [1] 4676
##
## [[4]]
## [1] 19992
registerDoParallel(detectCores()-1)
fcst_matrix <- foreach(i=1:nrow(train_ts),.combine=rbind, .packages=c("forecast")) %dopar% {
fcst_ets <- forecast(ets(train_ts[i,]),h=fcst_intv)$mean
fcst_nnet <- forecast(nnetar(train_ts[i,]),h=fcst_intv)$mean
fcst_arima <- forecast(auto.arima(train_ts[i,]),h=fcst_intv)$mean
fcst_ses <- forecast(HoltWinters(train_ts[i,], beta=FALSE, gamma=FALSE),h=fcst_intv)$mean
fcst_matrix <- rbind(fcst_ets, fcst_nnet, fcst_arima, fcst_ses)
}
### post-processing the forecast table
fcst_matrix_mix <- aggregate(fcst_matrix,list(rep(1:(nrow(fcst_matrix)/4),each=4)),mean)[-1]
fcst_matrix_mix[fcst_matrix_mix < 0] <- 0
colnames(fcst_matrix_mix) <- as.character(
seq(from = as.Date("2017-04-23"), to = as.Date("2017-05-31"), by = 'day'))
fcst_df <- as.data.frame(cbind(train_wide[, 1], fcst_matrix_mix))
colnames(fcst_df)[1] <- "air_store_id"
### melt the forecast data frame from wide to long format for final submission
fcst_df_long <- melt(
fcst_df, id = 'air_store_id', variable.name = "fcst_date", value.name = 'visitors')
fcst_df_long$air_store_id <- as.character(fcst_df_long$air_store_id)
fcst_df_long$fcst_date <- as.Date(parse_date_time(fcst_df_long$fcst_date,'%y-%m-%d'))
fcst_df_long$visitors <- as.numeric(fcst_df_long$visitors)
### get & process the sample submission file
sample_sub <- fread("sample_submission.csv")
sample_sub$visitors <- NULL
sample_sub$store_id <- substr(sample_sub$id, 1, 20)
sample_sub$visit_date <- substr(sample_sub$id, 22, 31)
sample_sub$visit_date <- as.Date(parse_date_time(sample_sub$visit_date,'%y-%m-%d'))
### generate the final submission file
submission <- left_join(
sample_sub, fcst_df_long, c("store_id" = "air_store_id", 'visit_date' = 'fcst_date'))
submission$visitors[is.na(submission$visitors)] <- 0
final_sub <- select(submission, c('id', 'visitors'))
write.csv(final_sub, "sub_ts_mix.csv", row.names = FALSE)
Limitations
The main limitation of this project is the limited availability of data on factors that may affect the flow of restaurant visitors. As more variables related to visitor counts become available, I would expect the causal forecasting results to improve.
Future Work
As next steps, I would like to focus on store (restaurant) level visits and group stores by parameters such as store size, proximity to tourist attractions, or some rank or tier of the city/town, and then build models based on that segmentation.
Learnings
I honed my forecasting skills during this project. I also learned that, alongside forecasting itself, it is equally important to develop an understanding of the available data. Asking the right questions and developing the right hypotheses are significant parts of any modeling effort.