Data and evaluation criteria is given on the following link : Store Demand Forecasting
options(scipen = 999)
rm(list = ls())
gc()
starttime=Sys.time()
first_day_of_month_wday <- function(dt) {
day(dt) <- 1
wday(dt)
}
storedata_origtrain=fread("E:\\AbhinavB\\Kaggle\\StoreItemDemandPrediction\\train.csv",
data.table = FALSE,stringsAsFactors = TRUE)
storedata_origtest=fread("E:\\AbhinavB\\Kaggle\\StoreItemDemandPrediction\\test.csv",data.table = FALSE,stringsAsFactors = TRUE)
dim(storedata_origtrain)
## [1] 913000 4
dim(storedata_origtest)
## [1] 45000 4
storedata_train=storedata_origtrain
storedata_test=storedata_origtest
storedata_train$id=NA
storedata_train$ind="train"
storedata_test$sales=NA
storedata_test$ind="test"
combined=rbind(storedata_train,storedata_test)
storedata=combined
dim(storedata)
## [1] 958000 6
glimpse(storedata)
## Observations: 958,000
## Variables: 6
## $ date <fct> 2013-01-01, 2013-01-02, 2013-01-03, 2013-01-04, 2013-01-...
## $ store <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ item <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ sales <int> 13, 11, 14, 13, 10, 12, 10, 9, 12, 9, 9, 7, 10, 12, 5, 7...
## $ id <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ind <chr> "train", "train", "train", "train", "train", "train", "t...
###Imposing linear regression line on sales scatter plot
###Imposing linear regression line on sales scatter plot for all observations
####All stores year wise sales(for all items)
Data shows STORES 2,8,3,10,9,4,1 as leading the way ,with STORES 5,6 following and with STORE 7 as the laggard in terms of sales size. This conclusion assumes that all stores are of the same size (sqft)with roughly same inventory.
Similarly plotting data for Item 2,3,10,20,50 as well
The above visuals show that for any item the trend is increasing across stores with time (on available data of 5 years)
The above visuals are showing that mostly sales across all stores are following almost same pattern and monthly seasonality is present.
The above graph shows that Weekend ie Sunday and Saturday drive the sales across all stores.
inds=seq(from=as.Date("2013-01-01"),to=as.Date("2016-12-31"),by="day") ## this provides clear cut x axis values year wise
last=length(inds)
Firstdayofyear=as.numeric(format(inds[1], "%j"))
Lastdayofyear=as.numeric(format(inds[last],"%j"))
sales_ts=ts(trainset$sales,start = c(2013,Firstdayofyear),end=c(2016,Lastdayofyear),frequency = 365)
head(sales_ts)
tail(sales_ts)
train_ts_comp=decompose(sales_ts)
autoplot(train_ts_comp)
valset90=valset$sales[1:90]
mean_model=meanf(sales_ts,h=90)
predicted_mean_model=mean_model$mean
smape(valset90,predicted_mean_model)
## [1] 0.270679
Where smape is symetric mean absolute percentage error (error metric being used for this forecasting exercise) So,smape=0.270679 is the bench mark for us to beat.
naive_model=naive(sales_ts,h=90)
predicted_naive_model=naive_model$mean
smape(valset90,predicted_naive_model)
## [1] 0.2714592
In this case, smape=0.2714592 Lets run seasonal Naive as well
snaive_model=snaive(sales_ts,h=90)
predicted=snaive_model$mean
smape(valset90,predicted)
## [1] 0.338339
In this case,smape=0.338339 ### Lets try the ARIMA approach Begin with checking stationarity of the overall timeseries
acf(sales_ts,lag.max = 60)
Acf plot reveals that autocorr is not cutting off indicating non stationary TS
pacf(sales_ts,lag.max = 60)
Pacf plot reveals that partial autocorr not cutting off either
Acf graph doesnot cut of below significance level hence ts is non stationary
Lets find after how many differences will the timeseries become stationary using R function ‘ndiffs’ and then differencing the timeseries
ndiffs(sales_ts)
## [1] 1
##results in 1
#first differencing
sales_ts_stnry=diff(sales_ts,differences=1)
plot.ts(sales_ts_stnry)
After making timeseries stationary: we plot acf again to find the order of Moving average process(MA) we plot pacf again to find the order of Auto correlation process(AR) (p,d,q)–>(AR,d,MA) is found by (PACF,ndiff,ACF)
acf(sales_ts_stnry)
pacf(sales_ts_stnry)
p= -7 lags from pacf AR process q= -1 lags from acf MA process d= -1 from ndiffs and sationary TS Hence combinations for ARIMA model (p,d,q) -(0,1,0);(0,0,1);(0,1,1);(7,0,0);(7,1,1)
However grid search gives lowest AIC at (7,1,7)
arima_mod=arima(sales_ts,order=c(7,1,7),method="ML")
arima_mod$aic
## [1] 8637.278
##predicting with the created model
pred=forecast(arima_mod,h=90)
predicted=pred$mean
smape(valset90,predicted)
## [1] 0.2153837
endtime=Sys.time()
timetaken=endtime-starttime
timetaken
## Time difference of 9.009759 secs
Slightly better mean absolute error with simple ARIMA model,lets incorporate seasonal ARIMA component as well.