Report on data analysis and prediction methods-Store Item forecasting

Data and evaluation criteria is given on the following link : Store Demand Forecasting

Loading the store item dataset (Training data and Unknown data)

options(scipen = 999)
rm(list = ls())
gc()
starttime=Sys.time()
first_day_of_month_wday <- function(dt) {
  day(dt) <- 1
  wday(dt)
}

storedata_origtrain=fread("E:\\AbhinavB\\Kaggle\\StoreItemDemandPrediction\\train.csv",
                          data.table = FALSE,stringsAsFactors = TRUE)
storedata_origtest=fread("E:\\AbhinavB\\Kaggle\\StoreItemDemandPrediction\\test.csv",data.table = FALSE,stringsAsFactors = TRUE)

Combining Training and Unknown data for adding features

dim(storedata_origtrain)

## [1] 913000      4

dim(storedata_origtest)

## [1] 45000     4

storedata_train=storedata_origtrain
storedata_test=storedata_origtest
storedata_train$id=NA
storedata_train$ind="train"
storedata_test$sales=NA
storedata_test$ind="test"
combined=rbind(storedata_train,storedata_test)
storedata=combined
dim(storedata)

## [1] 958000      6

glimpse(storedata)

## Observations: 958,000
## Variables: 6
## $ date  <fct> 2013-01-01, 2013-01-02, 2013-01-03, 2013-01-04, 2013-01-...
## $ store <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ item  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ sales <int> 13, 11, 14, 13, 10, 12, 10, 9, 12, 9, 9, 7, 10, 12, 5, 7...
## $ id    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ind   <chr> "train", "train", "train", "train", "train", "train", "t...

Date related feature engineering and validation set for testing model on known data

storedata$date2=ymd(storedata$date)
storedata$YEAR=lubridate::year(storedata$date2)
storedata$MNTH=lubridate::month(storedata$date2)
storedata$MNTH_NM=lubridate::month(storedata$date2,label=TRUE)
storedata$Quarter=lubridate::quarter(storedata$date2)
storedata$Semester=lubridate::semester(storedata$date2)
##Day of week month year 
storedata$DayOfWeek=lubridate::wday(storedata$date2)
storedata$DayOfWeek_Name=lubridate::wday(storedata$date2,label=TRUE)
storedata$DayOfMonth=lubridate::mday(storedata$date2)
storedata$DayOfYear=lubridate::yday(storedata$date2)
storedata$DayOfQuarter=lubridate::qday(storedata$date2)
##Week of year & month
storedata$WeekOfYr=lubridate::week(storedata$date2)
storedata$WeekOfMnth=ceiling((day(storedata$date2)+first_day_of_month_wday(storedata$date2)-1)/7)
dim(storedata)
testset=storedata[storedata$ind=="test",]
dim(testset)
train=storedata[storedata$ind=="train",]
dim(train)
print("test and train deMerged")
## train and validation set created
valset=train[train$date2>=as.Date("2017-01-01"),]
dim(valset)
trainset=train[train$date2<as.Date("2017-01-01"),]
dim(trainset)
train=train[,-which(names(train) %in% c("id","ind"))]

Store1 Item 1 sales over different time periods

###Imposing linear regression line on sales scatter plot ###Imposing linear regression line on sales scatter plot for all observations ####All stores year wise sales(for all items) Data shows STORES 2,8,3,10,9,4,1 as leading the way ,with STORES 5,6 following and with STORE 7 as the laggard in terms of sales size. This conclusion assumes that all stores are of the same size (sqft)with roughly same inventory.

All store item wise year wise sales for Item1

Similarly plotting data for Item 2,3,10,20,50 as well The above visuals show that for any item the trend is increasing across stores with time (on available data of 5 years)

Monthly seasonality-Monthly Total Sales Per STORE

The above visuals are showing that mostly sales across all stores are following almost same pattern and monthly seasonality is present.

Weekday Total sales

The above graph shows that Weekend ie Sunday and Saturday drive the sales across all stores.

Create a timeseries object

inds=seq(from=as.Date("2013-01-01"),to=as.Date("2016-12-31"),by="day")   ## this provides clear cut x axis values year wise 
last=length(inds)
Firstdayofyear=as.numeric(format(inds[1], "%j"))
Lastdayofyear=as.numeric(format(inds[last],"%j"))
sales_ts=ts(trainset$sales,start = c(2013,Firstdayofyear),end=c(2016,Lastdayofyear),frequency = 365)
head(sales_ts)
tail(sales_ts)

Decompose the timeseries object into componenents (Trend,seasonal,residuals)

train_ts_comp=decompose(sales_ts)
autoplot(train_ts_comp)

Lets generate a benchmark model using Naive and seasonal approach (we will check the smape on validation set (known data))

valset90=valset$sales[1:90]
mean_model=meanf(sales_ts,h=90)
predicted_mean_model=mean_model$mean
smape(valset90,predicted_mean_model)

## [1] 0.270679

Where smape is symetric mean absolute percentage error (error metric being used for this forecasting exercise) So,smape=0.270679 is the bench mark for us to beat.

Lets try Naive and seasonal Naive methods and generate corresponding smape

naive_model=naive(sales_ts,h=90)
predicted_naive_model=naive_model$mean
smape(valset90,predicted_naive_model)

## [1] 0.2714592

In this case, smape=0.2714592 Lets run seasonal Naive as well

snaive_model=snaive(sales_ts,h=90)
predicted=snaive_model$mean
smape(valset90,predicted)

## [1] 0.338339

In this case,smape=0.338339 ### Lets try the ARIMA approach Begin with checking stationarity of the overall timeseries

acf(sales_ts,lag.max = 60)

Acf plot reveals that autocorr is not cutting off indicating non stationary TS

pacf(sales_ts,lag.max = 60)

Pacf plot reveals that partial autocorr not cutting off either

Acf graph doesnot cut of below significance level hence ts is non stationary

Lets find after how many differences will the timeseries become stationary using R function ‘ndiffs’ and then differencing the timeseries

ndiffs(sales_ts)

## [1] 1

##results in 1

#first differencing 
sales_ts_stnry=diff(sales_ts,differences=1)
plot.ts(sales_ts_stnry)

After making timeseries stationary: we plot acf again to find the order of Moving average process(MA) we plot pacf again to find the order of Auto correlation process(AR) (p,d,q)–>(AR,d,MA) is found by (PACF,ndiff,ACF)

acf(sales_ts_stnry)

pacf(sales_ts_stnry)

p= -7 lags from pacf AR process q= -1 lags from acf MA process d= -1 from ndiffs and sationary TS Hence combinations for ARIMA model (p,d,q) -(0,1,0);(0,0,1);(0,1,1);(7,0,0);(7,1,1)

However grid search gives lowest AIC at (7,1,7)

arima_mod=arima(sales_ts,order=c(7,1,7),method="ML")
arima_mod$aic

## [1] 8637.278

##predicting with the created model
pred=forecast(arima_mod,h=90)
predicted=pred$mean
smape(valset90,predicted)

## [1] 0.2153837

endtime=Sys.time()
timetaken=endtime-starttime
timetaken

## Time difference of 9.009759 secs

Slightly better mean absolute error with simple ARIMA model,lets incorporate seasonal ARIMA component as well.

Store Item demand Forecasting -Exploratory analysis and Report

Abhinav B.

4 January 2020