INTRODUCTION

ABOUT DATA



We are trying to forecast a time series data of a retail company. The data is about sales of the thousands of products of different families sold at stores.The training data includes dates, store and product information, whether that item was being promoted, as well as the sales numbers. The data is publically available on Kaggle website. The dataset have sale details for various products on day to day basis. The data has various aspects which can be explored going ahead. For now, we will be using single time series of total daily sales across all the products and try to forecast the sales with different approaches. I find this tak challenging as it involves various factors such as seasonality,holiday events along with forecasting sales.

Data summary

File Descriptions and Data Field Information

sales_train.csv
The training data, comprising time series of features store_nbr, family, and on promotion as well as the target sales. store_nbr identifies the store at which the products are sold. family identifies the type of product sold. on promotion gives the total number of items in a product family that were being promoted at a store at a given date.

Structure of Data


spec_tbl_df [2,935,849 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ date          : Date[1:2935849], format: "2013-01-02" "2013-01-03" ...
 $ date_block_num: num [1:2935849] 0 0 0 0 0 0 0 0 0 0 ...
 $ shop_id       : num [1:2935849] 59 25 25 25 25 25 25 25 25 25 ...
 $ item_id       : num [1:2935849] 22154 2552 2552 2554 2555 ...
 $ item_price    : num [1:2935849] 999 899 899 1709 1099 ...
 $ item_cnt_day  : num [1:2935849] 1 1 -1 1 1 1 1 1 1 3 ...
 - attr(*, "spec")=
  .. cols(
  ..   date = col_character(),
  ..   date_block_num = col_double(),
  ..   shop_id = col_double(),
  ..   item_id = col_double(),
  ..   item_price = col_double(),
  ..   item_cnt_day = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
      date            date_block_num     shop_id      item_id     
 Min.   :2013-01-01   Min.   : 0.00   Min.   : 0   Min.   :    0  
 1st Qu.:2013-08-01   1st Qu.: 7.00   1st Qu.:22   1st Qu.: 4476  
 Median :2014-03-04   Median :14.00   Median :31   Median : 9343  
 Mean   :2014-04-03   Mean   :14.57   Mean   :33   Mean   :10197  
 3rd Qu.:2014-12-05   3rd Qu.:23.00   3rd Qu.:47   3rd Qu.:15684  
 Max.   :2015-10-31   Max.   :33.00   Max.   :59   Max.   :22169  
   item_price        item_cnt_day     
 Min.   :    -1.0   Min.   : -22.000  
 1st Qu.:   249.0   1st Qu.:   1.000  
 Median :   399.0   Median :   1.000  
 Mean   :   890.9   Mean   :   1.243  
 3rd Qu.:   999.0   3rd Qu.:   1.000  
 Max.   :307980.0   Max.   :2169.000  

BASIC DATA QUALITY CHECKS

Null value check


[1] "Unique block numbers are the months in the data"
 [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
[26] 25 26 27 28 29 30 31 32 33
[1] "Number of NA values in columns are"
$date
[1] 0

$date_block_num
[1] 0

$shop_id
[1] 0

$item_id
[1] 0

$item_price
[1] 0

$item_cnt_day
[1] 0

Summary

      date            date_block_num     shop_id      item_id     
 Min.   :2013-01-01   Min.   : 0.00   Min.   : 0   Min.   :    0  
 1st Qu.:2013-08-01   1st Qu.: 7.00   1st Qu.:22   1st Qu.: 4476  
 Median :2014-03-04   Median :14.00   Median :31   Median : 9343  
 Mean   :2014-04-03   Mean   :14.57   Mean   :33   Mean   :10197  
 3rd Qu.:2014-12-05   3rd Qu.:23.00   3rd Qu.:47   3rd Qu.:15684  
 Max.   :2015-10-31   Max.   :33.00   Max.   :59   Max.   :22169  
   item_price        item_cnt_day     
 Min.   :    -1.0   Min.   : -22.000  
 1st Qu.:   249.0   1st Qu.:   1.000  
 Median :   399.0   Median :   1.000  
 Mean   :   890.9   Mean   :   1.243  
 3rd Qu.:   999.0   3rd Qu.:   1.000  
 Max.   :307980.0   Max.   :2169.000  


We could see in summry that there is some inconsistancy in price and item count per day column. We have some negative values which does not make sense. Lets see in detail.
[1] "Number of negative values in item_cnt_day"
[1] 7356
[1] "Number of negative values in item_price"
[1] 1

Boxplot for both columns

There is value in item price which is too far from other values, as it is only one value there is high chance of error, so we replace that by median and plot new boxplot.


Updated box Plot


Summary statistics

sales
Name Class Values Missing Summary
date Date Time: 2013-01-01 to 2015-10-31 0 mean: 16163.2392936421, sd: 286.899, nuniq: 1034
date_block_num numeric Num: 0 to 33 0 mean: 14.57, sd: 9.423, nuniq: 34
shop_id numeric Num: 0 to 59 0 mean: 33.002, sd: 16.227, nuniq: 60
item_id numeric Num: 0 to 22169 0 mean: 10197.227, sd: 6324.297, nuniq: 21807
item_price numeric Num: -1 to 307980 0 mean: 890.853, sd: 1729.8, nuniq: 19993
item_cnt_day numeric Num: -22 to 2169 0 mean: 1.243, sd: 2.619, nuniq: 198
Creating our time series to track total daily sales. Keeping only two columns hereafter.
      date             daily_Sales    
 Min.   :2013-01-01   Min.   : 11.94  
 1st Qu.:2013-09-16   1st Qu.: 20.96  
 Median :2014-06-01   Median : 26.64  
 Mean   :2014-06-01   Mean   : 32.87  
 3rd Qu.:2015-02-14   3rd Qu.: 36.94  
 Max.   :2015-10-31   Max.   :365.45  

Daily total sales plot

Looking at plot, the series looks quite stationary, but have seasonal factors involved. The trend and variance seems to be same overall series.

Cummulative sales per day.




APPROACHING TO MODELLING

Understanding the series

UNDERSTANDING THE SERIES

The series appeared to be bit stationary before in terms of both variance and mean. Plotting again with trend line.The trend is bit constant here.

Lets see if we have any auto-correlation in the series to move ahead

ACF & PACF GRAPHS


From above plots we could see that there is strong autocorrelation in the data

Checking wheather the given series is stationary or not


    Augmented Dickey-Fuller Test

data:  sales$daily_Sales
Dickey-Fuller = -5.8801, Lag order = 10, p-value = 0.01
alternative hypothesis: stationary

    KPSS Test for Level Stationarity

data:  sales$daily_Sales
KPSS Level = 0.41177, Truncation lag parameter = 7, p-value = 0.07208

The ADF test has Null hypothesis that given series is not stationary and KPSS test has Null Hypothesis as the given series is stationary. Thus both test suggest that given series is already stationary.

We will try to check how much difference it makes if we transformed the given series.

Lets check the histogram of our sales

The distribution looks right skewed.

We will log transform the series and plot the histogram.

Now the distribution of the sales is close to normal. But Remember, we dont look for the ditribution of Y variable to be normal, that is not our concern. So we just leave it here.

Lets try box-cox transformation and check

There seem to be no difference in original and transformed series here, we have already checked that our series is stationary. Box-Cox performed well here by not doing anything.

Comparing histogram of all three

Performing stationarity test on all.


    Augmented Dickey-Fuller Test

data:  train_log$sale_log
Dickey-Fuller = -5.3587, Lag order = 10, p-value = 0.01
alternative hypothesis: stationary

    KPSS Test for Level Stationarity

data:  train_log$sale_log
KPSS Level = 0.6095, Truncation lag parameter = 7, p-value = 0.02177
[1] "'\n      "

    Augmented Dickey-Fuller Test

data:  sales$daily_Sales
Dickey-Fuller = -5.8801, Lag order = 10, p-value = 0.01
alternative hypothesis: stationary

    KPSS Test for Level Stationarity

data:  sales$daily_Sales
KPSS Level = 0.41177, Truncation lag parameter = 7, p-value = 0.07208

    Augmented Dickey-Fuller Test

data:  train_box_cox$sale_box_cox
Dickey-Fuller = -5.8801, Lag order = 10, p-value = 0.01
alternative hypothesis: stationary

    KPSS Test for Level Stationarity

data:  train_box_cox$sale_box_cox
KPSS Level = 0.41177, Truncation lag parameter = 7, p-value = 0.07208

Both test indicates that the series is stationary As our series is already stationary, we will move ahead with orginal series.

FINALISING THE MODEL

As our series is now stationary lets try to figure out what type of series it is.

Checking for auto correlation.

The above ACF and PACF plot shows us that there is high auto correlation. The ACF has dampning effect and PACF shows no significant spike after 7. This confirms that this is AR process. The spikes in PACF are significant at level 3 and 7. so we will check for both levels.


Call:
arima(x = sales$daily_Sales, order = c(7, 0, 0))

Coefficients:
         ar1     ar2     ar3     ar4      ar5     ar6     ar7  intercept
      0.4185  0.0377  0.0834  0.0072  -0.0030  0.0735  0.1731    32.8466
s.e.  0.0306  0.0333  0.0333  0.0334   0.0332  0.0332  0.0306     2.6326

sigma^2 estimated as 321.5:  log likelihood = -4452.28,  aic = 8922.55

Training set error measures:
                       ME     RMSE      MAE       MPE     MAPE     MASE
Training set -0.008477461 17.93103 8.288944 -11.71179 24.19535 0.886353
                    ACF1
Training set 0.001486589



    Ljung-Box test

data:  Residuals from ARIMA(7,0,0) with non-zero mean
Q* = 3.0945, df = 3, p-value = 0.3773

Model df: 8.   Total lags used: 11

The residual are looking Good. But there seems to be litle seaonal factor left according to residual graphs. But we seems to be close to the best model. Here we will try different models near to (3,0,0) and comapre there AIC’s.


Call:
arima(x = sales$daily_Sales, order = c(3, 0, 0))

Coefficients:
         ar1     ar2     ar3  intercept
      0.4687  0.0540  0.1358    32.8368
s.e.  0.0308  0.0341  0.0308     1.6808

sigma^2 estimated as 342.4:  log likelihood = -4484.62,  aic = 8979.24

Training set error measures:
                     ME     RMSE      MAE       MPE     MAPE      MASE
Training set 0.01313103 18.50431 9.040962 -13.37059 27.32408 0.9667678
                     ACF1
Training set -0.008090044


    Ljung-Box test

data:  Residuals from ARIMA(3,0,0) with non-zero mean
Q* = 65.071, df = 6, p-value = 4.171e-12

Model df: 4.   Total lags used: 10

If we compare the two, we see that there are still some significant spike left at lag 7 in ARIMA(3,0,0). There are no significant spikes for model ARIMA(7,0,0). The error distribution looks normal expect some large error at last. That might be the error becuase of some high seasonal value. Overall the ARIMA(7,0,0) looks good.


To further check if any other close model is better than ARIMA(7,0,0). We will compare the AIC’s of them.

                                             df      AIC
arima(sales$daily_Sales, order = c(7, 0, 0))  9 8922.551
arima(sales$daily_Sales, order = c(8, 0, 0)) 10 8924.506
arima(sales$daily_Sales, order = c(4, 0, 0))  6 8977.230
arima(sales$daily_Sales, order = c(5, 0, 0))  7 8973.788
arima(sales$daily_Sales, order = c(6, 0, 0))  8 8952.044
arima(sales$daily_Sales, order = c(7, 0, 1)) 10 8924.527
arima(sales$daily_Sales, order = c(6, 0, 1))  9 8937.107
arima(sales$daily_Sales, order = c(5, 0, 1))  8 8949.164
arima(sales$daily_Sales, order = c(9, 0, 0)) 11 8921.953

The model looks better comapared to others. But we got some low values when we added MA component in the model. Also when we look at the residual plot, we could see there some spikes at regular intervals. This might be taken care if we add MA component. Lets try different values and compare the AIC’s again.

                                             df      AIC
arima(sales$daily_Sales, order = c(7, 0, 2)) 11 8922.915
arima(sales$daily_Sales, order = c(7, 0, 3)) 12 8916.015
arima(sales$daily_Sales, order = c(7, 0, 4)) 13 8916.173
arima(sales$daily_Sales, order = c(6, 0, 2)) 10 8912.419
arima(sales$daily_Sales, order = c(6, 0, 3)) 11 8904.982
arima(sales$daily_Sales, order = c(6, 0, 4)) 12 8908.598
arima(sales$daily_Sales, order = c(5, 0, 2))  9 8919.983
arima(sales$daily_Sales, order = c(5, 0, 3)) 10 8916.787
arima(sales$daily_Sales, order = c(8, 0, 4)) 14 8918.459
arima(sales$daily_Sales, order = c(8, 0, 2)) 12 8924.735
arima(sales$daily_Sales, order = c(8, 0, 3)) 13 8926.730
                                             df      AIC
arima(sales$daily_Sales, order = c(3, 0, 0))  5 8979.244
arima(sales$daily_Sales, order = c(3, 0, 1))  6 8950.477
arima(sales$daily_Sales, order = c(3, 0, 2))  7 8949.107
arima(sales$daily_Sales, order = c(3, 0, 3))  8 8954.167
arima(sales$daily_Sales, order = c(4, 0, 1))  7 8952.477
arima(sales$daily_Sales, order = c(4, 0, 2))  8 8954.092
arima(sales$daily_Sales, order = c(4, 0, 3))  9 8914.693
arima(sales$daily_Sales, order = c(5, 0, 1))  8 8949.164
arima(sales$daily_Sales, order = c(5, 0, 2))  9 8919.983
arima(sales$daily_Sales, order = c(5, 0, 3)) 10 8916.787
arima(sales$daily_Sales, order = c(2, 0, 1))  5 8949.963

We could see that we got the lowest AIC for ARIMA(6,0,4). Lets try to check its residual graphs.


    Ljung-Box test

data:  Residuals from ARIMA(6,0,3) with non-zero mean
Q* = 15.419, df = 3, p-value = 0.001492

Model df: 10.   Total lags used: 13

The AIC is reduced compare to ARIMA(7,0,0), but residual has still some spikes left. Keeping the lag at 7 in mind, we will try MA component for lag 7.


    Ljung-Box test

data:  Residuals from ARIMA(7,0,3) with non-zero mean
Q* = 19.952, df = 3, p-value = 0.0001737

Model df: 11.   Total lags used: 14

We are still left with the spike. We will try the next best AIC that is ARIMA (7,0,2).


    Ljung-Box test

data:  Residuals from ARIMA(7,0,2) with non-zero mean
Q* = 1.9906, df = 3, p-value = 0.5744

Model df: 10.   Total lags used: 13

The model chosen at start ARIMA(7,0,0) is gave closely similar results to ARIMA(7,0,2). Lets try to compare them.


Call:
arima(x = sales$daily_Sales, order = c(7, 0, 0))

Coefficients:
         ar1     ar2     ar3     ar4      ar5     ar6     ar7  intercept
      0.4185  0.0377  0.0834  0.0072  -0.0030  0.0735  0.1731    32.8466
s.e.  0.0306  0.0333  0.0333  0.0334   0.0332  0.0332  0.0306     2.6326

sigma^2 estimated as 321.5:  log likelihood = -4452.28,  aic = 8922.55

Training set error measures:
                       ME     RMSE      MAE       MPE     MAPE     MASE
Training set -0.008477461 17.93103 8.288944 -11.71179 24.19535 0.886353
                    ACF1
Training set 0.001486589

Call:
arima(x = sales$daily_Sales, order = c(7, 0, 2))

Coefficients:
         ar1      ar2     ar3     ar4     ar5     ar6     ar7      ma1     ma2
      0.6117  -0.2929  0.1873  0.0006  0.0174  0.0797  0.1640  -0.1936  0.2589
s.e.  0.2323   0.2007  0.0606  0.0412  0.0388  0.0383  0.0502   0.2352  0.1229
      intercept
        32.8431
s.e.     2.5282

sigma^2 estimated as 320.4:  log likelihood = -4450.46,  aic = 8922.91

Training set error measures:
                      ME     RMSE      MAE       MPE     MAPE      MASE
Training set -0.00752713 17.89927 8.219159 -11.61345 23.95606 0.8788907
                     ACF1
Training set 0.0009992537

There is not much of difference in the models, thus we will go ahead with a simpler model .i.e ARIMA(7,0,0)

We havent checked the ARIMA(3,0,0) counter parts,as we have selected ARIMA(7,0,0) now, we have options of ARIMA(3,0,0) counterparts with lower AIC’s. Lets check there residuals


    Ljung-Box test

data:  Residuals from ARIMA(4,0,3) with non-zero mean
Q* = 34.42, df = 3, p-value = 1.616e-07

Model df: 8.   Total lags used: 11


    Ljung-Box test

data:  Residuals from ARIMA(5,0,3) with non-zero mean
Q* = 34.14, df = 3, p-value = 1.851e-07

Model df: 9.   Total lags used: 12

The spike at 7 is significant in all the models except ARIMA(7,0,0). Thus we will select ARIMA(7,0,0) as our best model. This could be the result of weekly seasonality present in the data set.



From both AIC and BIC, the best model is ARIMA(7,0,0)



    Ljung-Box test

data:  Residuals from ARIMA(7,0,0) with non-zero mean
Q* = 3.0945, df = 3, p-value = 0.3773

Model df: 8.   Total lags used: 11

AUTO ARIMA SUGGESTIONS.

Fitting the model

Series: sales$daily_Sales 
ARIMA(1,0,2) with non-zero mean 

Coefficients:
         ar1     ma1      ma2     mean
      0.9424  -0.515  -0.2076  32.8992
s.e.  0.0173   0.036   0.0333   2.6977

sigma^2 = 333.7:  log likelihood = -4469.45
AIC=8948.91   AICc=8948.96   BIC=8973.61

Training set error measures:
                      ME     RMSE      MAE       MPE     MAPE      MASE
Training set -0.01805955 18.23338 8.830803 -12.60878 26.23437 0.9442951
                    ACF1
Training set 0.004409511


We are getting different model ARIMA(1,0,2) than our best estimation from auto.arima

Checking the residuals.

Performing Ljung Box test for the both best model and auto model at different lags, the Ljung-Box test has hypothesis that the residuals are independent. Thus the P value less than 0.05 indicates that there exist a correlation in the residuals.


    Box-Ljung test

data:  best_fit$residuals
X-squared = 0.0022917, df = 1, p-value = 0.9618

    Box-Ljung test

data:  auto_fit$residuals
X-squared = 0.020163, df = 1, p-value = 0.8871



    Box-Ljung test

data:  best_fit$residuals
X-squared = 0.35142, df = 3, p-value = 0.9501

    Box-Ljung test

data:  auto_fit$residuals
X-squared = 0.36797, df = 3, p-value = 0.9468


    Box-Ljung test

data:  best_fit$residuals
X-squared = 0.48931, df = 7, p-value = 0.9995

    Box-Ljung test

data:  auto_fit$residuals
X-squared = 29.999, df = 7, p-value = 9.501e-05

    Box-Ljung test

data:  best_fit$residuals
X-squared = 2.8374, df = 10, p-value = 0.985

    Box-Ljung test

data:  auto_fit$residuals
X-squared = 37.105, df = 10, p-value = 5.427e-05

The auto arima model fails the Ljung-Box test at lag seven,thus our best model will be ARIMA(7,0,0)

IN SAMPLE RMSE


[1] 17.93103
[1] 18.23338


We now will calculate MAE for fitted values

[1] 8.288944
[1] 8.830803

The MAE is less for the model we choose




   pred.pred        pred.se           date                    
 Min.   :26.55   Min.   :17.93   Min.   :2015-10-31 12:00:00  
 1st Qu.:29.83   1st Qu.:21.07   1st Qu.:2015-11-07 18:00:00  
 Median :30.90   Median :22.03   Median :2015-11-15 00:00:00  
 Mean   :30.39   Mean   :21.53   Mean   :2015-11-15 00:00:00  
 3rd Qu.:31.64   3rd Qu.:22.35   3rd Qu.:2015-11-22 06:00:00  
 Max.   :32.11   Max.   :22.46   Max.   :2015-11-29 12:00:00  

USING FB PROPHET

Starting with PROPHET

CHANGEPOINTS

Plotting Changepoint

Identifying change points for trend in the series.The single changepoint detected is in January 2015, where we can see that the next trend is bit downward side.

[1] 748

Trying to check for multiple changing point with mean values.

The multiple change points can be seen in the plot.

OUTLIERS

Box Plot for detecting outliers

The outliers according to the boxplot above are highlighted below. One high value above 300 looks abnormal, but it lies in peak seasonal period, thus cannot be ruled out. Keepinng as it is.

Plotting all the outliers from the boxplot in the series.



cleaned_data = tsclean(sales$daily_Sales)
clean_full_y = sales %>% 
  mutate(
    row_number = row_number(),
    anomaly = if_else(row_number %in% outliers$index,TRUE,as.logical(NA))
    #cleaned_data = cleaned_data
  ) %>% 
  ggplot()+
  geom_line(aes(date,cleaned_data))+
  geom_point(aes(date,cleaned_data,color=anomaly))+
  theme_bw()+
  ylim(0,400)+
  guides(color='none')+
  labs(title = 'Outliers Replaced') + 
  scale_colour_discrete(na.translate = F)

plot = clean_full_y

clean_full_y

FITTING THE PROPHET MODEL

The model is fitted on train data till date 2015-07-01 and forecast is done fo period of 120 days.

Model automatically disable daily seasonality as it detects there could be no daily seasonality present.

We try check if there is any difference by turning on the daily seasonality or not. #### Ckeck by Turning on the daily seasonality

.

Not much of a difference here. Thus model correctly disabled the daily seasonality

Lets see our original series for comparison


Interactive chart to check for perticular period. (Use scroller at bottom)

CHECKING FOR COMPONENTS OF TIME SERIES

Plot shows the different components which includes, trend, weekly seasonality and yearly seasonality.

Checking changepoint with phophet model

Plot shows that two three equidistant change points. The model includes one change point found using boxplot.

COMPARING FORECAST

Comparing forecast with actual observed data. The model fits well but has greater variance at start.

Trying to forecast it for 250 days

.

The prophet is considering seasonality correctly, We dont need to take account for daily seasonality here. We can check for additive and multiplicative seasonality for considration.

SEASONALITY

Additive Seasonality

Plotting additive components.



Multiplicative seasonality

Multiplicative components

There is not much of difference between both.

Checking Holiday Impact

Leveraging the holiday list available in the prophet model,We could see below which holidays has greater Impact.

Checking the amount of impact of two main holidays. New year and Christmas

          ds Christmas Day New Year's Day
1 2014-12-25      -18.6936        0.00000
2 2015-01-01        0.0000      -57.80842

Holiday Impact by each holiday in the US calender.

MODEL ‘FIT’ & CROSS VALIDATION

Checking In sample performance by Calculating the MSE,MAE and MAPE of the model.

The error values are not large, the fit is good. Generally, Prophet works good on daily data and with seasonality.

[1] "RMSE: 10.91"
[1] "MAE: 8.67"
[1] "MAPE: 0.38"

Cross Validation

Cross- Validation for the window of 30 days.

Cut-off points for the cross validation

[1] "2014-12-02 GMT" "2015-01-01 GMT" "2015-01-31 GMT" "2015-03-02 GMT"
[5] "2015-04-01 GMT" "2015-05-01 GMT" "2015-05-31 GMT"

Plotting the comparison of cross validation and actual for different cut-off points.

Conclusion

This is our Final part. Lets conclude our best model after comparing in sample and out of sample performance. Here we have our best model as ARIMA(7,0,0) which we have chosen by trial and error. The auto model is model suggested by auto.arima and last is model by using prophet library.

Cross validation plot for PROPHET

CROSS VALIDATION GRAPH FOR ARIMA MODELS

COMPARING IN SAMPLE RMSE AND MAE

       Model In.Sample.RMSE In.Sample.MAE
1      ARIMA         17.931         8.289
2 AUTO ARIMA         18.233         8.831
3    PROPHET         10.911         8.666

LOOKING ALL THE PLOTS AND VALUES ABOVE, WE CONCLUDE THAT PROPHET HAS OUTPERFORMED ALL THE MODELS. THE PROPHET HAS PROVED ITS PREDICTIVE POWER FOR SEASONAL DATA SETS.

THUS OUR WINNER IS ‘PROPHET’ MODEL.

---
title: "Forecasating the sales data"
output: 
  flexdashboard::flex_dashboard:
             social: menu
             source: embed
             vertical_layout: fill
             orientation: columns
             theme: cerulean
             
---


```{r,echo=FALSE,include=FALSE}
rm(list = ls())

#install.packages("tidyverse")
#install.packages("datatable")

library(dplyr)
library(tidyverse)
library(lubridate)
library(forecast) #boxcox
library(tseries)
library(vtable)
library(flexdashboard) #Presentation quality
library(data.table)



```


INTRODUCTION
=======================================================================

## ABOUT DATA {.tabset}



We are trying to forecast a time series data of a retail company. The data is about sales of the thousands of products of different families sold at stores.The training data includes dates, store and product information, whether that item was being promoted, as well as the sales numbers. The data is publically available on Kaggle website. The dataset have sale details for various products on day to day basis. The data has various aspects which can be explored going ahead. For now, we will be using single time series of total daily sales across all the products and try to forecast the sales with different approaches. I find this tak challenging as it involves various factors such as seasonality,holiday events along with forecasting sales. ### Data summary #### File Descriptions and Data Field Information sales_train.csv
The training data, comprising time series of features store_nbr, family, and on promotion as well as the target sales. store_nbr identifies the store at which the products are sold. family identifies the type of product sold. on promotion gives the total number of items in a product family that were being promoted at a store at a given date. ```{r,echo=FALSE,include=FALSE} sales <- readr::read_csv('D:\\VAIBHAV\\HOMEWORK\\Time Series\\ASSIGNMENT\\sales_train.csv') ``` #### Structure of Data
```{r,echo=FALSE} #str(sales) sales$date<- dmy(sales$date) str(sales) ``` ```{r,echo=FALSE} summary(sales) ``` ### BASIC DATA QUALITY CHECKS #### Null value check
```{r,echo=FALSE} attach(sales) print("Unique block numbers are the months in the data") unique(date_block_num) print("Number of NA values in columns are") lapply(sales,function(x) { length(which(is.na(sales)))}) ``` #### Summary ```{r,echo=FALSE} summary(sales) ```
We could see in summry that there is some inconsistancy in price and item count per day column. We have some negative values which does not make sense. Lets see in detail. ```{r,echo=FALSE} print('Number of negative values in item_cnt_day') sum(item_cnt_day<0) print("Number of negative values in item_price") sum(item_price<0) ``` #### Boxplot for both columns ```{r,echo=FALSE} boxplot(item_price,main = " Item Price") boxplot(item_cnt_day, main = "Item count per day") ``` There is value in item price which is too far from other values, as it is only one value there is high chance of error, so we replace that by median and plot new boxplot. ```{r,echo=FALSE} item_price[item_price < 0] <- 0 item_cnt_day[item_cnt_day < 0] <- 0 item_price[item_price > 307970] <- median(item_price) ```
#### Updated box Plot ```{r,echo=FALSE} boxplot(item_price,main="Item Price") boxplot(item_cnt_day,main="Item count per day") ```
#### Summary statistics ```{r,echo=FALSE} vtable(sales,lush = TRUE) ``` Creating our time series to track total daily sales. Keeping only two columns hereafter. ```{r,echo=FALSE} sales <- sales %>% group_by(date) %>% mutate(daily_Sales= sum(item_price*item_cnt_day)/100000) %>% arrange(date) %>% ungroup() sales <- subset(sales,select = c(date,daily_Sales)) sales <- sales %>% group_by(date) %>% filter(!duplicated(date)) ``` ```{r,echo=FALSE} summary(sales) ``` #### Daily total sales plot ```{r,echo=FALSE} sales %>% ggplot()+ geom_line(aes(date,daily_Sales))+ theme_bw()+ xlab("Date")+ ylab("TOTAL DAILY SALES") ``` Looking at plot, the series looks quite stationary, but have seasonal factors involved. The trend and variance seems to be same overall series. #### Cummulative sales per day. ```{r,echo=FALSE} sales %>% ggplot(aes(colour="Blue"))+ geom_line(aes(sales$date,cumsum(sales$daily_Sales)),size=1.5,show.legend = FALSE)+ theme_bw()+ ggtitle("TOTAL CUMMULATIVE SALES BY DATE")+ ylab("Cummulative Sales")+ xlab("DATE") #?GeomLine ```


APPROACHING TO MODELLING ======================================================================= ## Understanding the series {.tabset} ### UNDERSTANDING THE SERIES ```{r,echo=FALSE,include=FALSE} sales %>% ggplot()+ geom_line(aes(sales$date,sales$daily_Sales))+ theme_bw()+ #geom_smooth(aes(sales$date,sales$daily_Sales),method = "lm")+ ggtitle("TOATAL DAILY SALES")+ xlab("Date")+ ylab("Total daily sales") ``` The series appeared to be bit stationary before in terms of both variance and mean. Plotting again with trend line.The trend is bit constant here. #### Lets see if we have any auto-correlation in the series to move ahead #### ACF & PACF GRAPHS ```{r,echo=FALSE,message=FALSE,warning=FALSE} attach(sales) par(mfrow=c(1,2)) acf(sales$daily_Sales,lag.max = 40) Pacf(sales$daily_Sales, lag.max = 40) ```
From above plots we could see that there is strong autocorrelation in the data #### Checking wheather the given series is stationary or not ```{r,echo=FALSE,warning=FALSE} adf.test(sales$daily_Sales) kpss.test(sales$daily_Sales) ``` The ADF test has Null hypothesis that given series is not stationary and KPSS test has Null Hypothesis as the given series is stationary. Thus both test suggest that given series is already stationary. We will try to check how much difference it makes if we transformed the given series. #### Lets check the histogram of our sales ```{r,echo=FALSE} hist(sales$daily_Sales,main="Daily sales",xlab = "DAILY SALES", ylab = "Frequency") ``` The distribution looks right skewed. #### We will log transform the series and plot the histogram. ```{r,echo=FALSE} sales_1 <- sales train_log = sales_1 %>% mutate(sale_log = log1p(daily_Sales)) train_log %>% ggplot()+ geom_line(aes(date,sale_log))+ theme_bw()+ ggtitle("SALES over Time (Log)")+ ylab("DAILY SALE (Log)")+ xlab("Date") ``` Now the distribution of the sales is close to normal. But Remember, we dont look for the ditribution of Y variable to be normal, that is not our concern. So we just leave it here. #### Lets try box-cox transformation and check ```{r,echo=FALSE} train_box_cox <- sales_1 %>% mutate(sale_box_cox = forecast::BoxCox(daily_Sales,lambda="auto")) train_box_cox %>% ggplot()+ geom_line(aes(date,sale_box_cox))+ theme_bw()+ ggtitle("SALES PER DAY-BOX-COX TRANSFORMED")+ ylab("TOTAL SALES") ``` There seem to be no difference in original and transformed series here, we have already checked that our series is stationary. Box-Cox performed well here by not doing anything. #### Comparing histogram of all three ```{r,echo=FALSE} hist(sales$daily_Sales,main="Daily sales",xlab = "DAILY SALES", ylab = "Frequency") hist(train_log$sale_log,main="Daily sales Log transformed",xlab = "DAILY SALES", ylab = "Frequency") hist(train_box_cox$sale_box_cox,main="Daily sales box-cox transformed",xlab = "DAILY SALES", ylab = "Frequency") ``` #### Performing stationarity test on all. ```{r,echo=FALSE} adf.test(train_log$sale_log) kpss.test(train_log$sale_log) print("' ") adf.test(sales$daily_Sales) kpss.test(sales$daily_Sales) adf.test(train_box_cox$sale_box_cox) kpss.test(train_box_cox$sale_box_cox) ``` Both test indicates that the series is stationary As our series is already stationary, we will move ahead with orginal series. ### FINALISING THE MODEL #### As our series is now stationary lets try to figure out what type of series it is. Checking for auto correlation. ```{r,echo=FALSE} par(mfrow=c(1,2)) acf(sales$daily_Sales,main="Daily sales ACF") pacf(sales$daily_Sales,main="Daily sales PACF") ``` The above ACF and PACF plot shows us that there is high auto correlation. The ACF has dampning effect and PACF shows no significant spike after 7. This confirms that this is AR process. The spikes in PACF are significant at level 3 and 7. so we will check for both levels. ```{r,echo=FALSE} model_1 = arima(sales$daily_Sales,order = c(7,0,0)) summary(model_1) ```
```{r,echo=FALSE} forecast::checkresiduals(model_1) ``` The residual are looking Good. But there seems to be litle seaonal factor left according to residual graphs. But we seems to be close to the best model. Here we will try different models near to (3,0,0) and comapre there AIC's. ```{r,echo=FALSE} model_2 = arima(sales$daily_Sales,order = c(3,0,0)) summary(model_2) ``` ```{r,echo=FALSE} forecast::checkresiduals(model_2) ``` If we compare the two, we see that there are still some significant spike left at lag 7 in ARIMA(3,0,0). There are no significant spikes for model ARIMA(7,0,0). The error distribution looks normal expect some large error at last. That might be the error becuase of some high seasonal value. Overall the ARIMA(7,0,0) looks good.
#### To further check if any other close model is better than ARIMA(7,0,0). We will compare the AIC's of them. ```{r,echo=FALSE} AIC( arima(sales$daily_Sales,order=c(7,0,0)), arima(sales$daily_Sales,order=c(8,0,0)), arima(sales$daily_Sales,order=c(4,0,0)), arima(sales$daily_Sales,order=c(5,0,0)), arima(sales$daily_Sales,order=c(6,0,0)), arima(sales$daily_Sales,order=c(7,0,1)), arima(sales$daily_Sales,order=c(6,0,1)), arima(sales$daily_Sales,order=c(5,0,1)), arima(sales$daily_Sales,order=c(9,0,0)) ) ``` The model looks better comapared to others. But we got some low values when we added MA component in the model. Also when we look at the residual plot, we could see there some spikes at regular intervals. This might be taken care if we add MA component. Lets try different values and compare the AIC's again. ```{r,echo=FALSE} AIC( arima(sales$daily_Sales,order=c(7,0,2)), arima(sales$daily_Sales,order=c(7,0,3)), arima(sales$daily_Sales,order=c(7,0,4)), arima(sales$daily_Sales,order=c(6,0,2)), arima(sales$daily_Sales,order=c(6,0,3)), arima(sales$daily_Sales,order=c(6,0,4)), arima(sales$daily_Sales,order=c(5,0,2)), arima(sales$daily_Sales,order=c(5,0,3)), arima(sales$daily_Sales,order=c(8,0,4)), arima(sales$daily_Sales,order=c(8,0,2)), arima(sales$daily_Sales,order=c(8,0,3)) ) ``` ```{r,echo=FALSE} AIC( arima(sales$daily_Sales,order=c(3,0,0)), arima(sales$daily_Sales,order=c(3,0,1)), arima(sales$daily_Sales,order=c(3,0,2)), arima(sales$daily_Sales,order=c(3,0,3)), arima(sales$daily_Sales,order=c(4,0,1)), arima(sales$daily_Sales,order=c(4,0,2)), arima(sales$daily_Sales,order=c(4,0,3)), arima(sales$daily_Sales,order=c(5,0,1)), arima(sales$daily_Sales,order=c(5,0,2)), arima(sales$daily_Sales,order=c(5,0,3)), arima(sales$daily_Sales,order=c(2,0,1)) ) ``` We could see that we got the lowest AIC for ARIMA(6,0,4). Lets try to check its residual graphs. ```{r,echo=FALSE} model_7 <- arima(sales$daily_Sales,order = c(6,0,3)) forecast::checkresiduals(model_7) ``` The AIC is reduced compare to ARIMA(7,0,0), but residual has still some spikes left. Keeping the lag at 7 in mind, we will try MA component for lag 7. ```{r,echo=FALSE} model_3 <- arima(sales$daily_Sales,order = c(7,0,3)) forecast::checkresiduals(model_3) ``` We are still left with the spike. We will try the next best AIC that is ARIMA (7,0,2). ```{r,echo=FALSE} model_4 <- arima(sales$daily_Sales,order = c(7,0,2)) forecast::checkresiduals(model_4) ``` The model chosen at start ARIMA(7,0,0) is gave closely similar results to ARIMA(7,0,2). Lets try to compare them. ```{r,echo=FALSE} summary(model_1) summary(model_4) ``` There is not much of difference in the models, thus we will go ahead with a simpler model .i.e ARIMA(7,0,0) We havent checked the ARIMA(3,0,0) counter parts,as we have selected ARIMA(7,0,0) now, we have options of ARIMA(3,0,0) counterparts with lower AIC's. Lets check there residuals ```{r,echo=FALSE} model_5 <- arima(sales$daily_Sales,order = c(4,0,3)) forecast::checkresiduals(model_5) model_6 <- arima(sales$daily_Sales,order = c(5,0,3)) forecast::checkresiduals(model_6) ``` The spike at 7 is significant in all the models except ARIMA(7,0,0). Thus we will select ARIMA(7,0,0) as our best model. This could be the result of weekly seasonality present in the data set.

#### From both AIC and BIC, the best model is ARIMA(7,0,0)
```{r,echo=FALSE} best_fit <- arima(sales$daily_Sales,order = c(7,0,0)) forecast::checkresiduals(best_fit) ``` ### AUTO ARIMA SUGGESTIONS. #### Fitting the model ```{r,echo=FALSE} auto_fit <- auto.arima(sales$daily_Sales,seasonal= TRUE,stepwise=FALSE,approximation=FALSE,max.p = 15,max.q = 15) summary(auto_fit) ```
We are getting different model ARIMA(1,0,2) than our best estimation from auto.arima #### Checking the residuals. Performing Ljung Box test for the both best model and auto model at different lags, the Ljung-Box test has hypothesis that the residuals are independent. Thus the P value less than 0.05 indicates that there exist a correlation in the residuals. ```{r,echo=FALSE} Box.test(best_fit$residuals,type = "Ljung-Box",lag=1) Box.test(auto_fit$residuals,type = "Ljung-Box",lag=1) ```
```{r,echo=FALSE} Box.test(best_fit$residuals,type = "Ljung-Box",lag=3) Box.test(auto_fit$residuals,type = "Ljung-Box",lag=3) ```
```{r,echo=FALSE} Box.test(best_fit$residuals,type = "Ljung-Box",lag=7) Box.test(auto_fit$residuals,type = "Ljung-Box",lag=7) ``` ```{r,echo=FALSE} Box.test(best_fit$residuals,type = "Ljung-Box",lag=10) Box.test(auto_fit$residuals,type = "Ljung-Box",lag=10) ``` #### The auto arima model fails the Ljung-Box test at lag seven,thus our best model will be ARIMA(7,0,0) ### IN SAMPLE RMSE ```{r,echo=FALSE} resid_1 = best_fit$residuals resid_2 = auto_fit$residuals pred_1 = sales$daily_Sales-resid_1 pred_2 = sales$daily_Sales-resid_2 ggplot()+ geom_line(aes(sales$daily_Sales,pred_1),color='red',alpha=0.7)+ geom_line(aes(sales$daily_Sales,pred_2),color='blue',alpha=0.7)+ ggtitle("BEST MODEL VS AUTO ARIMA")+ xlab("Daily sales")+ ylab("Prediction") # geom_line(aes(train_main_1$Week,train_main_1$week_bx)) ```
```{r,echo=FALSE} RMSE_best = sqrt(mean((pred_1 - sales$daily_Sales)^2)) RMSE_best RMSE_auto = sqrt(mean((pred_2 - sales$daily_Sales)^2)) RMSE_auto ```
#### We now will calculate MAE for fitted values ```{r,echo=FALSE} MAE_best = mean(abs(pred_1 - sales$daily_Sales)) MAE_best MAE_auto = mean(abs(pred_2 - sales$daily_Sales)) MAE_auto ``` #### The MAE is less for the model we choose
```{r,echo=FALSE} best_fit %>% forecast(h=30) %>% autoplot() ```
```{r,echo=FALSE} auto_fit %>% forecast(h=30) %>% autoplot() ```
```{r, echo=FALSE,warning=FALSE} #length(sales$date) prediction = predict(best_fit,n.ahead=30) pred_data = data.frame( pred = prediction, date = seq(c(ISOdate(2015,10,31)), by = "day", length.out = 30) #as.Date(ymd(max(sales$date)):ymd(max(sales$date)+days(29)),origin="2013-01-01") ) summary(pred_data) pred_merge = sales %>% full_join(pred_data) %>% filter(date>ymd("2015-09-01")) %>% #& date<=ymd("2015-03-01")) %>% mutate( pred_high = pred.pred+2*pred.se, pred_low = pred.pred - 2*pred.se, pred.pred = pred.pred, error = pred.pred - daily_Sales ) pred_merge %>% ggplot()+ geom_line(aes(date,daily_Sales))+ geom_line(aes(date,pred.pred),color='blue')+ xlab("DATE - YEAR 2015")+ ylab("SALES")+ geom_ribbon(aes(x=date,ymin=pred_low,ymax = pred_high),fill='blue',alpha=0.2) ``` ```{r,echo=FALSE} ``` USING FB PROPHET ======================================================================= ## Starting with PROPHET{.tabset} ### CHANGEPOINTS #### Plotting Changepoint ```{r,echo=FALSE,include=FALSE,warning=FALSE,message=FALSE} dim(sales) library(changepoint) ``` Identifying change points for trend in the series.The single changepoint detected is in January 2015, where we can see that the next trend is bit downward side. ```{r,echo=FALSE} cp = cpt.mean(sales$daily_Sales) sales %>% ggplot()+ geom_line(aes(date,daily_Sales))+ geom_vline(aes(xintercept=sales$date[cpts(cp)]),color='red')+ theme_bw()+ xlab("DATE")+ ylab("DAILY SALES") cpts(cp) ``` Trying to check for multiple changing point with mean values. The multiple change points can be seen in the plot. ```{r,echo=FALSE,warning=FALSE,message=FALSE} mult_cp=cpt.mean(sales$daily_Sales,method='BinSeg',Q=10) plot(mult_cp) ``` ### OUTLIERS #### Box Plot for detecting outliers ```{r,echo=FALSE} box=boxplot(sales$daily_Sales,main="DAILY SALES") ``` The outliers according to the boxplot above are highlighted below. One high value above 300 looks abnormal, but it lies in peak seasonal period, thus cannot be ruled out. Keepinng as it is. #### Plotting all the outliers from the boxplot in the series. ```{r,echo=FALSE,warning=FALSE,message=FALSE} sales %>% mutate( anomaly = if_else(daily_Sales %in% box$out,TRUE,as.logical(NA)) ) %>% ggplot()+ geom_line(aes(date,daily_Sales))+ geom_point(aes(date,daily_Sales,color=anomaly))+ theme_bw()+ ylim(0,400)+ ggtitle("Outliers According to Boxplot") + ylab("DAILY SALES")+ scale_colour_discrete(na.translate = F) ```
```{#r,echo=FALSE,eval=FALSE,include=FALSE} cleaned_data = tsclean(sales$daily_Sales) clean_full_y = sales %>% mutate( row_number = row_number(), anomaly = if_else(row_number %in% outliers$index,TRUE,as.logical(NA)) #cleaned_data = cleaned_data ) %>% ggplot()+ geom_line(aes(date,cleaned_data))+ geom_point(aes(date,cleaned_data,color=anomaly))+ theme_bw()+ ylim(0,400)+ guides(color='none')+ labs(title = 'Outliers Replaced') + scale_colour_discrete(na.translate = F) plot = clean_full_y clean_full_y ``` ```{r} ``` ### FITTING THE PROPHET MODEL #### The model is fitted on train data till date 2015-07-01 and forecast is done fo period of 120 days. ```{r,echo=FALSE,include=FALSE,warning=FALSE} library(prophet) prophet_sales = sales %>% rename(ds = date, # Have to name our date variable "ds" y = daily_Sales) # Have to name our time series "y" train = prophet_sales %>% filter(ds% filter(ds>=ymd("2015-07-01")) model = prophet(train) future = make_future_dataframe(model,periods = 120) forecast = predict(model,future) ``` Model automatically disable daily seasonality as it detects there could be no daily seasonality present. ```{r,echo=FALSE} plot(model,forecast)+ ylab("TOTAL DAILY SALES")+xlab("Date")+theme_bw() ``` ```{r} ``` We try check if there is any difference by turning on the daily seasonality or not. #### Ckeck by Turning on the daily seasonality ```{r,echo=FALSE,include=FALSE,warning=FALSE} model_1 = prophet(train,daily.seasonality = TRUE) future_1 = make_future_dataframe(model,periods = 120) forecast_1 = predict(model,future) ``` .
```{r,echo=FALSE,warning=FALSE,message=FALSE} plot(model_1,forecast)+ ylab("TOTAL DAILY SALES")+xlab("Date")+theme_bw() ``` Not much of a difference here. Thus model correctly disabled the daily seasonality #### Lets see our original series for comparison ```{r,echo=FALSE,warning=FALSE,message=FALSE} ggplot()+ geom_point(aes(date,daily_Sales))+ theme_bw() ```
```{r} ``` #### Interactive chart to check for perticular period. (Use scroller at bottom) ```{r,echo=FALSE,warning=FALSE,message=FALSE} dyplot.prophet(model,forecast) ``` ### CHECKING FOR COMPONENTS OF TIME SERIES #### Plot shows the different components which includes, trend, weekly seasonality and yearly seasonality. ```{r,echo=FALSE} prophet_plot_components(model,forecast) ``` #### Checking changepoint with phophet model #### Plot shows that two three equidistant change points. The model includes one change point found using boxplot. ```{r,echo=FALSE,include=FALSE} plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Daily Total Sales") ``` ```{r,echo=FALSE,warning=FALSE,message=FALSE} model = prophet(train,n.changepoints=5) forecast_2 = predict(model,future) plot(model,forecast_2)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Daily Sales") ``` #### COMPARING FORECAST #### Comparing forecast with actual observed data. The model fits well but has greater variance at start. ```{r,echo=FALSE,warning=FALSE} forecast_plot_data = forecast %>% as_tibble() %>% mutate(ds = as.Date(ds)) %>% filter(ds>=ymd("2015-07-01")) forecast_not_scaled = ggplot()+ geom_line(aes(test$ds,test$y))+ geom_line(aes(forecast_plot_data$ds,forecast_plot_data$yhat),color='red')+ xlab("date")+ ylab("Total sales") forecast_scaled = forecast_not_scaled forecast_scaled ``` #### Trying to forecast it for 250 days ```{r,echo=FALSE,warning=FALSE} two_yr_future = make_future_dataframe(model,periods = 250) two_yr_forecast = predict(model,two_yr_future) plot(model,two_yr_forecast)+theme_bw()+xlab("Date")+ylab("Total Sales") ``` ```{r,eval=FALSE,include=FALSE,echo=FALSE} train$floor = 0 train$cap = 25 future$floor = 0 future$cap = 25 # Set floor in forecsat data two_yr_future$floor = 0 two_yr_future$cap = 25 #sat_model = prophet(train,growth='logistic',daily.seasonality = T) sat_two_yr_forecast = predict(sat_model,two_yr_future) ``` . ```{r} ``` ```{r,eval=FALSE,include=FALSE,echo=FALSE} plot(sat_model,sat_two_yr_forecast)+ylim(0,25)+ theme_bw()+xlab("Date")+ylab("Daily Sales") ``` ```{r,echo=FALSE,warning=FALSE,eval=FALSE} prophet_plot_components(model,forecast) ``` The prophet is considering seasonality correctly, We dont need to take account for daily seasonality here. We can check for additive and multiplicative seasonality for considration. ### SEASONALITY #### Additive Seasonality ```{r,echo=FALSE,warning=FALSE} additive = prophet(train) add_fcst = predict(additive,future) plot(additive,add_fcst) #summary(add_fcst) #summary(train) ``` #### Plotting additive components. ```{r,echo=FALSE,warning=FALSE} prophet_plot_components(additive,add_fcst) ```

#### Multiplicative seasonality ```{r,echo=FALSE} multi = prophet(train,seasonality.mode = 'multiplicative') multi_fcst = predict(multi,future) plot(multi,multi_fcst) ``` #### Multiplicative components ```{r,echo=FALSE} prophet_plot_components(multi, multi_fcst) ``` #### There is not much of difference between both. #### Checking Holiday Impact #### Leveraging the holiday list available in the prophet model,We could see below which holidays has greater Impact. ```{r,echo=FALSE,warning=FALSE,message=FALSE} #?prophet() model = prophet(train,fit=FALSE) model = add_country_holidays(model,country_name = 'US') model = fit.prophet(model,train) forecast = predict(model,future) prophet_plot_components(model,forecast) ``` #### Checking the amount of impact of two main holidays. New year and Christmas ```{r,echo=FALSE,warning=FALSE} forecast %>% filter( holidays!=0, ds == ymd("2014-12-25") | ds == ymd("2015-01-01")) %>% select(ds,`Christmas Day`,`New Year's Day`) ``` #### Holiday Impact by each holiday in the US calender. ```{r,echo=FALSE,warning=FALSE} forecast %>% filter(holidays != 0) %>% dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>% mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>% summarize_if(is.numeric, ~ max(., na.rm = T)) %>% pivot_longer( cols = `Christmas Day`:`Washington's Birthday`, names_to = 'holiday', values_to = 'effect' ) %>% ggplot() + geom_col(aes(effect,holiday))+ theme_bw() ``` ### MODEL 'FIT' & CROSS VALIDATION #### Checking In sample performance by Calculating the MSE,MAE and MAPE of the model. #### The error values are not large, the fit is good. Generally, Prophet works good on daily data and with seasonality. ```{r,echo=FALSE,warning=FALSE} forecast_metric_data = forecast %>% as_tibble() %>% mutate(ds = as.Date(ds)) %>% filter(ds>=ymd("2015-07-01")) RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2)) MAE = mean(abs(test$y - forecast_metric_data$yhat)) MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y)) print(paste("RMSE:",round(RMSE,2))) print(paste("MAE:",round(MAE,2))) print(paste("MAPE:",round(MAPE,2))) ``` #### Cross Validation Cross- Validation for the window of 30 days. ```{r,echo=FALSE,warning=FALSE,message=FALSE,include=FALSE} df.cv <- cross_validation(model, initial = 700, period = 30, horizon = 30, units = 'days') head(df.cv) ``` #### Cut-off points for the cross validation ```{r,echo=FALSE} unique(df.cv$cutoff) ``` #### Plotting the comparison of cross validation and actual for different cut-off points. ```{r,echo=FALSE} df.cv %>% ggplot()+ geom_point(aes(ds,y)) + geom_point(aes(ds,yhat,color=factor(cutoff)))+ theme_bw()+ xlab("Date")+ ylab("Total sales")+ scale_color_discrete(name = 'Cutoff') ``` ### Conclusion #### This is our Final part. Lets conclude our best model after comparing in sample and out of sample performance. Here we have our best model as ARIMA(7,0,0) which we have chosen by trial and error. The auto model is model suggested by auto.arima and last is model by using prophet library. #### Cross validation plot for PROPHET ```{r,echo=FALSE} plot_cross_validation_metric(df.cv, metric = 'rmse') ``` ```{r,echo=FALSE} #dim(sales) out_list = lapply(828:1004,function(x){ train = sales$daily_Sales[1:x] test = sales$daily_Sales[(x+1):(x+29)] train_mod1 = arima(train,order=c(7,0,0)) train_mod2 = arima(train,order=c(1,0,2)) out_data = data.frame( train_thru = x, time_ahead = 1:29, time_point = x + 1:29, actual = test, naive_fcst = train[x], test_pred1 = predict(train_mod1,29)$pred, test_pred2 = predict(train_mod2,29)$pred, error1 = test - predict(train_mod1,29)$pred, error2 = test - predict(train_mod2,29)$pred ) }) out_data = rbindlist(out_list) ``` ```{r,echo=FALSE,include=FALSE} out_data %>% as_tibble() ``` ```{r,echo=FALSE,message=FALSE,warning=FALSE} error_metrics = out_data %>% mutate(naive_error = actual - naive_fcst) %>% pivot_longer( cols = error1:naive_error, values_to = 'error' ) %>% group_by(time_ahead,name) %>% summarize( RMSE = sqrt(mean(error^2)), MAE = mean(abs(error)), MAPE = mean(abs(error / actual)) ) %>% ungroup() ``` #### CROSS VALIDATION GRAPH FOR ARIMA MODELS ```{r,echo=FALSE} error_metrics %>% ggplot()+ geom_line(aes(time_ahead,RMSE,color=name))+ xlab("Horizon (days)") ``` #### COMPARING IN SAMPLE RMSE AND MAE ```{r,echo=FALSE} xyz = data.frame( Model = c("ARIMA","AUTO ARIMA","PROPHET"), `In Sample RMSE` = round(c(RMSE_best,RMSE_auto,RMSE),3), `In Sample MAE` = round(c(MAE_best,MAE_auto,MAE),3) #`Out of Sample RMSE` = round(c(out_rmse1,out_rmse2,out_rmse3,out_rmse9),3) ) #%>% #datatable() xyz ``` #### LOOKING ALL THE PLOTS AND VALUES ABOVE, WE CONCLUDE THAT PROPHET HAS OUTPERFORMED ALL THE MODELS. THE PROPHET HAS PROVED ITS PREDICTIVE POWER FOR SEASONAL DATA SETS. #### THUS OUR WINNER IS 'PROPHET' MODEL.