We are trying to forecast a time series data of a retail company. The data is about sales of the thousands of products of different families sold at stores.The training data includes dates, store and product information, whether that item was being promoted, as well as the sales numbers. The data is publically available on Kaggle website. The dataset have sale details for various products on day to day basis. The data has various aspects which can be explored going ahead. For now, we will be using single time series of total daily sales across all the products and try to forecast the sales with different approaches. I find this tak challenging as it involves various factors such as seasonality,holiday events along with forecasting sales.
sales_train.csv The training data, comprising time series of features store_nbr, family, and on promotion as well as the target sales. store_nbr identifies the store at which the products are sold. family identifies the type of product sold. on promotion gives the total number of items in a product family that were being promoted at a store at a given date.
spec_tbl_df [2,935,849 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ date : Date[1:2935849], format: "2013-01-02" "2013-01-03" ...
$ date_block_num: num [1:2935849] 0 0 0 0 0 0 0 0 0 0 ...
$ shop_id : num [1:2935849] 59 25 25 25 25 25 25 25 25 25 ...
$ item_id : num [1:2935849] 22154 2552 2552 2554 2555 ...
$ item_price : num [1:2935849] 999 899 899 1709 1099 ...
$ item_cnt_day : num [1:2935849] 1 1 -1 1 1 1 1 1 1 3 ...
- attr(*, "spec")=
.. cols(
.. date = col_character(),
.. date_block_num = col_double(),
.. shop_id = col_double(),
.. item_id = col_double(),
.. item_price = col_double(),
.. item_cnt_day = col_double()
.. )
- attr(*, "problems")=<externalptr>
date date_block_num shop_id item_id
Min. :2013-01-01 Min. : 0.00 Min. : 0 Min. : 0
1st Qu.:2013-08-01 1st Qu.: 7.00 1st Qu.:22 1st Qu.: 4476
Median :2014-03-04 Median :14.00 Median :31 Median : 9343
Mean :2014-04-03 Mean :14.57 Mean :33 Mean :10197
3rd Qu.:2014-12-05 3rd Qu.:23.00 3rd Qu.:47 3rd Qu.:15684
Max. :2015-10-31 Max. :33.00 Max. :59 Max. :22169
item_price item_cnt_day
Min. : -1.0 Min. : -22.000
1st Qu.: 249.0 1st Qu.: 1.000
Median : 399.0 Median : 1.000
Mean : 890.9 Mean : 1.243
3rd Qu.: 999.0 3rd Qu.: 1.000
Max. :307980.0 Max. :2169.000
[1] "Unique block numbers are the months in the data"
[1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
[26] 25 26 27 28 29 30 31 32 33
[1] "Number of NA values in columns are"
$date
[1] 0
$date_block_num
[1] 0
$shop_id
[1] 0
$item_id
[1] 0
$item_price
[1] 0
$item_cnt_day
[1] 0
date date_block_num shop_id item_id
Min. :2013-01-01 Min. : 0.00 Min. : 0 Min. : 0
1st Qu.:2013-08-01 1st Qu.: 7.00 1st Qu.:22 1st Qu.: 4476
Median :2014-03-04 Median :14.00 Median :31 Median : 9343
Mean :2014-04-03 Mean :14.57 Mean :33 Mean :10197
3rd Qu.:2014-12-05 3rd Qu.:23.00 3rd Qu.:47 3rd Qu.:15684
Max. :2015-10-31 Max. :33.00 Max. :59 Max. :22169
item_price item_cnt_day
Min. : -1.0 Min. : -22.000
1st Qu.: 249.0 1st Qu.: 1.000
Median : 399.0 Median : 1.000
Mean : 890.9 Mean : 1.243
3rd Qu.: 999.0 3rd Qu.: 1.000
Max. :307980.0 Max. :2169.000
We could see in summry that there is some inconsistancy in price and item count per day column. We have some negative values which does not make sense. Lets see in detail.
[1] "Number of negative values in item_cnt_day"
[1] 7356
[1] "Number of negative values in item_price"
[1] 1
There is value in item price which is too far from other values, as it is only one value there is high chance of error, so we replace that by median and plot new boxplot.
| Name | Class | Values | Missing | Summary |
|---|---|---|---|---|
| date | Date | Time: 2013-01-01 to 2015-10-31 | 0 | mean: 16163.2392936421, sd: 286.899, nuniq: 1034 |
| date_block_num | numeric | Num: 0 to 33 | 0 | mean: 14.57, sd: 9.423, nuniq: 34 |
| shop_id | numeric | Num: 0 to 59 | 0 | mean: 33.002, sd: 16.227, nuniq: 60 |
| item_id | numeric | Num: 0 to 22169 | 0 | mean: 10197.227, sd: 6324.297, nuniq: 21807 |
| item_price | numeric | Num: -1 to 307980 | 0 | mean: 890.853, sd: 1729.8, nuniq: 19993 |
| item_cnt_day | numeric | Num: -22 to 2169 | 0 | mean: 1.243, sd: 2.619, nuniq: 198 |
date daily_Sales
Min. :2013-01-01 Min. : 11.94
1st Qu.:2013-09-16 1st Qu.: 20.96
Median :2014-06-01 Median : 26.64
Mean :2014-06-01 Mean : 32.87
3rd Qu.:2015-02-14 3rd Qu.: 36.94
Max. :2015-10-31 Max. :365.45
Looking at plot, the series looks quite stationary, but have seasonal factors involved. The trend and variance seems to be same overall series.
The series appeared to be bit stationary before in terms of both variance and mean. Plotting again with trend line.The trend is bit constant here.
From above plots we could see that there is strong autocorrelation in the data
Augmented Dickey-Fuller Test
data: sales$daily_Sales
Dickey-Fuller = -5.8801, Lag order = 10, p-value = 0.01
alternative hypothesis: stationary
KPSS Test for Level Stationarity
data: sales$daily_Sales
KPSS Level = 0.41177, Truncation lag parameter = 7, p-value = 0.07208
The ADF test has Null hypothesis that given series is not stationary and KPSS test has Null Hypothesis as the given series is stationary. Thus both test suggest that given series is already stationary.
We will try to check how much difference it makes if we transformed the given series.
The distribution looks right skewed.
Now the distribution of the sales is close to normal. But Remember, we dont look for the ditribution of Y variable to be normal, that is not our concern. So we just leave it here.
There seem to be no difference in original and transformed series here, we have already checked that our series is stationary. Box-Cox performed well here by not doing anything.
Augmented Dickey-Fuller Test
data: train_log$sale_log
Dickey-Fuller = -5.3587, Lag order = 10, p-value = 0.01
alternative hypothesis: stationary
KPSS Test for Level Stationarity
data: train_log$sale_log
KPSS Level = 0.6095, Truncation lag parameter = 7, p-value = 0.02177
[1] "'\n "
Augmented Dickey-Fuller Test
data: sales$daily_Sales
Dickey-Fuller = -5.8801, Lag order = 10, p-value = 0.01
alternative hypothesis: stationary
KPSS Test for Level Stationarity
data: sales$daily_Sales
KPSS Level = 0.41177, Truncation lag parameter = 7, p-value = 0.07208
Augmented Dickey-Fuller Test
data: train_box_cox$sale_box_cox
Dickey-Fuller = -5.8801, Lag order = 10, p-value = 0.01
alternative hypothesis: stationary
KPSS Test for Level Stationarity
data: train_box_cox$sale_box_cox
KPSS Level = 0.41177, Truncation lag parameter = 7, p-value = 0.07208
Both test indicates that the series is stationary As our series is already stationary, we will move ahead with orginal series.
The above ACF and PACF plot shows us that there is high auto correlation. The ACF has dampning effect and PACF shows no significant spike after 7. This confirms that this is AR process. The spikes in PACF are significant at level 3 and 7. so we will check for both levels.
Call:
arima(x = sales$daily_Sales, order = c(7, 0, 0))
Coefficients:
ar1 ar2 ar3 ar4 ar5 ar6 ar7 intercept
0.4185 0.0377 0.0834 0.0072 -0.0030 0.0735 0.1731 32.8466
s.e. 0.0306 0.0333 0.0333 0.0334 0.0332 0.0332 0.0306 2.6326
sigma^2 estimated as 321.5: log likelihood = -4452.28, aic = 8922.55
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.008477461 17.93103 8.288944 -11.71179 24.19535 0.886353
ACF1
Training set 0.001486589
Ljung-Box test
data: Residuals from ARIMA(7,0,0) with non-zero mean
Q* = 3.0945, df = 3, p-value = 0.3773
Model df: 8. Total lags used: 11
The residual are looking Good. But there seems to be litle seaonal factor left according to residual graphs. But we seems to be close to the best model. Here we will try different models near to (3,0,0) and comapre there AIC’s.
Call:
arima(x = sales$daily_Sales, order = c(3, 0, 0))
Coefficients:
ar1 ar2 ar3 intercept
0.4687 0.0540 0.1358 32.8368
s.e. 0.0308 0.0341 0.0308 1.6808
sigma^2 estimated as 342.4: log likelihood = -4484.62, aic = 8979.24
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 0.01313103 18.50431 9.040962 -13.37059 27.32408 0.9667678
ACF1
Training set -0.008090044
Ljung-Box test
data: Residuals from ARIMA(3,0,0) with non-zero mean
Q* = 65.071, df = 6, p-value = 4.171e-12
Model df: 4. Total lags used: 10
If we compare the two, we see that there are still some significant spike left at lag 7 in ARIMA(3,0,0). There are no significant spikes for model ARIMA(7,0,0). The error distribution looks normal expect some large error at last. That might be the error becuase of some high seasonal value. Overall the ARIMA(7,0,0) looks good.
df AIC
arima(sales$daily_Sales, order = c(7, 0, 0)) 9 8922.551
arima(sales$daily_Sales, order = c(8, 0, 0)) 10 8924.506
arima(sales$daily_Sales, order = c(4, 0, 0)) 6 8977.230
arima(sales$daily_Sales, order = c(5, 0, 0)) 7 8973.788
arima(sales$daily_Sales, order = c(6, 0, 0)) 8 8952.044
arima(sales$daily_Sales, order = c(7, 0, 1)) 10 8924.527
arima(sales$daily_Sales, order = c(6, 0, 1)) 9 8937.107
arima(sales$daily_Sales, order = c(5, 0, 1)) 8 8949.164
arima(sales$daily_Sales, order = c(9, 0, 0)) 11 8921.953
The model looks better comapared to others. But we got some low values when we added MA component in the model. Also when we look at the residual plot, we could see there some spikes at regular intervals. This might be taken care if we add MA component. Lets try different values and compare the AIC’s again.
df AIC
arima(sales$daily_Sales, order = c(7, 0, 2)) 11 8922.915
arima(sales$daily_Sales, order = c(7, 0, 3)) 12 8916.015
arima(sales$daily_Sales, order = c(7, 0, 4)) 13 8916.173
arima(sales$daily_Sales, order = c(6, 0, 2)) 10 8912.419
arima(sales$daily_Sales, order = c(6, 0, 3)) 11 8904.982
arima(sales$daily_Sales, order = c(6, 0, 4)) 12 8908.598
arima(sales$daily_Sales, order = c(5, 0, 2)) 9 8919.983
arima(sales$daily_Sales, order = c(5, 0, 3)) 10 8916.787
arima(sales$daily_Sales, order = c(8, 0, 4)) 14 8918.459
arima(sales$daily_Sales, order = c(8, 0, 2)) 12 8924.735
arima(sales$daily_Sales, order = c(8, 0, 3)) 13 8926.730
df AIC
arima(sales$daily_Sales, order = c(3, 0, 0)) 5 8979.244
arima(sales$daily_Sales, order = c(3, 0, 1)) 6 8950.477
arima(sales$daily_Sales, order = c(3, 0, 2)) 7 8949.107
arima(sales$daily_Sales, order = c(3, 0, 3)) 8 8954.167
arima(sales$daily_Sales, order = c(4, 0, 1)) 7 8952.477
arima(sales$daily_Sales, order = c(4, 0, 2)) 8 8954.092
arima(sales$daily_Sales, order = c(4, 0, 3)) 9 8914.693
arima(sales$daily_Sales, order = c(5, 0, 1)) 8 8949.164
arima(sales$daily_Sales, order = c(5, 0, 2)) 9 8919.983
arima(sales$daily_Sales, order = c(5, 0, 3)) 10 8916.787
arima(sales$daily_Sales, order = c(2, 0, 1)) 5 8949.963
We could see that we got the lowest AIC for ARIMA(6,0,4). Lets try to check its residual graphs.
Ljung-Box test
data: Residuals from ARIMA(6,0,3) with non-zero mean
Q* = 15.419, df = 3, p-value = 0.001492
Model df: 10. Total lags used: 13
The AIC is reduced compare to ARIMA(7,0,0), but residual has still some spikes left. Keeping the lag at 7 in mind, we will try MA component for lag 7.
Ljung-Box test
data: Residuals from ARIMA(7,0,3) with non-zero mean
Q* = 19.952, df = 3, p-value = 0.0001737
Model df: 11. Total lags used: 14
We are still left with the spike. We will try the next best AIC that is ARIMA (7,0,2).
Ljung-Box test
data: Residuals from ARIMA(7,0,2) with non-zero mean
Q* = 1.9906, df = 3, p-value = 0.5744
Model df: 10. Total lags used: 13
The model chosen at start ARIMA(7,0,0) is gave closely similar results to ARIMA(7,0,2). Lets try to compare them.
Call:
arima(x = sales$daily_Sales, order = c(7, 0, 0))
Coefficients:
ar1 ar2 ar3 ar4 ar5 ar6 ar7 intercept
0.4185 0.0377 0.0834 0.0072 -0.0030 0.0735 0.1731 32.8466
s.e. 0.0306 0.0333 0.0333 0.0334 0.0332 0.0332 0.0306 2.6326
sigma^2 estimated as 321.5: log likelihood = -4452.28, aic = 8922.55
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.008477461 17.93103 8.288944 -11.71179 24.19535 0.886353
ACF1
Training set 0.001486589
Call:
arima(x = sales$daily_Sales, order = c(7, 0, 2))
Coefficients:
ar1 ar2 ar3 ar4 ar5 ar6 ar7 ma1 ma2
0.6117 -0.2929 0.1873 0.0006 0.0174 0.0797 0.1640 -0.1936 0.2589
s.e. 0.2323 0.2007 0.0606 0.0412 0.0388 0.0383 0.0502 0.2352 0.1229
intercept
32.8431
s.e. 2.5282
sigma^2 estimated as 320.4: log likelihood = -4450.46, aic = 8922.91
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.00752713 17.89927 8.219159 -11.61345 23.95606 0.8788907
ACF1
Training set 0.0009992537
There is not much of difference in the models, thus we will go ahead with a simpler model .i.e ARIMA(7,0,0)
We havent checked the ARIMA(3,0,0) counter parts,as we have selected ARIMA(7,0,0) now, we have options of ARIMA(3,0,0) counterparts with lower AIC’s. Lets check there residuals
Ljung-Box test
data: Residuals from ARIMA(4,0,3) with non-zero mean
Q* = 34.42, df = 3, p-value = 1.616e-07
Model df: 8. Total lags used: 11
Ljung-Box test
data: Residuals from ARIMA(5,0,3) with non-zero mean
Q* = 34.14, df = 3, p-value = 1.851e-07
Model df: 9. Total lags used: 12
The spike at 7 is significant in all the models except ARIMA(7,0,0). Thus we will select ARIMA(7,0,0) as our best model. This could be the result of weekly seasonality present in the data set.
Ljung-Box test
data: Residuals from ARIMA(7,0,0) with non-zero mean
Q* = 3.0945, df = 3, p-value = 0.3773
Model df: 8. Total lags used: 11
Series: sales$daily_Sales
ARIMA(1,0,2) with non-zero mean
Coefficients:
ar1 ma1 ma2 mean
0.9424 -0.515 -0.2076 32.8992
s.e. 0.0173 0.036 0.0333 2.6977
sigma^2 = 333.7: log likelihood = -4469.45
AIC=8948.91 AICc=8948.96 BIC=8973.61
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.01805955 18.23338 8.830803 -12.60878 26.23437 0.9442951
ACF1
Training set 0.004409511
We are getting different model ARIMA(1,0,2) than our best estimation from auto.arima
Performing Ljung Box test for the both best model and auto model at different lags, the Ljung-Box test has hypothesis that the residuals are independent. Thus the P value less than 0.05 indicates that there exist a correlation in the residuals.
Box-Ljung test
data: best_fit$residuals
X-squared = 0.0022917, df = 1, p-value = 0.9618
Box-Ljung test
data: auto_fit$residuals
X-squared = 0.020163, df = 1, p-value = 0.8871
Box-Ljung test
data: best_fit$residuals
X-squared = 0.35142, df = 3, p-value = 0.9501
Box-Ljung test
data: auto_fit$residuals
X-squared = 0.36797, df = 3, p-value = 0.9468
Box-Ljung test
data: best_fit$residuals
X-squared = 0.48931, df = 7, p-value = 0.9995
Box-Ljung test
data: auto_fit$residuals
X-squared = 29.999, df = 7, p-value = 9.501e-05
Box-Ljung test
data: best_fit$residuals
X-squared = 2.8374, df = 10, p-value = 0.985
Box-Ljung test
data: auto_fit$residuals
X-squared = 37.105, df = 10, p-value = 5.427e-05
[1] 17.93103
[1] 18.23338
[1] 8.288944
[1] 8.830803
pred.pred pred.se date
Min. :26.55 Min. :17.93 Min. :2015-10-31 12:00:00
1st Qu.:29.83 1st Qu.:21.07 1st Qu.:2015-11-07 18:00:00
Median :30.90 Median :22.03 Median :2015-11-15 00:00:00
Mean :30.39 Mean :21.53 Mean :2015-11-15 00:00:00
3rd Qu.:31.64 3rd Qu.:22.35 3rd Qu.:2015-11-22 06:00:00
Max. :32.11 Max. :22.46 Max. :2015-11-29 12:00:00
Identifying change points for trend in the series.The single changepoint detected is in January 2015, where we can see that the next trend is bit downward side.
[1] 748
Trying to check for multiple changing point with mean values.
The multiple change points can be seen in the plot.
The outliers according to the boxplot above are highlighted below. One high value above 300 looks abnormal, but it lies in peak seasonal period, thus cannot be ruled out. Keepinng as it is.
cleaned_data = tsclean(sales$daily_Sales)
clean_full_y = sales %>%
mutate(
row_number = row_number(),
anomaly = if_else(row_number %in% outliers$index,TRUE,as.logical(NA))
#cleaned_data = cleaned_data
) %>%
ggplot()+
geom_line(aes(date,cleaned_data))+
geom_point(aes(date,cleaned_data,color=anomaly))+
theme_bw()+
ylim(0,400)+
guides(color='none')+
labs(title = 'Outliers Replaced') +
scale_colour_discrete(na.translate = F)
plot = clean_full_y
clean_full_y
Model automatically disable daily seasonality as it detects there could be no daily seasonality present.
We try check if there is any difference by turning on the daily seasonality or not. #### Ckeck by Turning on the daily seasonality
.Not much of a difference here. Thus model correctly disabled the daily seasonality
.
The prophet is considering seasonality correctly, We dont need to take account for daily seasonality here. We can check for additive and multiplicative seasonality for considration.
ds Christmas Day New Year's Day
1 2014-12-25 -18.6936 0.00000
2 2015-01-01 0.0000 -57.80842
[1] "RMSE: 10.91"
[1] "MAE: 8.67"
[1] "MAPE: 0.38"
Cross- Validation for the window of 30 days.
[1] "2014-12-02 GMT" "2015-01-01 GMT" "2015-01-31 GMT" "2015-03-02 GMT"
[5] "2015-04-01 GMT" "2015-05-01 GMT" "2015-05-31 GMT"
Model In.Sample.RMSE In.Sample.MAE
1 ARIMA 17.931 8.289
2 AUTO ARIMA 18.233 8.831
3 PROPHET 10.911 8.666
---
title: "Forecasating the sales data"
output:
flexdashboard::flex_dashboard:
social: menu
source: embed
vertical_layout: fill
orientation: columns
theme: cerulean
---
```{r,echo=FALSE,include=FALSE}
rm(list = ls())
#install.packages("tidyverse")
#install.packages("datatable")
library(dplyr)
library(tidyverse)
library(lubridate)
library(forecast) #boxcox
library(tseries)
library(vtable)
library(flexdashboard) #Presentation quality
library(data.table)
```
INTRODUCTION
=======================================================================
## ABOUT DATA {.tabset}
We are trying to forecast a time series data of a retail company.
The data is about sales of the thousands of products of different families sold at stores.The training data includes dates, store and product information, whether that item was being promoted, as well as the sales numbers.
The data is publically available on Kaggle website.
The dataset have sale details for various products on day to day basis. The data has various aspects which can be explored going ahead. For now, we will be using single time series of total daily sales across all the products and try to forecast the sales with different approaches.
I find this tak challenging as it involves various factors such as seasonality,holiday events along with forecasting sales.
### Data summary
#### File Descriptions and Data Field Information
sales_train.csv
The training data, comprising time series of features store_nbr, family, and on promotion as well as the target sales.
store_nbr identifies the store at which the products are sold.
family identifies the type of product sold.
on promotion gives the total number of items in a product family that were being promoted at a store at a given date.
```{r,echo=FALSE,include=FALSE}
sales <- readr::read_csv('D:\\VAIBHAV\\HOMEWORK\\Time Series\\ASSIGNMENT\\sales_train.csv')
```
#### Structure of Data
```{r,echo=FALSE}
#str(sales)
sales$date<- dmy(sales$date)
str(sales)
```
```{r,echo=FALSE}
summary(sales)
```
### BASIC DATA QUALITY CHECKS
#### Null value check
```{r,echo=FALSE}
attach(sales)
print("Unique block numbers are the months in the data")
unique(date_block_num)
print("Number of NA values in columns are")
lapply(sales,function(x) { length(which(is.na(sales)))})
```
#### Summary
```{r,echo=FALSE}
summary(sales)
```
We could see in summry that there is some inconsistancy in price and item count per day column. We have some negative values which does not make sense. Lets see in detail.
```{r,echo=FALSE}
print('Number of negative values in item_cnt_day')
sum(item_cnt_day<0)
print("Number of negative values in item_price")
sum(item_price<0)
```
#### Boxplot for both columns
```{r,echo=FALSE}
boxplot(item_price,main = " Item Price")
boxplot(item_cnt_day, main = "Item count per day")
```
There is value in item price which is too far from other values, as it is only one value there is high chance of error, so we replace that by median and plot new boxplot.
```{r,echo=FALSE}
item_price[item_price < 0] <- 0
item_cnt_day[item_cnt_day < 0] <- 0
item_price[item_price > 307970] <- median(item_price)
```
#### Updated box Plot
```{r,echo=FALSE}
boxplot(item_price,main="Item Price")
boxplot(item_cnt_day,main="Item count per day")
```
#### Summary statistics
```{r,echo=FALSE}
vtable(sales,lush = TRUE)
```
Creating our time series to track total daily sales. Keeping only two columns hereafter.
```{r,echo=FALSE}
sales <- sales %>% group_by(date) %>%
mutate(daily_Sales= sum(item_price*item_cnt_day)/100000) %>%
arrange(date) %>%
ungroup()
sales <- subset(sales,select = c(date,daily_Sales))
sales <- sales %>% group_by(date) %>% filter(!duplicated(date))
```
```{r,echo=FALSE}
summary(sales)
```
#### Daily total sales plot
```{r,echo=FALSE}
sales %>%
ggplot()+
geom_line(aes(date,daily_Sales))+
theme_bw()+
xlab("Date")+
ylab("TOTAL DAILY SALES")
```
Looking at plot, the series looks quite stationary, but have seasonal factors involved. The trend and variance seems to be same overall series.
#### Cummulative sales per day.
```{r,echo=FALSE}
sales %>% ggplot(aes(colour="Blue"))+
geom_line(aes(sales$date,cumsum(sales$daily_Sales)),size=1.5,show.legend = FALSE)+
theme_bw()+
ggtitle("TOTAL CUMMULATIVE SALES BY DATE")+
ylab("Cummulative Sales")+
xlab("DATE")
#?GeomLine
```
APPROACHING TO MODELLING
=======================================================================
## Understanding the series {.tabset}
### UNDERSTANDING THE SERIES
```{r,echo=FALSE,include=FALSE}
sales %>% ggplot()+
geom_line(aes(sales$date,sales$daily_Sales))+
theme_bw()+
#geom_smooth(aes(sales$date,sales$daily_Sales),method = "lm")+
ggtitle("TOATAL DAILY SALES")+
xlab("Date")+
ylab("Total daily sales")
```
The series appeared to be bit stationary before in terms of both variance and mean. Plotting again with trend line.The trend is bit constant here.
#### Lets see if we have any auto-correlation in the series to move ahead
#### ACF & PACF GRAPHS
```{r,echo=FALSE,message=FALSE,warning=FALSE}
attach(sales)
par(mfrow=c(1,2))
acf(sales$daily_Sales,lag.max = 40)
Pacf(sales$daily_Sales, lag.max = 40)
```
From above plots we could see that there is strong autocorrelation in the data
#### Checking wheather the given series is stationary or not
```{r,echo=FALSE,warning=FALSE}
adf.test(sales$daily_Sales)
kpss.test(sales$daily_Sales)
```
The ADF test has Null hypothesis that given series is not stationary and KPSS test has Null Hypothesis as the given series is stationary. Thus both test suggest that given series is already stationary.
We will try to check how much difference it makes if we transformed the given series.
#### Lets check the histogram of our sales
```{r,echo=FALSE}
hist(sales$daily_Sales,main="Daily sales",xlab = "DAILY SALES", ylab = "Frequency")
```
The distribution looks right skewed.
#### We will log transform the series and plot the histogram.
```{r,echo=FALSE}
sales_1 <- sales
train_log = sales_1 %>%
mutate(sale_log = log1p(daily_Sales))
train_log %>%
ggplot()+
geom_line(aes(date,sale_log))+
theme_bw()+
ggtitle("SALES over Time (Log)")+
ylab("DAILY SALE (Log)")+
xlab("Date")
```
Now the distribution of the sales is close to normal. But Remember, we dont look for the ditribution of Y variable to be normal, that is not our concern. So we just leave it here.
#### Lets try box-cox transformation and check
```{r,echo=FALSE}
train_box_cox <- sales_1 %>%
mutate(sale_box_cox = forecast::BoxCox(daily_Sales,lambda="auto"))
train_box_cox %>% ggplot()+
geom_line(aes(date,sale_box_cox))+
theme_bw()+
ggtitle("SALES PER DAY-BOX-COX TRANSFORMED")+
ylab("TOTAL SALES")
```
There seem to be no difference in original and transformed series here, we have already checked that our series is stationary. Box-Cox performed well here by not doing anything.
#### Comparing histogram of all three
```{r,echo=FALSE}
hist(sales$daily_Sales,main="Daily sales",xlab = "DAILY SALES", ylab = "Frequency")
hist(train_log$sale_log,main="Daily sales Log transformed",xlab = "DAILY SALES", ylab = "Frequency")
hist(train_box_cox$sale_box_cox,main="Daily sales box-cox transformed",xlab = "DAILY SALES", ylab = "Frequency")
```
#### Performing stationarity test on all.
```{r,echo=FALSE}
adf.test(train_log$sale_log)
kpss.test(train_log$sale_log)
print("'
")
adf.test(sales$daily_Sales)
kpss.test(sales$daily_Sales)
adf.test(train_box_cox$sale_box_cox)
kpss.test(train_box_cox$sale_box_cox)
```
Both test indicates that the series is stationary
As our series is already stationary, we will move ahead with orginal series.
### FINALISING THE MODEL
#### As our series is now stationary lets try to figure out what type of series it is.
Checking for auto correlation.
```{r,echo=FALSE}
par(mfrow=c(1,2))
acf(sales$daily_Sales,main="Daily sales ACF")
pacf(sales$daily_Sales,main="Daily sales PACF")
```
The above ACF and PACF plot shows us that there is high auto correlation. The ACF has dampning effect and PACF shows no significant spike after 7. This confirms that this is AR process. The spikes in PACF are significant at level 3 and 7. so we will check for both levels.
```{r,echo=FALSE}
model_1 = arima(sales$daily_Sales,order = c(7,0,0))
summary(model_1)
```
```{r,echo=FALSE}
forecast::checkresiduals(model_1)
```
The residual are looking Good. But there seems to be litle seaonal factor left according to residual graphs. But we seems to be close to the best model.
Here we will try different models near to (3,0,0) and comapre there AIC's.
```{r,echo=FALSE}
model_2 = arima(sales$daily_Sales,order = c(3,0,0))
summary(model_2)
```
```{r,echo=FALSE}
forecast::checkresiduals(model_2)
```
If we compare the two, we see that there are still some significant spike left at lag 7 in ARIMA(3,0,0). There are no significant spikes for model ARIMA(7,0,0). The error distribution looks normal expect some large error at last. That might be the error becuase of some high seasonal value. Overall the ARIMA(7,0,0) looks good.
#### To further check if any other close model is better than ARIMA(7,0,0). We will compare the AIC's of them.
```{r,echo=FALSE}
AIC(
arima(sales$daily_Sales,order=c(7,0,0)),
arima(sales$daily_Sales,order=c(8,0,0)),
arima(sales$daily_Sales,order=c(4,0,0)),
arima(sales$daily_Sales,order=c(5,0,0)),
arima(sales$daily_Sales,order=c(6,0,0)),
arima(sales$daily_Sales,order=c(7,0,1)),
arima(sales$daily_Sales,order=c(6,0,1)),
arima(sales$daily_Sales,order=c(5,0,1)),
arima(sales$daily_Sales,order=c(9,0,0))
)
```
The model looks better comapared to others. But we got some low values when we added MA component in the model. Also when we look at the residual plot, we could see there some spikes at regular intervals. This might be taken care if we add MA component. Lets try different values and compare the AIC's again.
```{r,echo=FALSE}
AIC(
arima(sales$daily_Sales,order=c(7,0,2)),
arima(sales$daily_Sales,order=c(7,0,3)),
arima(sales$daily_Sales,order=c(7,0,4)),
arima(sales$daily_Sales,order=c(6,0,2)),
arima(sales$daily_Sales,order=c(6,0,3)),
arima(sales$daily_Sales,order=c(6,0,4)),
arima(sales$daily_Sales,order=c(5,0,2)),
arima(sales$daily_Sales,order=c(5,0,3)),
arima(sales$daily_Sales,order=c(8,0,4)),
arima(sales$daily_Sales,order=c(8,0,2)),
arima(sales$daily_Sales,order=c(8,0,3))
)
```
```{r,echo=FALSE}
AIC(
arima(sales$daily_Sales,order=c(3,0,0)),
arima(sales$daily_Sales,order=c(3,0,1)),
arima(sales$daily_Sales,order=c(3,0,2)),
arima(sales$daily_Sales,order=c(3,0,3)),
arima(sales$daily_Sales,order=c(4,0,1)),
arima(sales$daily_Sales,order=c(4,0,2)),
arima(sales$daily_Sales,order=c(4,0,3)),
arima(sales$daily_Sales,order=c(5,0,1)),
arima(sales$daily_Sales,order=c(5,0,2)),
arima(sales$daily_Sales,order=c(5,0,3)),
arima(sales$daily_Sales,order=c(2,0,1))
)
```
We could see that we got the lowest AIC for ARIMA(6,0,4). Lets try to check its residual graphs.
```{r,echo=FALSE}
model_7 <- arima(sales$daily_Sales,order = c(6,0,3))
forecast::checkresiduals(model_7)
```
The AIC is reduced compare to ARIMA(7,0,0), but residual has still some spikes left. Keeping the lag at 7 in mind, we will try MA component for lag 7.
```{r,echo=FALSE}
model_3 <- arima(sales$daily_Sales,order = c(7,0,3))
forecast::checkresiduals(model_3)
```
We are still left with the spike. We will try the next best AIC that is ARIMA (7,0,2).
```{r,echo=FALSE}
model_4 <- arima(sales$daily_Sales,order = c(7,0,2))
forecast::checkresiduals(model_4)
```
The model chosen at start ARIMA(7,0,0) is gave closely similar results to ARIMA(7,0,2). Lets try to compare them.
```{r,echo=FALSE}
summary(model_1)
summary(model_4)
```
There is not much of difference in the models, thus we will go ahead with a simpler model .i.e ARIMA(7,0,0)
We havent checked the ARIMA(3,0,0) counter parts,as we have selected ARIMA(7,0,0) now, we have options of ARIMA(3,0,0) counterparts with lower AIC's. Lets check there residuals
```{r,echo=FALSE}
model_5 <- arima(sales$daily_Sales,order = c(4,0,3))
forecast::checkresiduals(model_5)
model_6 <- arima(sales$daily_Sales,order = c(5,0,3))
forecast::checkresiduals(model_6)
```
The spike at 7 is significant in all the models except ARIMA(7,0,0). Thus we will select ARIMA(7,0,0) as our best model. This could be the result of weekly seasonality present in the data set.
#### From both AIC and BIC, the best model is ARIMA(7,0,0)
```{r,echo=FALSE}
best_fit <- arima(sales$daily_Sales,order = c(7,0,0))
forecast::checkresiduals(best_fit)
```
### AUTO ARIMA SUGGESTIONS.
#### Fitting the model
```{r,echo=FALSE}
auto_fit <- auto.arima(sales$daily_Sales,seasonal= TRUE,stepwise=FALSE,approximation=FALSE,max.p = 15,max.q = 15)
summary(auto_fit)
```
We are getting different model ARIMA(1,0,2) than our best estimation from auto.arima
#### Checking the residuals.
Performing Ljung Box test for the both best model and auto model at different lags, the Ljung-Box test has hypothesis that the residuals are independent. Thus the P value less than 0.05 indicates that there exist a correlation in the residuals.
```{r,echo=FALSE}
Box.test(best_fit$residuals,type = "Ljung-Box",lag=1)
Box.test(auto_fit$residuals,type = "Ljung-Box",lag=1)
```
```{r,echo=FALSE}
Box.test(best_fit$residuals,type = "Ljung-Box",lag=3)
Box.test(auto_fit$residuals,type = "Ljung-Box",lag=3)
```
```{r,echo=FALSE}
Box.test(best_fit$residuals,type = "Ljung-Box",lag=7)
Box.test(auto_fit$residuals,type = "Ljung-Box",lag=7)
```
```{r,echo=FALSE}
Box.test(best_fit$residuals,type = "Ljung-Box",lag=10)
Box.test(auto_fit$residuals,type = "Ljung-Box",lag=10)
```
#### The auto arima model fails the Ljung-Box test at lag seven,thus our best model will be ARIMA(7,0,0)
### IN SAMPLE RMSE
```{r,echo=FALSE}
resid_1 = best_fit$residuals
resid_2 = auto_fit$residuals
pred_1 = sales$daily_Sales-resid_1
pred_2 = sales$daily_Sales-resid_2
ggplot()+
geom_line(aes(sales$daily_Sales,pred_1),color='red',alpha=0.7)+
geom_line(aes(sales$daily_Sales,pred_2),color='blue',alpha=0.7)+
ggtitle("BEST MODEL VS AUTO ARIMA")+
xlab("Daily sales")+
ylab("Prediction")
# geom_line(aes(train_main_1$Week,train_main_1$week_bx))
```
```{r,echo=FALSE}
RMSE_best = sqrt(mean((pred_1 - sales$daily_Sales)^2))
RMSE_best
RMSE_auto = sqrt(mean((pred_2 - sales$daily_Sales)^2))
RMSE_auto
```
#### We now will calculate MAE for fitted values
```{r,echo=FALSE}
MAE_best = mean(abs(pred_1 - sales$daily_Sales))
MAE_best
MAE_auto = mean(abs(pred_2 - sales$daily_Sales))
MAE_auto
```
#### The MAE is less for the model we choose
```{r,echo=FALSE}
best_fit %>%
forecast(h=30) %>%
autoplot()
```
```{r,echo=FALSE}
auto_fit %>%
forecast(h=30) %>%
autoplot()
```
```{r, echo=FALSE,warning=FALSE}
#length(sales$date)
prediction = predict(best_fit,n.ahead=30)
pred_data = data.frame(
pred = prediction,
date = seq(c(ISOdate(2015,10,31)), by = "day", length.out = 30) #as.Date(ymd(max(sales$date)):ymd(max(sales$date)+days(29)),origin="2013-01-01")
)
summary(pred_data)
pred_merge = sales %>%
full_join(pred_data) %>%
filter(date>ymd("2015-09-01")) %>% #& date<=ymd("2015-03-01")) %>%
mutate(
pred_high = pred.pred+2*pred.se,
pred_low = pred.pred - 2*pred.se,
pred.pred = pred.pred,
error = pred.pred - daily_Sales
)
pred_merge %>%
ggplot()+
geom_line(aes(date,daily_Sales))+
geom_line(aes(date,pred.pred),color='blue')+
xlab("DATE - YEAR 2015")+
ylab("SALES")+
geom_ribbon(aes(x=date,ymin=pred_low,ymax = pred_high),fill='blue',alpha=0.2)
```
```{r,echo=FALSE}
```
USING FB PROPHET
=======================================================================
## Starting with PROPHET{.tabset}
### CHANGEPOINTS
#### Plotting Changepoint
```{r,echo=FALSE,include=FALSE,warning=FALSE,message=FALSE}
dim(sales)
library(changepoint)
```
Identifying change points for trend in the series.The single changepoint detected is in January 2015, where we can see that the next trend is bit downward side.
```{r,echo=FALSE}
cp = cpt.mean(sales$daily_Sales)
sales %>%
ggplot()+
geom_line(aes(date,daily_Sales))+
geom_vline(aes(xintercept=sales$date[cpts(cp)]),color='red')+
theme_bw()+
xlab("DATE")+
ylab("DAILY SALES")
cpts(cp)
```
Trying to check for multiple changing point with mean values.
The multiple change points can be seen in the plot.
```{r,echo=FALSE,warning=FALSE,message=FALSE}
mult_cp=cpt.mean(sales$daily_Sales,method='BinSeg',Q=10)
plot(mult_cp)
```
### OUTLIERS
#### Box Plot for detecting outliers
```{r,echo=FALSE}
box=boxplot(sales$daily_Sales,main="DAILY SALES")
```
The outliers according to the boxplot above are highlighted below.
One high value above 300 looks abnormal, but it lies in peak seasonal period, thus cannot be ruled out. Keepinng as it is.
#### Plotting all the outliers from the boxplot in the series.
```{r,echo=FALSE,warning=FALSE,message=FALSE}
sales %>%
mutate(
anomaly = if_else(daily_Sales %in% box$out,TRUE,as.logical(NA))
) %>%
ggplot()+
geom_line(aes(date,daily_Sales))+
geom_point(aes(date,daily_Sales,color=anomaly))+
theme_bw()+
ylim(0,400)+
ggtitle("Outliers According to Boxplot") +
ylab("DAILY SALES")+
scale_colour_discrete(na.translate = F)
```
```{#r,echo=FALSE,eval=FALSE,include=FALSE}
cleaned_data = tsclean(sales$daily_Sales)
clean_full_y = sales %>%
mutate(
row_number = row_number(),
anomaly = if_else(row_number %in% outliers$index,TRUE,as.logical(NA))
#cleaned_data = cleaned_data
) %>%
ggplot()+
geom_line(aes(date,cleaned_data))+
geom_point(aes(date,cleaned_data,color=anomaly))+
theme_bw()+
ylim(0,400)+
guides(color='none')+
labs(title = 'Outliers Replaced') +
scale_colour_discrete(na.translate = F)
plot = clean_full_y
clean_full_y
```
```{r}
```
### FITTING THE PROPHET MODEL
#### The model is fitted on train data till date 2015-07-01 and forecast is done fo period of 120 days.
```{r,echo=FALSE,include=FALSE,warning=FALSE}
library(prophet)
prophet_sales = sales %>%
rename(ds = date, # Have to name our date variable "ds"
y = daily_Sales) # Have to name our time series "y"
train = prophet_sales %>%
filter(ds%
filter(ds>=ymd("2015-07-01"))
model = prophet(train)
future = make_future_dataframe(model,periods = 120)
forecast = predict(model,future)
```
Model automatically disable daily seasonality as it detects there could be no daily seasonality present.
```{r,echo=FALSE}
plot(model,forecast)+
ylab("TOTAL DAILY SALES")+xlab("Date")+theme_bw()
```
```{r}
```
We try check if there is any difference by turning on the daily seasonality or not.
#### Ckeck by Turning on the daily seasonality
```{r,echo=FALSE,include=FALSE,warning=FALSE}
model_1 = prophet(train,daily.seasonality = TRUE)
future_1 = make_future_dataframe(model,periods = 120)
forecast_1 = predict(model,future)
```
.
```{r,echo=FALSE,warning=FALSE,message=FALSE}
plot(model_1,forecast)+
ylab("TOTAL DAILY SALES")+xlab("Date")+theme_bw()
```
Not much of a difference here. Thus model correctly disabled the daily seasonality
#### Lets see our original series for comparison
```{r,echo=FALSE,warning=FALSE,message=FALSE}
ggplot()+
geom_point(aes(date,daily_Sales))+
theme_bw()
```
```{r}
```
#### Interactive chart to check for perticular period. (Use scroller at bottom)
```{r,echo=FALSE,warning=FALSE,message=FALSE}
dyplot.prophet(model,forecast)
```
### CHECKING FOR COMPONENTS OF TIME SERIES
#### Plot shows the different components which includes, trend, weekly seasonality and yearly seasonality.
```{r,echo=FALSE}
prophet_plot_components(model,forecast)
```
#### Checking changepoint with phophet model
#### Plot shows that two three equidistant change points. The model includes one change point found using boxplot.
```{r,echo=FALSE,include=FALSE}
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Daily Total Sales")
```
```{r,echo=FALSE,warning=FALSE,message=FALSE}
model = prophet(train,n.changepoints=5)
forecast_2 = predict(model,future)
plot(model,forecast_2)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Daily Sales")
```
#### COMPARING FORECAST
#### Comparing forecast with actual observed data. The model fits well but has greater variance at start.
```{r,echo=FALSE,warning=FALSE}
forecast_plot_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2015-07-01"))
forecast_not_scaled = ggplot()+
geom_line(aes(test$ds,test$y))+
geom_line(aes(forecast_plot_data$ds,forecast_plot_data$yhat),color='red')+
xlab("date")+
ylab("Total sales")
forecast_scaled = forecast_not_scaled
forecast_scaled
```
#### Trying to forecast it for 250 days
```{r,echo=FALSE,warning=FALSE}
two_yr_future = make_future_dataframe(model,periods = 250)
two_yr_forecast = predict(model,two_yr_future)
plot(model,two_yr_forecast)+theme_bw()+xlab("Date")+ylab("Total Sales")
```
```{r,eval=FALSE,include=FALSE,echo=FALSE}
train$floor = 0
train$cap = 25
future$floor = 0
future$cap = 25
# Set floor in forecsat data
two_yr_future$floor = 0
two_yr_future$cap = 25
#sat_model = prophet(train,growth='logistic',daily.seasonality = T)
sat_two_yr_forecast = predict(sat_model,two_yr_future)
```
.
```{r}
```
```{r,eval=FALSE,include=FALSE,echo=FALSE}
plot(sat_model,sat_two_yr_forecast)+ylim(0,25)+
theme_bw()+xlab("Date")+ylab("Daily Sales")
```
```{r,echo=FALSE,warning=FALSE,eval=FALSE}
prophet_plot_components(model,forecast)
```
The prophet is considering seasonality correctly, We dont need to take account for daily seasonality here. We can check for additive and multiplicative seasonality for considration.
### SEASONALITY
#### Additive Seasonality
```{r,echo=FALSE,warning=FALSE}
additive = prophet(train)
add_fcst = predict(additive,future)
plot(additive,add_fcst)
#summary(add_fcst)
#summary(train)
```
#### Plotting additive components.
```{r,echo=FALSE,warning=FALSE}
prophet_plot_components(additive,add_fcst)
```
#### Multiplicative seasonality
```{r,echo=FALSE}
multi = prophet(train,seasonality.mode = 'multiplicative')
multi_fcst = predict(multi,future)
plot(multi,multi_fcst)
```
#### Multiplicative components
```{r,echo=FALSE}
prophet_plot_components(multi, multi_fcst)
```
#### There is not much of difference between both.
#### Checking Holiday Impact
#### Leveraging the holiday list available in the prophet model,We could see below which holidays has greater Impact.
```{r,echo=FALSE,warning=FALSE,message=FALSE}
#?prophet()
model = prophet(train,fit=FALSE)
model = add_country_holidays(model,country_name = 'US')
model = fit.prophet(model,train)
forecast = predict(model,future)
prophet_plot_components(model,forecast)
```
#### Checking the amount of impact of two main holidays. New year and Christmas
```{r,echo=FALSE,warning=FALSE}
forecast %>%
filter(
holidays!=0,
ds == ymd("2014-12-25") |
ds == ymd("2015-01-01")) %>%
select(ds,`Christmas Day`,`New Year's Day`)
```
#### Holiday Impact by each holiday in the US calender.
```{r,echo=FALSE,warning=FALSE}
forecast %>%
filter(holidays != 0) %>%
dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>%
mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>%
summarize_if(is.numeric, ~ max(., na.rm = T)) %>%
pivot_longer(
cols = `Christmas Day`:`Washington's Birthday`,
names_to = 'holiday',
values_to = 'effect'
) %>%
ggplot() +
geom_col(aes(effect,holiday))+
theme_bw()
```
### MODEL 'FIT' & CROSS VALIDATION
#### Checking In sample performance by Calculating the MSE,MAE and MAPE of the model.
#### The error values are not large, the fit is good. Generally, Prophet works good on daily data and with seasonality.
```{r,echo=FALSE,warning=FALSE}
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds)) %>%
filter(ds>=ymd("2015-07-01"))
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(test$y - forecast_metric_data$yhat))
MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))
print(paste("RMSE:",round(RMSE,2)))
print(paste("MAE:",round(MAE,2)))
print(paste("MAPE:",round(MAPE,2)))
```
#### Cross Validation
Cross- Validation for the window of 30 days.
```{r,echo=FALSE,warning=FALSE,message=FALSE,include=FALSE}
df.cv <- cross_validation(model, initial = 700, period = 30, horizon = 30, units = 'days')
head(df.cv)
```
#### Cut-off points for the cross validation
```{r,echo=FALSE}
unique(df.cv$cutoff)
```
#### Plotting the comparison of cross validation and actual for different cut-off points.
```{r,echo=FALSE}
df.cv %>%
ggplot()+
geom_point(aes(ds,y)) +
geom_point(aes(ds,yhat,color=factor(cutoff)))+
theme_bw()+
xlab("Date")+
ylab("Total sales")+
scale_color_discrete(name = 'Cutoff')
```
### Conclusion
#### This is our Final part. Lets conclude our best model after comparing in sample and out of sample performance. Here we have our best model as ARIMA(7,0,0) which we have chosen by trial and error. The auto model is model suggested by auto.arima and last is model by using prophet library.
#### Cross validation plot for PROPHET
```{r,echo=FALSE}
plot_cross_validation_metric(df.cv, metric = 'rmse')
```
```{r,echo=FALSE}
#dim(sales)
out_list = lapply(828:1004,function(x){
train = sales$daily_Sales[1:x]
test = sales$daily_Sales[(x+1):(x+29)]
train_mod1 = arima(train,order=c(7,0,0))
train_mod2 = arima(train,order=c(1,0,2))
out_data = data.frame(
train_thru = x,
time_ahead = 1:29,
time_point = x + 1:29,
actual = test,
naive_fcst = train[x],
test_pred1 = predict(train_mod1,29)$pred,
test_pred2 = predict(train_mod2,29)$pred,
error1 = test - predict(train_mod1,29)$pred,
error2 = test - predict(train_mod2,29)$pred
)
})
out_data = rbindlist(out_list)
```
```{r,echo=FALSE,include=FALSE}
out_data %>%
as_tibble()
```
```{r,echo=FALSE,message=FALSE,warning=FALSE}
error_metrics = out_data %>%
mutate(naive_error = actual - naive_fcst) %>%
pivot_longer(
cols = error1:naive_error,
values_to = 'error'
) %>%
group_by(time_ahead,name) %>%
summarize(
RMSE = sqrt(mean(error^2)),
MAE = mean(abs(error)),
MAPE = mean(abs(error / actual))
) %>%
ungroup()
```
#### CROSS VALIDATION GRAPH FOR ARIMA MODELS
```{r,echo=FALSE}
error_metrics %>%
ggplot()+
geom_line(aes(time_ahead,RMSE,color=name))+
xlab("Horizon (days)")
```
#### COMPARING IN SAMPLE RMSE AND MAE
```{r,echo=FALSE}
xyz = data.frame(
Model = c("ARIMA","AUTO ARIMA","PROPHET"),
`In Sample RMSE` = round(c(RMSE_best,RMSE_auto,RMSE),3),
`In Sample MAE` = round(c(MAE_best,MAE_auto,MAE),3)
#`Out of Sample RMSE` = round(c(out_rmse1,out_rmse2,out_rmse3,out_rmse9),3)
)
#%>%
#datatable()
xyz
```
#### LOOKING ALL THE PLOTS AND VALUES ABOVE, WE CONCLUDE THAT PROPHET HAS OUTPERFORMED ALL THE MODELS. THE PROPHET HAS PROVED ITS PREDICTIVE POWER FOR SEASONAL DATA SETS.
#### THUS OUR WINNER IS 'PROPHET' MODEL.