Introduction

A Kaggle competition dataset of Walmart weekly sales was taken for analysis. It contains historical sales data for 45 Walmart stores across the country, spanning roughly three years between 2010 and 2012. Each record gives the last date of the sales week (a Friday), the store number, the department number, the weekly sales, and whether the week is a holiday week or not.

Goal

As the dataset covers multiple stores, each with multiple departments, the goal is to choose one store and one of its departments at random and analyse that series.

Note:

A few things can be expected from this retail dataset based on intuition. Seasonality is likely, driven by yearly holidays and similar events. The basic EDA suggests the dataset may contain some erroneous values, which must be cleaned before starting the time-series forecasting. The forecasting itself will be done on a single store and one of its top departments, chosen at random.

Importing useful libraries for the EDA (a setup chunk is sketched after the list):

  1. dplyr
  2. ggplot2
  3. forecast
  4. vtable
  5. tidyverse
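
A minimal setup chunk for these packages might look like the sketch below. Later chunks also appear to rely on a few additional packages, tseries for the ADF/KPSS tests, patchwork for stacking ggplots, and lubridate for ymd(), which are included here as assumptions.

# Loading the packages used for the EDA and modelling
library(dplyr)      # data wrangling
library(ggplot2)    # plotting
library(forecast)   # Box-Cox, ARIMA, forecasting utilities
library(vtable)     # summary statistics tables
library(tidyverse)  # general tidy tooling
library(tseries)    # adf.test() and kpss.test() (assumed)
library(patchwork)  # combining ggplots with / (assumed)
library(lubridate)  # ymd() for date filtering (assumed)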

SECTION - 1: Exploratory Data Analysis

Taking Input File:

#Reading in the CSV File into the DataFrame
dataf=read.csv("C:/Users/MauP/OneDrive/Desktop/Spring Semester/Flex 1/Forecasting/Project Data - Walmart Sales/Weekly_Sales.csv")

The number of records in the dataset:

## [1] 421570

Looking at the Top 6 Rows (To understand the Data):

##   Store Dept       Date Weekly_Sales IsHoliday
## 1     1    1 2010-02-05     24924.50     FALSE
## 2     1    1 2010-02-12     46039.49      TRUE
## 3     1    1 2010-02-19     41595.55     FALSE
## 4     1    1 2010-02-26     19403.54     FALSE
## 5     1    1 2010-03-05     21827.90     FALSE
## 6     1    1 2010-03-12     21043.39     FALSE

Structure of the Table:

## 'data.frame':    421570 obs. of  5 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Dept        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : chr  "2010-02-05" "2010-02-12" "2010-02-19" "2010-02-26" ...
##  $ Weekly_Sales: num  24925 46039 41596 19404 21828 ...
##  $ IsHoliday   : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...

Converting the Date column's datatype to 'Date'.

#Converting the DataType of the Date Column 
dataf$Date <- as.Date(dataf$Date)

Summary Statistics of the Table:

Summary Statistics

| Variable     | N      | Mean      | Std. Dev. | Min      | Pctl. 25 | Pctl. 50 | Pctl. 75  | Max       |
|--------------|--------|-----------|-----------|----------|----------|----------|-----------|-----------|
| Store        | 421570 | 22.201    | 12.785    | 1        | 11       | 22       | 33        | 45        |
| Dept         | 421570 | 44.26     | 30.492    | 1        | 18       | 37       | 74        | 99        |
| Weekly_Sales | 421570 | 15981.258 | 22711.184 | -4988.94 | 2079.65  | 7612.03  | 20205.853 | 693099.36 |
| IsHoliday    | 421570 |           |           |          |          |          |           |           |
| … No         | 391909 | 93%       |           |          |          |          |           |           |
| … Yes        | 29661  | 7%        |           |          |          |          |           |           |

Luckily, there aren't any null values in the dataset, but on closer inspection, negative values are visible in the Weekly_Sales column.

Number of Negative Values:

## [1] 1285

There are 1285 rows with negative values out of 421,570 rows in total, which is only about 0.3% of the dataset, so they can safely be dropped.

Removing Negative Sales Values:

# Keeping only strictly positive weekly sales (this also drops zero-sales rows)
dataf <- dataf[dataf$Weekly_Sales > 0,]

Summary Statistics of the Table (After Removing Negative Values):

Summary Statistics

| Variable     | N      | Mean      | Std. Dev. | Min  | Pctl. 25 | Pctl. 50 | Pctl. 75  | Max       |
|--------------|--------|-----------|-----------|------|----------|----------|-----------|-----------|
| Store        | 420212 | 22.196    | 12.787    | 1    | 11       | 22       | 33        | 45        |
| Dept         | 420212 | 44.241    | 30.509    | 1    | 18       | 37       | 74        | 99        |
| Weekly_Sales | 420212 | 16033.115 | 22729.492 | 0.01 | 2120.13  | 7661.7   | 20271.265 | 693099.36 |
| IsHoliday    | 420212 |           |           |      |          |          |           |           |
| … No         | 390652 | 93%       |           |      |          |          |           |           |
| … Yes        | 29560  | 7%        |           |      |          |          |           |           |

The dataset is now cleaned and ready for exploration!

Finding Correlations in Data:

Impact of Holidays on the Store Sales:

#Exploring the effect of Holiday
ggplot(dataf, aes(x = Store, y = Weekly_Sales))+ 
  geom_jitter(aes(colour = IsHoliday)) + 
  xlab('Store') +
  ylab('Sales Per Week') +
  ggtitle("Store Weekly Sales - Comparing the Effect of Holiday") + 
  theme(
    legend.position = c(.95, .95),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6)
  )

Inferences from the Plot:

  1. As expected, whether the week contains a holiday affects the weekly sales across the stores.
  2. With all stores plotted, some stores (roughly numbers 10-20) appear to have higher weekly sales than the rest.

Impact of Holidays on the Department Sales:

Inferences from the Plot:

  1. A drastic difference in sales across departments is visible.
  2. Departments such as 12 and 70 show huge spikes in sales; department 12 appears largely unaffected by holidays, while department 70 shows a strong holiday effect.

Boxplotting Weekly sales for both Holiday and not a Holiday:

# Boxplot of weekly sales for holiday vs non-holiday weeks
boxplot(Weekly_Sales ~ IsHoliday, data = dataf,
        main = 'Weekly Sales Comparison - Holidays vs Normal Days', 
        xlab = 'Holiday, True or False',
        ylab = 'Weekly Sales',
        frame = TRUE,
        horizontal = FALSE,
        col = "green")

Both boxplots have outliers, and there is not a big difference in sales between holiday and normal weeks.

Plotting the Average Store Sales, in order to Select the Top 5:
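
The plot below uses a store-level summary data frame, dum_dataf, whose construction is not shown in this section. A minimal sketch of how it might be built with dplyr, assuming the column names implied by the plots:

# Aggregating weekly sales to one row per store with average and total sales
# (a sketch; the report's dum_dataf object is assumed to look like this)
dum_dataf <- dataf %>%
  group_by(Store) %>%
  summarise(Avg_Store_Sales = mean(Weekly_Sales),
            Tot_Sales = sum(Weekly_Sales))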

# Plotting the stores by highest average weekly sales, to select one of the top 5 stores
ggplot(data=dum_dataf, aes(x=reorder(factor(Store), -Avg_Store_Sales), y=Avg_Store_Sales)) +
  geom_bar(stat="identity", fill="brown") +
  xlab('Store') +
  ylab('Average Store Sales')

The Top Stores, as per their Average sales are:

Stores: 20, 4, 14, 13, and 2

Total Sales - Store Comparison:

dum_dataf <- dum_dataf %>% arrange(desc(Tot_Sales))
ggplot(data=dum_dataf, aes(x=reorder(factor(Store), -Tot_Sales), y=Tot_Sales)) +
  geom_bar(stat="identity", fill="darkblue") +
  xlab('Store') +
  ylab('Total Store Sales')

The Top Stores, as per their Total sales are:

Stores: 20, 4, 14, 13, and 2

NOTE: The ranking is the same as for the average sales.

For the univariate analysis, only one store will be used. One of the top 5 stores will be selected at random.

Selecting a Random Store out of the Top 5:

# Setting a Seed for Replication of Results:
set.seed(70500)
StoreID <- sample(c(20,4,14,13,2), 1)

Random Store Selected:

## [1] 20

Head of the Selected Store for Analysis:

## # A tibble: 6 x 4
## # Groups:   Store [1]
##   Store  Dept Avg_Store_Sales Tot_Sales
##   <int> <int>           <dbl>     <dbl>
## 1    20     1          40545.  5798003.
## 2    20     2          78251. 11189929.
## 3    20     3          15491.  2215209.
## 4    20     4          51456.  7358262.
## 5    20     5          41648.  5955633.
## 6    20     6           8211.  1174137.

Summary Statistics of the Selected Store:

Summary Statistics

| Variable        | N  | Mean        | Std. Dev.   | Min   | Pctl. 25   | Pctl. 50   | Pctl. 75    | Max         |
|-----------------|----|-------------|-------------|-------|------------|------------|-------------|-------------|
| Store           | 78 | 20          | 0           | 20    | 20         | 20         | 20          | 20          |
| Dept            | 78 | 45.949      | 29.864      | 1     | 21.25      | 41.5       | 73.5        | 99          |
| Avg_Store_Sales | 78 | 27071.845   | 33651.788   | 9.667 | 4387.255   | 13941.789  | 38508.85    | 164633.742  |
| Tot_Sales       | 78 | 3864120.275 | 4816272.294 | 29    | 624238.695 | 1977692.54 | 5506765.608 | 23542625.04 |

Department-Wise Total Sales:
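
The department selection below relies on a data frame called Store_Analysis, which appears to be the department-level aggregate of the selected store shown in the head above. A sketch of how it might be built, assuming those column names:

# Average and total weekly sales per department for the selected store
# (a sketch; the report's Store_Analysis object is assumed to look like this)
Store_Analysis <- dataf %>%
  filter(Store == StoreID) %>%
  group_by(Store, Dept) %>%
  summarise(Avg_Store_Sales = mean(Weekly_Sales),
            Tot_Sales = sum(Weekly_Sales))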

Selecting a Random Department of the Top 5 For Analysis:

# Selecting the top 5 departments of the chosen store by total sales
Top_5_Dept <- Store_Analysis %>% top_n(5, Tot_Sales)
# Picking one of those departments at random
DeptID <- sample(Top_5_Dept$Dept, 1)

Random Department Selected:

## [1] 92

New Dataset’s Top Rows:

##         Date Weekly_Sales
## 1 2010-02-05     195223.8
## 2 2010-02-12     170043.5
## 3 2010-02-19     164314.3
## 4 2010-02-26     147699.7
## 5 2010-03-05     169171.2
## 6 2010-03-12     161433.3

New Dataset’s Summary Statistics:

Summary Statistics

| Variable        | N  | Mean        | Std. Dev.   | Min   | Pctl. 25   | Pctl. 50   | Pctl. 75    | Max         |
|-----------------|----|-------------|-------------|-------|------------|------------|-------------|-------------|
| Store           | 78 | 20          | 0           | 20    | 20         | 20         | 20          | 20          |
| Dept            | 78 | 45.949      | 29.864      | 1     | 21.25      | 41.5       | 73.5        | 99          |
| Avg_Store_Sales | 78 | 27071.845   | 33651.788   | 9.667 | 4387.255   | 13941.789  | 38508.85    | 164633.742  |
| Tot_Sales       | 78 | 3864120.275 | 4816272.294 | 29    | 624238.695 | 1977692.54 | 5506765.608 | 23542625.04 |

Plotting - Time Series of the Weekly Sales:

ggplot(Store_dataf, aes(x=Date, y=Weekly_Sales)) +
  geom_line(color="brown") + 
  geom_smooth(method=lm) +
  xlab("Weeks of the Year") +
  ylab("Weekly Sales")

Inferences:

Hints of yearly and monthly seasonality are observed in the Time-Series.

Fitting a Linear Regression Model of Weekly Sales on Date:

# Regressing weekly sales on the date (trend over time)
Model1 = lm(Weekly_Sales ~ Date, data = Store_dataf)
summary(Model1)
## 
## Call:
## lm(formula = Weekly_Sales ~ Date)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45977 -14521    328  11815  50841 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.237e+05  8.152e+04  -2.744  0.00686 ** 
## Date         2.565e+01  5.383e+00   4.764 4.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18600 on 141 degrees of freedom
## Multiple R-squared:  0.1387, Adjusted R-squared:  0.1326 
## F-statistic:  22.7 on 1 and 141 DF,  p-value: 4.659e-06

Section 2 - Time Series Regression (ARIMA Modeling)

Assessing the behavior/data-generating process of the time-series

To get an overview of the data-generating process, we can first look at the correlations of the series with its own lags before going further.

Correlation with Different Lags:

Series Stationarity Test:

ADF TEST:
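
The output below matches tseries::adf.test; a sketch of the call that presumably produced it:

# Augmented Dickey-Fuller test (H0: the series has a unit root, i.e. is non-stationary)
adf.test(Store_dataf$Weekly_Sales)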

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Store_dataf$Weekly_Sales
## Dickey-Fuller = -2.4672, Lag order = 5, p-value = 0.3819
## alternative hypothesis: stationary

As expected, the ADF test indicates that the series is non-stationary.

Since the variance of the series is also non-stationary (visible in the time-series plot), we first stabilise the variance.

Log & Box-Cox Transformations (To Stabilise the Variance)

plot1 <- ggplot(Store_dataf, aes(x=Date, y=log1p(Weekly_Sales))) +
  geom_line(color="brown") + 
  xlab("Weeks") +
  ylab("Log Transformed Weekly Sales") +
  theme_minimal()
# Box-Cox transforming the series (lambda chosen automatically) to stabilise the variance
Store_dataf_box = Store_dataf %>%
  mutate(close_boxcox = forecast::BoxCox(Weekly_Sales, lambda = 'auto'))

plot2 <- ggplot(Store_dataf_box, aes(x=Date, y=close_boxcox)) +
  geom_line(color="brown") + 
  xlab("Weeks") +
  ylab("Box-Cox Transformed Weekly Sales") +
  theme_minimal()

Comparing Both the Plots:

Both transformations result in a very similar plot with stabilised variance. Since the log transformation is the simpler of the two, it is used going forward, as in the plot above.

KPSS Test:
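
The output below matches tseries::kpss.test; a sketch of the call:

# KPSS test (H0: the series is level-stationary)
kpss.test(Store_dataf$Weekly_Sales)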

## 
##  KPSS Test for Level Stationarity
## 
## data:  Store_dataf$Weekly_Sales
## KPSS Level = 1.2585, Truncation lag parameter = 4, p-value = 0.01

As the KPSS test confirms that the mean is not stationary, we will try to make it stationary using differencing.

First Order Differencing:

# First Order Differencing
Store_log_1d <- Store_dataf %>%
                mutate(Sale_log_1d = log1p(Weekly_Sales)-lag(log1p(Weekly_Sales))) 

Plots With vs Without First-Order Differencing:

plot1_1d <- ggplot(Store_log_1d, aes(x=Date, y=Sale_log_1d)) +
  geom_line(color="brown") + 
    xlab("Weeks") +
    ylab("Log of Weekly Sales") +
    theme_minimal() 

#Comparing the Stationary plot with the previous plot
plot1 / plot1_1d 

Inferences:

In comparison, the differenced series now looks fairly stationary.

Testing Stationarity Again:

ADF Test:

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Store_log_1d$Sale_log_1d
## Dickey-Fuller = -6.1492, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary

KPSS Test:

## 
##  KPSS Test for Level Stationarity
## 
## data:  Store_log_1d$Sale_log_1d
## KPSS Level = 0.060025, Truncation lag parameter = 4, p-value = 0.1

Both the ADF and KPSS tests now suggest that the differenced series is stationary. Visually, a yearly seasonality can still be observed in the data.

Analyzing - Time Series by ACF/PACF Plots:
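
The two plots below were presumably generated with the base acf()/pacf() functions on the differenced log series; a sketch (the lag.max value is an assumption):

# ACF and PACF of the first-differenced log sales (na.omit drops the NA from the lag)
par(mfrow = c(1, 2))
acf(na.omit(Store_log_1d$Sale_log_1d), lag.max = 52, main = "ACF")
pacf(na.omit(Store_log_1d$Sale_log_1d), lag.max = 52, main = "PACF")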

ACF PLOT:

PACF PLOT:

Looking at the significant lags, the series appears to follow an autoregressive process of roughly order 4, AR(4). This is analysed further below.

Testing AIC And BIC for different ARIMA Models:
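
The two tables below look like the output of AIC() and BIC() applied to a set of candidate fits on the differenced series; a sketch of how they could be produced:

# Comparing candidate ARMA orders on the first-differenced log series
fit_order <- function(p, q) arima(Store_log_1d$Sale_log_1d, order = c(p, 0, q))
AIC(fit_order(4, 0), fit_order(4, 1), fit_order(4, 2), fit_order(4, 3),
    fit_order(3, 0), fit_order(3, 1), fit_order(3, 2), fit_order(3, 3))
BIC(fit_order(4, 0), fit_order(4, 1), fit_order(4, 2), fit_order(4, 3),
    fit_order(3, 0), fit_order(3, 1), fit_order(3, 2), fit_order(3, 3))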

##                                                     df       AIC
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 0))  6 -284.2964
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 1))  7 -282.3171
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 2))  8 -284.7833
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 3))  9 -286.9875
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 0))  5 -273.6001
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 1))  6 -281.2913
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 2))  7 -286.6870
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 3))  8 -285.0091

Based on the AIC values, the best suggested models are ARIMA(4,0,3) and ARIMA(3,0,2).

##                                                     df       BIC
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 0))  6 -266.5614
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 1))  7 -261.6263
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 2))  8 -261.1367
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 3))  9 -260.3851
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 0))  5 -258.8209
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 1))  6 -263.5564
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 2))  7 -265.9962
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 3))  8 -261.3624

Based on the BIC values, the best suggested models are ARIMA(4,0,0) and ARIMA(3,0,2).

NOTE: These models are fit on the first-differenced log series, so an order of (p,0,q) here corresponds to ARIMA(p,1,q) on the log sales.

Using auto.arima to get the ARIMA values:

model1 <- auto.arima(Store_log_1d$Sale_log_1d,stationary=FALSE,allowdrift=FALSE, seasonal=FALSE,stepwise=FALSE,approximation=FALSE)

Plotting - Residuals for model 1 to check for White Noise:
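
The residual diagnostics and Ljung-Box output below appear to come from forecast::checkresiduals; a sketch:

# Residual time plot, ACF, histogram, and Ljung-Box test for the auto.arima fit
checkresiduals(model1)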

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,0,2) with zero mean
## Q* = 2.0442, df = 5, p-value = 0.843
## 
## Model df: 5.   Total lags used: 10

Inferences:

Auto Arima also suggests using ARIMA(3,0,2).

NOTE: As above, the model is fit on the first-differenced log series.

Also fitting ARIMA(4,0,0), the other candidate, for comparison.

# Checking residuals for 4,0,0
model2 <- arima(Store_log_1d$Sale_log_1d,order=c(4,0,0))

Plotting - Residuals for Model 2 to check for White Noise.
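
Presumably the same diagnostic call, applied to the second model:

# Residual diagnostics for the ARIMA(4,0,0) fit
checkresiduals(model2)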

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(4,0,0) with non-zero mean
## Q* = 5.1099, df = 5, p-value = 0.4026
## 
## Model df: 5.   Total lags used: 10

Analysing the residuals with the Ljung-Box test, the ARIMA(3,0,2) model shows the higher p-value, i.e. weaker evidence of residual autocorrelation, so it is the clear winner.

ACF and PACF Plots for the Residuals:

best_model <- model1
par(mfrow=c(1,2))
resid = best_model$residuals
acf(resid,lag.max=10)
pacf(resid,lag.max=10)

The ACF and PACF plots of the residuals do not show any significant correlation, which is consistent with white noise.

Conducting the Box-Ljung Test to verify that there is no autocorrelation.

Ljung-Box test

Lag = 1:

Box.test(resid,type='Ljung-Box',lag=1)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 0.0039211, df = 1, p-value = 0.9501

Lag = 5:

## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 0.97357, df = 5, p-value = 0.9647

Lag = 10:

## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 2.0442, df = 10, p-value = 0.996

Lag = 20:

## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 9.7661, df = 20, p-value = 0.9722

Comparing the Fitted Values to the Actual Sales Data:

# Subtracting the residuals from the series to obtain the fitted values
pred = Store_log_1d$Sale_log_1d - resid

ggplot()+
  geom_line(aes(Store_log_1d$Date,Store_log_1d$Sale_log_1d))+
  geom_line(aes(Store_log_1d$Date,pred),color='brown',alpha=0.6)+
  xlab("Weeks")+
  ylab("Weekly Sales (Log Transformed)") +
  theme_minimal() 

Inferences:

The fitted values align closely with the actual (transformed) sales.

In-Sample Root Mean Square Error for the Model:

# In-sample RMSE of the model, computed after back-transforming the differenced log values
RMSE = sqrt(mean((expm1(pred) - expm1(Store_log_1d$Sale_log_1d))^2,na.rm=T))
RMSE
## [1] 0.07927394

Forecasting

Forecast for 5 Time-Periods (5 Weeks):

best_model %>%
  forecast(h=5) %>% 
  autoplot()

As a 5-period horizon is quite short, the forecast is also extended to 20 time periods, i.e. 20 weeks ahead.

Forecast for 20 Time-Periods:
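
A sketch of the 20-step forecast, mirroring the 5-step call above:

# Forecasting 20 weeks ahead with the selected model
best_model %>%
  forecast(h = 20) %>%
  autoplot()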

The forecast follows the general pattern of the actual data, but it does not capture the seasonality well and quickly decays towards zero on the differenced log scale.

Section - 3: Time Series Regression (Facebook Prophet Model)

Loading the Prophet library.

library(prophet)

Preparing the Data For Prophet

# Prophet requires the date column to be named "ds" and the target variable to be named "y"
prophet_data = Store_dataf %>% 
  rename(ds = Date, 
         y = Weekly_Sales)

# Dividing the Dataset into Train and Test Dataframes.
train = prophet_data %>% 
  filter(ds<ymd("2012-01-01"))
test = prophet_data %>%
  filter(ds>=ymd("2012-01-01"))

Building the Prophet Model:

model = prophet(train, yearly.seasonality=TRUE)

Prophet Forecast - Plot:

future = make_future_dataframe(model,periods = 300)
forecast = predict(model,future)

plot(model,forecast)+
  ylab("Weekly Sales (In $)")+xlab("Date")+theme_minimal()

Using dyplot - A Closer Look:

dyplot.prophet(model,forecast)

Components of the Prophet Model's Forecast:

  1. The first panel shows the overall trend of the weekly sales.
  2. The second panel shows the seasonal pattern within each year.

Plotting the Components of the Prophet Model:

prophet_plot_components(model,forecast)

Inferences:

  1. An overall increasing trend is seen.
  2. No specific Weekly Trend is visible.
  3. Monthly Trends are slightly visible.
  4. A yearly pattern of higher sales at the start and end of the year is visible.

Investigating Changepoints:

plot(model,forecast)+add_changepoints_to_plot(model)+theme_minimal()+xlab("Dates")+ylab("Weekly Sales")

Only one changepoint is detected in the roughly two years of training data, which seems reasonable, as the trend shows few abrupt shifts.

Forecasting Further Months (Split - Train Data):

The forecast seems to fit the actual data points reasonably well; it still needs to be checked against its scaled (saturated) version, which is explored next.

Saturating Minimum/Maximum Points - To Incorporate Limits (If Necessary)

Forecasting Two years into the Future:

two_yr_future = make_future_dataframe(model,periods = 730)
two_yr_forecast = predict(model,two_yr_future)
plot(model,two_yr_forecast)+theme_minimal()+xlab("Dates")+ylab("Weekly Sales")

Even when forecasting two years into the future, the predicted weekly sales do not become extremely high and few outliers appear. This suggests that a ceiling or floor is not required.

Checking for Seasonality - Additive + Multiplicative:

Additive Seasonality:

Additive Model’s Components:

Multiplicative Seasonality:

Multiplicative Model’s Components:

Assessing Holiday Effect on Weekly Sales:

Some effect is expected, since holidays influence shopping behaviour; for a retailer like Walmart, sales around holidays may run higher or lower than usual.

NOTE:

The data contains one observation per week (recorded on Fridays), so investigating holiday effects on weekly sales gives only a rough idea rather than a complete picture.

model = prophet(train,fit=FALSE,yearly.seasonality = TRUE)
model = add_country_holidays(model,country_name = 'US')
model = fit.prophet(model,train)

Components of the Forecast with Holidays:
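
The code for this component plot is not shown; a sketch of how it could be produced with the refit holiday model (the 300-period horizon mirrors the earlier forecast and is an assumption):

# Forecasting with the holiday-aware model and plotting its components
future_hol = make_future_dataframe(model, periods = 300)
forecast_hol = predict(model, future_hol)
prophet_plot_components(model, forecast_hol)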

No significant effect is visible for most holidays, with the exception of the year-end holidays around Christmas and New Year.

A Clearer Look at the Impact of Holidays on Sales:

Inferences:

  1. As per our expectations, New Year has a big effect on sales, with sales dropping drastically.
  2. Christmas and Veterans Day, however, appear to have a positive impact on the weekly sales.

Model Performance (RMSE , MAE, MAPE):
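
The code behind these scores is not shown; a sketch of how RMSE, MAE, and MAPE could be computed by joining the Prophet forecast back to the actual weekly sales (eval_df is an assumed helper name):

# Joining predicted and actual values on the date, then computing the error metrics
eval_df = forecast %>%
  as_tibble() %>%
  mutate(ds = as.Date(ds)) %>%
  inner_join(prophet_data, by = "ds")

print(paste("RMSE:", round(sqrt(mean((eval_df$y - eval_df$yhat)^2)), 2)))
print(paste("MAE:", round(mean(abs(eval_df$y - eval_df$yhat)), 2)))
print(paste("MAPE:", round(mean(abs((eval_df$y - eval_df$yhat) / eval_df$y)), 2)))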

RMSE Score:

## [1] "RMSE: 19430.44"

MAE Score:

## [1] "MAE: 15724.02"

MAPE Score:

## [1] "MAPE: 0.09"

The MAPE is around 9%, which seems reasonable for weekly sales data at the store level.

Cross-Validation:

Cross-Validation Forecasting Model:

Forecasting for 10 Periods:

df.cv <- cross_validation(model, initial = 400, horizon = 10, period = 10, units = 'days')

Forecast with Cross-Validation:

Plotting For the Horizon:
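
A sketch of the horizon plot, assuming prophet's built-in helper:

# Cross-validated RMSE as a function of the forecast horizon
plot_cross_validation_metric(df.cv, metric = 'rmse')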

No clear change in Variance is observed.

Comparing Additive and Multiplicative seasonality:

Building Models for Additive Seasonality:

mod1 = prophet(train, yearly.seasonality = TRUE, seasonality.mode='additive')

forecast1 = predict(mod1)

df_cv1 <- cross_validation(mod1, initial = 400, horizon=10, period=10, units = 'days')

metrics1 = performance_metrics(df_cv1) %>% 
  mutate(model = 'mod1')

Building Models for Multiplicative Seasonality

mod2 = prophet(train, yearly.seasonality = TRUE, seasonality.mode='multiplicative')


forecast2 = predict(mod2)
df_cv2 <- cross_validation(mod2, initial = 400, horizon = 10, period=10, units = 'days')

metrics2 = performance_metrics(df_cv2) %>% 
  mutate(model = "mod2")

Visualizing Both:
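
A sketch of a side-by-side comparison of the two cross-validated metric tables:

# Comparing cross-validated RMSE by horizon for the additive (mod1)
# and multiplicative (mod2) seasonality models
bind_rows(metrics1, metrics2) %>%
  ggplot(aes(x = horizon, y = rmse, color = model)) +
  geom_line() +
  xlab("Horizon (days)") +
  ylab("RMSE") +
  theme_minimal()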

Both seasonality modes give a similar RMSE, so the seasonality can be treated as additive, since the multiplicative mode adds nothing of significance to the model.

Section 4: Model Comparison - Best ARIMA Model V/S Prophet

ARIMA Model:

best_model_arima <- auto.arima(train$y,stationary=FALSE,allowdrift=FALSE,
                 seasonal=FALSE,stepwise=FALSE,approximation=FALSE)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,2)
## Q* = 0.80655, df = 5, p-value = 0.9766
## 
## Model df: 5.   Total lags used: 10

Consistent with the full series, auto.arima again suggests ARIMA(3,1,2) on the training data (the differencing is now handled inside the model), in line with the earlier ARIMA(3,0,2) on the differenced series.

NOTE: This model is on the Train Dataset.

Out of Sample RMSE for ARIMA Model:

# Forecasting over the 43-week test horizon
test_pred = predict(best_model_arima, 43)
test_pred_arima = test_pred$pred
error_arima = test$y - test_pred_arima
out_rmse_arima=sqrt(mean(error_arima^2,na.rm=T))

out_rmse_arima
## [1] 24962.41

Out of Sample RMSE for Prophet Model:

best_mod_prophet = prophet(train, weekly.seasonality=FALSE, daily.seasonality = FALSE, 
                yearly.seasonality=TRUE)
future = make_future_dataframe(best_mod_prophet, periods = 26)
forecast = predict(best_mod_prophet,future)

forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds>=ymd("2012-01-01"))

out_rmse_prophet = sqrt(mean((test$y - forecast_metric_data$yhat)^2))

out_rmse_prophet
## [1] 24571.93

RMSE - ARIMA V/S Prophet:

tibble(
  `ARIMA` = round(out_rmse_arima,2),
  `Prophet` = round(out_rmse_prophet,2)
)
## # A tibble: 1 x 2
##    ARIMA Prophet
##    <dbl>   <dbl>
## 1 24962.  24572.

Although the RMSEs are close (24,962 vs 24,572), the Prophet model comes out slightly ahead of the ARIMA model on out-of-sample RMSE.

Looking at both models more generally, Prophet handles the seasonality better than the ARIMA model, but it is still advisable to build an ARIMA model as well: decomposing the time series in that framework gives a more in-depth understanding of the data-generating process and of the autoregressive characteristics of the series.

——————————— END OF THE PROJECT ———————————

—————————————————— THANK YOU ——————————————————