Introduction
A Kaggle competition dataset containing Walmart's weekly sales data was taken for analysis. It consists of historical sales data for 45 Walmart stores across the country, spanning almost three years, from 2010 to 2012. Each record contains the last date of the sales week (a Friday), the store number, the department number, the weekly sales, and whether the week is a holiday week or not.
Goal
As the dataset consists of multiple stores, each with multiple departments, our goal is to choose one store and one of its departments at random, and to analyze that series.
Note
While working on this retail dataset, a few things can be expected based on intuition. Seasonality is likely, driven by the yearly holidays. From the basic EDA, the dataset appears to contain some erroneous values, which must be cleaned before starting the time-series forecasting. The forecasting itself will be done on a single store and one of its top departments, chosen at random.
Importing useful libraries for EDA and forecasting:
1. dplyr
2. ggplot2
3. forecast
4. vtable
5. tidyverse
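A setup chunk loading these packages might look like the following (a sketch; the original chunk isn't shown):
# Loading the libraries used throughout the analysis
library(dplyr)      # data manipulation
library(ggplot2)    # plotting
library(forecast)   # time-series modeling and forecasting
library(vtable)     # summary-statistics tables
library(tidyverse)  # additional tidy tooling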
Reading the Input File
# Changing the working directory to the project data folder
setwd("C:/Users/MauP/OneDrive/Desktop/Spring Semester/Flex 1/Forecasting/Project Data - Walmart Sales")
# Reading the CSV file into a data frame
dataf = read.csv("Weekly_Sales.csv")
Let us see the number of records of the dataset:
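The counting chunk isn't shown; presumably a one-liner such as:
# Number of records in the dataset
nrow(dataf)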
## [1] 421570
Let us now see how the data looks, by viewing the first few rows.
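A head() call would produce the rows shown below (note that head() returns six rows by default):
# First few rows of the data
head(dataf)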
## Store Dept Date Weekly_Sales IsHoliday
## 1 1 1 2010-02-05 24924.50 FALSE
## 2 1 1 2010-02-12 46039.49 TRUE
## 3 1 1 2010-02-19 41595.55 FALSE
## 4 1 1 2010-02-26 19403.54 FALSE
## 5 1 1 2010-03-05 21827.90 FALSE
## 6 1 1 2010-03-12 21043.39 FALSE
Looking for the Structure of the Table
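Presumably via str():
# Structure of the data frame
str(dataf)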
## 'data.frame': 421570 obs. of 5 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Dept : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr "2010-02-05" "2010-02-12" "2010-02-19" "2010-02-26" ...
## $ Weekly_Sales: num 24925 46039 41596 19404 21828 ...
## $ IsHoliday : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
Converting the Date’s DataType to ‘Date’.
# Converting the DataType of the Date Column
dataf$Date <- as.Date(dataf$Date)
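The summary table below matches the output of vtable's st() (sumtable), which was presumably used here:
# Summary-statistics table for the full dataset (vtable)
st(dataf)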
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 50 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|---|
| Store | 421570 | 22.201 | 12.785 | 1 | 11 | 22 | 33 | 45 |
| Dept | 421570 | 44.26 | 30.492 | 1 | 18 | 37 | 74 | 99 |
| Weekly_Sales | 421570 | 15981.258 | 22711.184 | -4988.94 | 2079.65 | 7612.03 | 20205.853 | 693099.36 |
| IsHoliday | 421570 | | | | | | | |
| … No | 391909 | 93% | | | | | | |
| … Yes | 29661 | 7% | | | | | | |
Luckily, there aren't any null values in the dataset, but on a closer look, negative values are visible in Weekly_Sales.
Finding the Number of Negative Values.
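A simple count along these lines yields the figure below:
# Counting rows with negative weekly sales
sum(dataf$Weekly_Sales < 0)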
## [1] 1285
There are 1285 rows with negative values out of 421,570 rows, which is just about 0.3% of the entire dataset, so these rows can be dropped.
Removing the Negative Values.
dataf <- dataf[dataf$Weekly_Sales > 0, ]
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 50 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|---|
| Store | 420212 | 22.196 | 12.787 | 1 | 11 | 22 | 33 | 45 |
| Dept | 420212 | 44.241 | 30.509 | 1 | 18 | 37 | 74 | 99 |
| Weekly_Sales | 420212 | 16033.115 | 22729.492 | 0.01 | 2120.13 | 7661.7 | 20271.265 | 693099.36 |
| IsHoliday | 420212 | | | | | | | |
| … No | 390652 | 93% | | | | | | |
| … Yes | 29560 | 7% | | | | | | |
The dataset is now clean.
Let us now look at the effect of Holidays on the Store Sales
#Exploring the effect of Holiday
ggplot(dataf, aes(x = Store, y = Weekly_Sales))+
geom_jitter(aes(colour = IsHoliday)) +
xlab('Store') +
ylab('Sales Per Week') +
ggtitle("Store Weekly Sales - Comparing the Effect of Holiday") +
theme(
legend.position = c(.95, .95),
legend.justification = c("right", "top"),
legend.box.just = "right",
legend.margin = margin(6, 6, 6, 6)
)
Inferences from the plot:
1. As expected, whether a week is a holiday week affects the weekly sales across all the stores.
2. With all the stores plotted, some stores, roughly those numbered 10-20, appear to have higher weekly sales than the others.
Let us now look at the effect of Holidays on the Department Sales
ggplot(dataf, aes(x=Dept, y=Weekly_Sales)) +
geom_jitter(aes(colour = IsHoliday)) +
xlab('Department Number') +
ylab('Sales Per Week') +
ggtitle("Department Sales - Comparing the Effect of Holiday ") +
theme(
legend.position = c(.95, .95),
legend.justification = c("right", "top"),
legend.box.just = "right",
legend.margin = margin(6, 6, 6, 6)
)
Inferences from the plot:
1. A drastic difference in sales across different departments is visible.
2. Departments like 12 and 70 show huge spikes in sales; department 12 appears unaffected by holidays, while department 70 shows a strong holiday effect.
Using Boxplots to look for any potential Outliers.
boxplot(Weekly_Sales ~ IsHoliday, data = dataf,
        main = 'Weekly Sales Comparison - Holidays vs Normal Days',
        xlab = 'Holiday, True or False',
        ylab = 'Weekly Sales',
        frame = TRUE,
        horizontal = FALSE,
        col = "green")
Both boxplots show outliers, with no big difference in sales between holiday and normal weeks.
Let us plot the Average Store Sales, in order to Select the Top 5
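The aggregated data frame dum_dataf used in the next chunk is not constructed in this excerpt; a plausible sketch, assuming per-store averages and totals:
# Assumed construction of dum_dataf: average and total weekly sales per store
dum_dataf <- dataf %>%
  group_by(Store) %>%
  summarise(Avg_Store_Sales = mean(Weekly_Sales),
            Tot_Sales = sum(Weekly_Sales))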
#Plotting the Stores with the highest Average Store Sales, to select the one of the top 5 stores
ggplot(data=dum_dataf, aes(x=reorder(factor(Store), -Avg_Store_Sales), y=Avg_Store_Sales)) +
geom_bar(stat="identity", fill="brown") +
xlab('Store') +
ylab('Average Store Sales')
The top stores by average sales are: 20, 4, 14, 13, 2
Also Checking for Total Sales
dum_dataf <- dum_dataf %>% arrange(desc(Tot_Sales))
ggplot(data=dum_dataf, aes(x=reorder(factor(Store), -Tot_Sales), y=Tot_Sales)) +
geom_bar(stat="identity", fill="darkgreen") +
xlab('Store') +
ylab('Total Store Sales')
Even by total sales, the top stores remain: 20, 4, 14, 13, 2
Selecting a Random Store out of the top 5
set.seed(70500)
StoreID <- sample(c(20, 4, 14, 13, 2), 1)
StoreID
## [1] 20
# Choosing a random store for analysis.
Store_Analysis <- dataf %>%
  filter(Store == StoreID) %>%
  group_by(Store, Dept) %>%
  summarise(Avg_Store_Sales = mean(Weekly_Sales),
            Tot_Sales = sum(Weekly_Sales))
# Printing the head of Store_Analysis
head(Store_Analysis)
## # A tibble: 6 x 4
## # Groups: Store [1]
## Store Dept Avg_Store_Sales Tot_Sales
## <int> <int> <dbl> <dbl>
## 1 20 1 40545. 5798003.
## 2 20 2 78251. 11189929.
## 3 20 3 15491. 2215209.
## 4 20 4 51456. 7358262.
## 5 20 5 41648. 5955633.
## 6 20 6 8211. 1174137.
Summary Statistics of the Selected Store
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 50 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|---|
| Store | 78 | 20 | 0 | 20 | 20 | 20 | 20 | 20 |
| Dept | 78 | 45.949 | 29.864 | 1 | 21.25 | 41.5 | 73.5 | 99 |
| Avg_Store_Sales | 78 | 27071.845 | 33651.788 | 9.667 | 4387.255 | 13941.789 | 38508.85 | 164633.742 |
| Tot_Sales | 78 | 3864120.275 | 4816272.294 | 29 | 624238.695 | 1977692.54 | 5506765.608 | 23542625.04 |
Department-Wise Total Sales.
Selecting a Random Department of the Top 5 For Analysis.
Top_5_Dept <- Store_Analysis %>% top_n(5)
DeptID <- sample(Top_5_Dept$Dept, 1)
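The department-level data frame Store_dataf used from here on is not shown being built; a sketch, assuming it keeps the date and sales columns for the chosen store and department:
# Assumed construction of Store_dataf for the selected store and department
Store_dataf <- dataf %>%
  filter(Store == StoreID, Dept == DeptID) %>%
  select(Date, Weekly_Sales)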
Checking the New Dataset’s Top Rows.
## Date Weekly_Sales
## 1 2010-02-05 195223.8
## 2 2010-02-12 170043.5
## 3 2010-02-19 164314.3
## 4 2010-02-26 147699.7
## 5 2010-03-05 169171.2
## 6 2010-03-12 161433.3
Plotting the Time Series of the Weekly Sales
ggplot(Store_dataf, aes(x=Date, y=Weekly_Sales)) +
geom_line(color="brown") +
geom_smooth(method=lm) +
xlab("Weeks of the Year") +
ylab("Weekly Sales")
Fitting a Linear Regression of Weekly Sales on Date
# Date and Weekly_Sales from Store_dataf are assumed to be in scope here,
# matching the printed Call below
Model1 = lm(Weekly_Sales ~ Date)
summary(Model1)
##
## Call:
## lm(formula = Weekly_Sales ~ Date)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45977 -14521 328 11815 50841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.237e+05 8.152e+04 -2.744 0.00686 **
## Date 2.565e+01 5.383e+00 4.764 4.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18600 on 141 degrees of freedom
## Multiple R-squared: 0.1387, Adjusted R-squared: 0.1326
## F-statistic: 22.7 on 1 and 141 DF, p-value: 4.659e-06
Assessing the behavior/data-generating process of the time-series
Starting by checking whether the Series is Stationary:
ADF TEST:
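The output format below matches tseries::adf.test; the test was presumably run as:
# Augmented Dickey-Fuller test for stationarity (tseries package, assumed)
library(tseries)
adf.test(Store_dataf$Weekly_Sales)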
##
## Augmented Dickey-Fuller Test
##
## data: Store_dataf$Weekly_Sales
## Dickey-Fuller = -2.4672, Lag order = 5, p-value = 0.3819
## alternative hypothesis: stationary
As expected, the data seems to be a non-stationary series.
As the variance is non-stationary, we must first make the variance stationary.
Log-transforming the series to stabilize the variance:
plot1 <- ggplot(Store_dataf, aes(x=Date, y=log1p(Weekly_Sales))) +
  geom_line(color="brown") +
xlab("Weeks") +
ylab("Log Transformed Weekly Sales") +
theme_minimal()
# Now let us apply Box-Cox Transformation to it and check.
# Box Cox Transforming it to make it stationary
Store_dataf_box = Store_dataf %>%
  mutate(close_boxcox = forecast::BoxCox(Weekly_Sales, lambda = 'auto'))
plot2 <- ggplot(Store_dataf_box, aes(x=Date, y=close_boxcox)) +
  geom_line(color="brown") +
  xlab("Weeks") +
  ylab("Box-Cox Transformed Weekly Sales") +
  theme_minimal()
Comparing both the plots.
Both transformations result in a similar graph with stationary variance. Given its lower complexity, we proceed with the log transformation, as in the plot above.
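The side-by-side comparison presumably used the patchwork package's stacking operator, which also appears later in the report:
# Stacking the two transformation plots for comparison (assumes patchwork)
library(patchwork)
plot1 / plot2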
KPSS Test.
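The output format below matches tseries::kpss.test; presumably:
# KPSS test for level stationarity (tseries package, assumed)
kpss.test(Store_dataf$Weekly_Sales)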
##
## KPSS Test for Level Stationarity
##
## data: Store_dataf$Weekly_Sales
## KPSS Level = 1.2585, Truncation lag parameter = 4, p-value = 0.01
As the KPSS test indicates the mean isn't stationary either, we will try to make it stationary using differencing.
First Order Differencing
# First Order Differencing
Store_log_1d <- Store_dataf %>%
  mutate(Sale_log_1d = log1p(Weekly_Sales) - lag(log1p(Weekly_Sales)))
Let us now plot the new stationary series (after first-order differencing) in comparison to the previous plot.
plot1_1d <- ggplot(Store_log_1d, aes(x=Date, y=Sale_log_1d)) +
  geom_line(color="brown") +
  xlab("Weeks") +
  ylab("Differenced Log of Weekly Sales") +
  theme_minimal()
# Comparing the stationary plot with the previous plot
plot1 / plot1_1d
The plot looks pretty stationary now, in comparison.
# Dropping the first row (NA created by differencing) before re-running the stationarity tests
Store_log_1d <- Store_log_1d %>% filter(Date > '2010-02-05')
ADF Test
##
## Augmented Dickey-Fuller Test
##
## data: Store_log_1d$Sale_log_1d
## Dickey-Fuller = -6.1492, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
KPSS Test
##
## KPSS Test for Level Stationarity
##
## data: Store_log_1d$Sale_log_1d
## KPSS Level = 0.060025, Truncation lag parameter = 4, p-value = 0.1
The ADF and KPSS tests suggest that the data is now stationary.
Visually, a yearly seasonality has been observed in the data.
Analyzing the time series through ACF/PACF Plots
ACF PLOT
PACF PLOT
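The plots themselves are not reproduced in this excerpt; calls along these lines would generate them (a sketch):
# ACF and PACF of the differenced log-sales series
acf(Store_log_1d$Sale_log_1d)
pacf(Store_log_1d$Sale_log_1d)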
Looking at the lags, the series appears to follow an autoregressive process of order 4, i.e., AR(4). Let us analyze further.
SECTION - 2
Testing AIC And BIC for different ARIMA Models.
AIC(
arima(Store_log_1d$Sale_log_1d,order=c(4,0,0)),
arima(Store_log_1d$Sale_log_1d,order=c(4,0,1)),
arima(Store_log_1d$Sale_log_1d,order=c(4,0,2)),
arima(Store_log_1d$Sale_log_1d,order=c(4,0,3)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,0)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,1)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,2)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,3))
)
## df AIC
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 0)) 6 -284.2964
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 1)) 7 -282.3171
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 2)) 8 -284.7833
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 3)) 9 -286.9875
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 0)) 5 -273.6001
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 1)) 6 -281.2913
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 2)) 7 -286.6870
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 3)) 8 -285.0091
Based on the AIC values, the best suggested models are ARIMA(4,0,3) and ARIMA(3,0,2).
BIC(
arima(Store_log_1d$Sale_log_1d,order=c(4,0,0)),
arima(Store_log_1d$Sale_log_1d,order=c(4,0,1)),
arima(Store_log_1d$Sale_log_1d,order=c(4,0,2)),
arima(Store_log_1d$Sale_log_1d,order=c(4,0,3)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,0)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,1)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,2)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,3))
)
## df BIC
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 0)) 6 -266.5614
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 1)) 7 -261.6263
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 2)) 8 -261.1367
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 3)) 9 -260.3851
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 0)) 5 -258.8209
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 1)) 6 -263.5564
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 2)) 7 -265.9962
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 3)) 8 -261.3624
Based on the BIC values, the best suggested models are ARIMA(4,0,0) and ARIMA(3,0,2).
Also using auto.arima to get the ARIMA values.
model1 <- auto.arima(Store_log_1d$Sale_log_1d, stationary=FALSE, allowdrift=FALSE,
                     seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
Plotting the residuals for model 1 to see whether they are white noise.
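The Ljung-Box summary below matches the output format of forecast::checkresiduals, so the residual diagnostics were presumably produced with:
# Residual diagnostics: plots the residuals and prints a Ljung-Box test
# (the same call with model2 produces the second set of diagnostics below)
checkresiduals(model1)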
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,2) with zero mean
## Q* = 2.0442, df = 5, p-value = 0.843
##
## Model df: 5. Total lags used: 10
auto.arima also suggests ARIMA(3,0,2).
# Checking residuals for ARIMA(4,0,0)
model2 <- arima(Store_log_1d$Sale_log_1d, order=c(4,0,0))
Plotting the residuals for model 2 to see whether they are white noise.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(4,0,0) with non-zero mean
## Q* = 5.1099, df = 5, p-value = 0.4026
##
## Model df: 5. Total lags used: 10
Analyzing the residuals using the Ljung-Box test, the ARIMA(3,0,2) model clearly shows the better characteristics, with almost no autocorrelation.
# ACF and PACF Plots for the Residuals.
best_model <- model1
par(mfrow=c(1,2))
resid = best_model$residuals
acf(resid, lag.max=10)
pacf(resid, lag.max=10)
The ACF and PACF plots of the residuals do not show any significant correlation and display the characteristics of white noise.
Conducting the Box-Ljung Test to verify that there is no autocorrelation.
Ljung-Box test
Lag = 1
Box.test(resid,type='Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.0039211, df = 1, p-value = 0.9501
Lag = 5
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.97357, df = 5, p-value = 0.9647
Lag = 10
##
## Box-Ljung test
##
## data: resid
## X-squared = 2.0442, df = 10, p-value = 0.996
Lag = 20
##
## Box-Ljung test
##
## data: resid
## X-squared = 9.7661, df = 20, p-value = 0.9722
Plotting the Prediction to the Actual Sales Data.
# Computing the in-sample predictions: fitted values are the observed series minus the residuals
pred = Store_log_1d$Sale_log_1d - resid
ggplot()+
geom_line(aes(Store_log_1d$Date,Store_log_1d$Sale_log_1d))+
geom_line(aes(Store_log_1d$Date,pred),color='brown',alpha=0.6)+
xlab("Weeks")+
ylab("Weekly Sales (Log Transformed)") +
theme_minimal()
# The in-sample RMSE of the model based on the residuals.
RMSE = sqrt(mean((expm1(pred) - expm1(Store_log_1d$Sale_log_1d))^2, na.rm=T))
RMSE
## [1] 0.08233186
Forecast for 5 Time-Periods.
best_model %>%
  forecast(h=5) %>%
autoplot()
As a 5-period forecast horizon is quite short, we also forecast 20 time periods, i.e., 20 weeks ahead.
Forecast for 20 Time-Periods.
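The corresponding chunk isn't shown; a sketch mirroring the 5-period call above:
# Forecasting 20 weeks ahead with the chosen model
best_model %>%
  forecast(h=20) %>%
  autoplot()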
The forecast broadly follows the actual data, and the pattern is fairly consistent, although it does not account for the seasonality that well. It also gradually decays toward a log-sales difference of 0, as forecasts from a non-seasonal ARIMA converge to the series mean.