Introduction

A Kaggle competition dataset consisting of Walmart's weekly sales data was chosen for analysis. It contains historical sales data for 45 Walmart stores across the country, with dates spanning almost three years, from 2010 to 2012. Each record gives the last date of the sales week (a Friday), the Store Number, the Department Number, and whether the week is a holiday week.

Goal

As the dataset consists of multiple stores, each with multiple departments, our goal is to choose one store and one of its departments at random, and to analyse it.

Note

While working on this retail dataset, a few things can be expected based on intuition. Seasonality is likely, because of the yearly holidays and the like. From the basic EDA, the dataset appears to contain some erroneous values, which must be cleaned before starting the time-series forecasting. The forecasting itself will be done on a single store and one of its top departments, chosen at random.

Importing useful libraries for the EDA, loaded in the chunk after this list:
1. dplyr
2. ggplot2
3. forecast
4. vtable
5. tidyverse
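
A minimal loading chunk is sketched below; tseries and patchwork are assumptions on top of the list above (tseries for the adf.test()/kpss.test() output shown later, patchwork for the plot1 / plot1_1d composition).

#Loading the Libraries
library(dplyr)      # data wrangling
library(ggplot2)    # plotting
library(forecast)   # BoxCox(), auto.arima(), forecast()
library(vtable)     # summary statistics tables
library(tidyverse)
library(tseries)    # adf.test(), kpss.test() -- assumed
library(patchwork)  # stacking ggplots with `/` -- assumed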

Reading the Input File

#Changing Working Directory to the required output
setwd("C:/Users/MauP/OneDrive/Desktop/Spring Semester/Flex 1/Forecasting/Project Data - Walmart Sales")

#Reading in the CSV File into the DataFrame
dataf <- read.csv("Weekly_Sales.csv")

Let us see the number of records of the dataset:
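
#Record count (call assumed)
nrow(dataf)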

## [1] 421570

Let us now see what the data looks like by viewing the first few rows.
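
#First rows of the data (call assumed)
head(dataf)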

##   Store Dept       Date Weekly_Sales IsHoliday
## 1     1    1 2010-02-05     24924.50     FALSE
## 2     1    1 2010-02-12     46039.49      TRUE
## 3     1    1 2010-02-19     41595.55     FALSE
## 4     1    1 2010-02-26     19403.54     FALSE
## 5     1    1 2010-03-05     21827.90     FALSE
## 6     1    1 2010-03-12     21043.39     FALSE

Looking at the Structure of the Table
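
#Structure of the data frame (call assumed)
str(dataf)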

## 'data.frame':    421570 obs. of  5 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Dept        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : chr  "2010-02-05" "2010-02-12" "2010-02-19" "2010-02-26" ...
##  $ Weekly_Sales: num  24925 46039 41596 19404 21828 ...
##  $ IsHoliday   : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...

Converting the Date’s DataType to ‘Date’.

#Converting the DataType of the Date Column 
dataf$Date <- as.Date(dataf$Date)
Let us see the Summary Statistics of the table.
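#Summary table via vtable (exact call assumed)
st(dataf)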
Summary Statistics

Variable      N       Mean       Std. Dev.  Min       Pctl. 25  Pctl. 50  Pctl. 75   Max
Store         421570  22.201     12.785     1         11        22        33         45
Dept          421570  44.26      30.492     1         18        37        74         99
Weekly_Sales  421570  15981.258  22711.184  -4988.94  2079.65   7612.03   20205.853  693099.36
IsHoliday     421570
… No          391909  93%
… Yes         29661   7%
Luckily, there aren't any null values in the dataset, but on a closer look, negative values are visible in the Weekly Sales (note the minimum of -4988.94 above).
Finding the Number of Negative Values.
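#Counting the negative Weekly_Sales values (call assumed)
sum(dataf$Weekly_Sales < 0)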
## [1] 1285

There are 1285 rows with negative values out of 421570 rows, which is just about 0.3% of the entire dataset, and hence they can be dropped.

Removing the Negative Values.
#Keeping only strictly positive sales (this also drops any zero-sales rows)
dataf <- dataf[dataf$Weekly_Sales>0,]
Let us see the Summary Statistics of the table again.
Summary Statistics

Variable      N       Mean       Std. Dev.  Min   Pctl. 25  Pctl. 50  Pctl. 75   Max
Store         420212  22.196     12.787     1     11        22        33         45
Dept          420212  44.241     30.509     1     18        37        74         99
Weekly_Sales  420212  16033.115  22729.492  0.01  2120.13   7661.7    20271.265  693099.36
IsHoliday     420212
… No          390652  93%
… Yes         29560   7%

The dataset is cleaned now.

Let us now look at the effect of Holidays on the Store Sales

#Exploring the effect of Holiday
ggplot(dataf, aes(x = Store, y = Weekly_Sales))+ 
  geom_jitter(aes(colour = IsHoliday)) + 
  xlab('Store') +
  ylab('Sales Per Week') +
  ggtitle("Store Weekly Sales - Comparing the Effect of Holiday") + 
  theme(
    legend.position = c(.95, .95),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6)
  )

Inferences from the plot:

1. As expected, whether a week is a holiday week affects the weekly sales across all the stores.
2. With all the stores plotted, some of the stores numbered roughly 10-20 seem to have higher weekly sales than the others.

Let us now look at the effect of Holidays on the Department Sales

ggplot(dataf, aes(x=Dept, y=Weekly_Sales)) + 
  geom_jitter(aes(colour = IsHoliday)) + 
  xlab('Department Number') +
  ylab('Sales Per Week') +
  ggtitle("Department Sales - Comparing the Effect of Holiday") +
  theme(
    legend.position = c(.95, .95),
    legend.justification = c("right", "top"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6)
  )

Inferences from the plot:

1. A drastic difference in sales across the different departments is visible.
2. Departments like 12 and 70 show huge spikes in sales; department 12 seems largely unaffected by holidays, whereas department 70 shows a strong holiday impact.

Using Boxplots to look for any potential Outliers.

boxplot(Weekly_Sales ~ IsHoliday, data = dataf,
        main='Weekly Sales Comparison - Holidays vs Normal Days', 
        xlab = 'Holiday, True or False',
        ylab='Weekly Sales',
        frame = TRUE,
        horizontal = FALSE,
        col = "green")   # base boxplot takes 'col', not 'color'

Both boxplots have outliers, and there is no big difference in sales between holiday and normal weeks.
Let us plot the Average Store Sales, in order to select one of the Top 5 stores (the aggregation is sketched below).
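The per-store aggregate dum_dataf is not built in any chunk shown above; a minimal construction consistent with the plots below would be:

#Aggregating average and total sales per store (construction assumed)
dum_dataf <- dataf %>%
  group_by(Store) %>%
  summarise(Avg_Store_Sales = mean(Weekly_Sales),
            Tot_Sales = sum(Weekly_Sales))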
#Plotting the Stores with the highest Average Store Sales, to select the one of the top 5 stores
ggplot(data=dum_dataf, aes(x=reorder(factor(Store), -Avg_Store_Sales), y=Avg_Store_Sales)) +
  geom_bar(stat="identity", fill="brown") +
  xlab('Store') +
  ylab('Average Store Sales')

The Top Stores, as per their Average Sales, are: 20, 4, 14, 13, 2
Also checking for Total Sales:
dum_dataf <- dum_dataf %>% arrange(desc(Tot_Sales))
ggplot(data=dum_dataf, aes(x=reorder(factor(Store), -Tot_Sales), y=Tot_Sales)) +
  geom_bar(stat="identity", fill="darkgreen") +
  xlab('Store') +
  ylab('Total Store Sales')

Even for Total Sales, the Top Stores remain: 20, 4, 14, 13, 2
Selecting a Random Store out of the top 5
set.seed(70500)
StoreID <- sample(c(20,4,14,13,2), 1)
StoreID
## [1] 20
# Aggregating the randomly chosen store's sales by department.
Store_Analysis <- dataf %>% 
                  filter(Store==StoreID) %>% 
                  group_by(Store, Dept) %>% 
                  summarise(Avg_Store_Sales=mean(Weekly_Sales), 
                            Tot_Sales=sum(Weekly_Sales))

# Viewing the head of Store_Analysis
head(Store_Analysis)
## # A tibble: 6 x 4
## # Groups:   Store [1]
##   Store  Dept Avg_Store_Sales Tot_Sales
##   <int> <int>           <dbl>     <dbl>
## 1    20     1          40545.  5798003.
## 2    20     2          78251. 11189929.
## 3    20     3          15491.  2215209.
## 4    20     4          51456.  7358262.
## 5    20     5          41648.  5955633.
## 6    20     6           8211.  1174137.
Summary Statistics of the Selected Store
Summary Statistics

Variable         N   Mean         Std. Dev.    Min    Pctl. 25    Pctl. 50    Pctl. 75     Max
Store            78  20           0            20     20          20          20           20
Dept             78  45.949       29.864       1      21.25       41.5        73.5         99
Avg_Store_Sales  78  27071.845    33651.788    9.667  4387.255    13941.789   38508.85     164633.742
Tot_Sales        78  3864120.275  4816272.294  29     624238.695  1977692.54  5506765.608  23542625.04
Department-Wise Total Sales.

Selecting a Random Department out of the Top 5 for Analysis.

Top_5_Dept <- Store_Analysis %>% top_n(5, Tot_Sales)  # explicitly ranking by total sales
DeptID <- sample(Top_5_Dept$Dept, 1)
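
Likewise, the weekly frame Store_dataf used from here on is not built in a shown chunk; a construction consistent with the head() output below would be:

#Filtering down to the chosen store and department (construction assumed)
Store_dataf <- dataf %>%
  filter(Store == StoreID, Dept == DeptID) %>%
  select(Date, Weekly_Sales) %>%
  arrange(Date)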

Checking the New Dataset’s Top Rows.

##         Date Weekly_Sales
## 1 2010-02-05     195223.8
## 2 2010-02-12     170043.5
## 3 2010-02-19     164314.3
## 4 2010-02-26     147699.7
## 5 2010-03-05     169171.2
## 6 2010-03-12     161433.3


Plotting the Time Series of the Weekly Sales

ggplot(Store_dataf, aes(x=Date, y=Weekly_Sales)) +
  geom_line(color="brown") + 
  geom_smooth(method=lm) +
  xlab("Weeks of the Year") +
  ylab("Weekly Sales")

Fitting a Linear Regression Model of Weekly Sales on Date

Model1 <- lm(Weekly_Sales ~ Date, data = Store_dataf)
summary(Model1)
## 
## Call:
## lm(formula = Weekly_Sales ~ Date)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45977 -14521    328  11815  50841 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.237e+05  8.152e+04  -2.744  0.00686 ** 
## Date         2.565e+01  5.383e+00   4.764 4.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18600 on 141 degrees of freedom
## Multiple R-squared:  0.1387, Adjusted R-squared:  0.1326 
## F-statistic:  22.7 on 1 and 141 DF,  p-value: 4.659e-06
Assessing the behavior/data-generating process of the time-series
Starting by checking whether the Series is Stationary:

ADF TEST:
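#tseries::adf.test (call assumed)
adf.test(Store_dataf$Weekly_Sales)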

## 
##  Augmented Dickey-Fuller Test
## 
## data:  Store_dataf$Weekly_Sales
## Dickey-Fuller = -2.4672, Lag order = 5, p-value = 0.3819
## alternative hypothesis: stationary

As expected, the data seems to be a non-stationary series: the ADF null hypothesis of a unit root is not rejected (p = 0.3819).

As the variance is also non-stationary, we must first make the variance stationary.

Log Transforming it to make the variance stationary

plot1 <- ggplot(Store_dataf, aes(x=Date, y=log1p(Weekly_Sales))) +
  geom_line(color="brown") + 
  xlab("Weeks") +
  ylab("Log Transformed Weekly Sales") +
  theme_minimal()

# Now let us apply Box-Cox Transformation to it and check.

# Box Cox Transforming it to make it stationary
Store_dataf_box <- Store_dataf %>%
                   mutate(close_boxcox = forecast::BoxCox(Weekly_Sales,
                                                          lambda = 'auto'))

plot2 <- ggplot(Store_dataf_box, aes(x=Date, y=close_boxcox)) +
  geom_line(color="brown") + 
  xlab("Weeks") +
  ylab("Close Box") +
  theme_minimal()

Comparing Both the plots.
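
#Stacking the two transformation plots with patchwork (composition assumed)
plot1 / plot2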

As can be seen, both transformations have resulted in a similar graph with stationary variance. Going ahead with the log transformation for its lower complexity, as in the plot above.

KPSS Test.
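
#tseries::kpss.test (call assumed)
kpss.test(Store_dataf$Weekly_Sales)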

## 
##  KPSS Test for Level Stationarity
## 
## data:  Store_dataf$Weekly_Sales
## KPSS Level = 1.2585, Truncation lag parameter = 4, p-value = 0.01

The KPSS test's null hypothesis is level stationarity, and with p = 0.01 it is rejected: as the mean isn't stationary, we will try to make it stationary using differencing.

First Order Differencing

# First Order Differencing
Store_log_1d <- Store_dataf %>%
                mutate(Sale_log_1d = log1p(Weekly_Sales)-lag(log1p(Weekly_Sales))) 

Let us now plot the new stationary plot (with first-order differencing) in comparison to the previous plot.

plot1_1d <- ggplot(Store_log_1d, aes(x=Date, y=Sale_log_1d)) +
  geom_line(color="brown") + 
    xlab("Weeks") +
    ylab("Log of Weekly Sales") +
    theme_minimal() 

#Comparing the Stationary plot with the previous plot
plot1 / plot1_1d 

The plot looks pretty stationary now, in comparison.

#Re-running the stationarity tests on the differenced series
#(the first week is dropped because first-order differencing leaves an NA there)
Store_log_1d <- Store_log_1d %>% filter(Date>'2010-02-05')
ADF Test
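#Call assumed
adf.test(Store_log_1d$Sale_log_1d)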
## 
##  Augmented Dickey-Fuller Test
## 
## data:  Store_log_1d$Sale_log_1d
## Dickey-Fuller = -6.1492, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
KPSS Test
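#Call assumed
kpss.test(Store_log_1d$Sale_log_1d)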
## 
##  KPSS Test for Level Stationarity
## 
## data:  Store_log_1d$Sale_log_1d
## KPSS Level = 0.060025, Truncation lag parameter = 4, p-value = 0.1
The ADF and KPSS tests both suggest that the data is now stationary: the ADF test rejects the unit root (p = 0.01), and the KPSS test fails to reject level stationarity (p = 0.1).
Visually, a yearly seasonality has been observed in the data.

Analyzing the time series through ACF/PACF Plots

ACF PLOT

PACF PLOT
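
The two plots would come from chunks along these lines (the lag.max choice is an assumption):

#ACF and PACF of the differenced series (calls assumed)
par(mfrow=c(1,2))
acf(Store_log_1d$Sale_log_1d, lag.max=20)
pacf(Store_log_1d$Sale_log_1d, lag.max=20)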

Looking at the lags, the series appears to follow an autoregressive AR(4) process. Let us analyse further.

SECTION - 2

Testing AIC And BIC for different ARIMA Models.

AIC(
  arima(Store_log_1d$Sale_log_1d,order=c(4,0,0)),
  arima(Store_log_1d$Sale_log_1d,order=c(4,0,1)),
  arima(Store_log_1d$Sale_log_1d,order=c(4,0,2)),
  arima(Store_log_1d$Sale_log_1d,order=c(4,0,3)),
  arima(Store_log_1d$Sale_log_1d,order=c(3,0,0)),
  arima(Store_log_1d$Sale_log_1d,order=c(3,0,1)),
  arima(Store_log_1d$Sale_log_1d,order=c(3,0,2)),
  arima(Store_log_1d$Sale_log_1d,order=c(3,0,3))
)
##                                                     df       AIC
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 0))  6 -284.2964
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 1))  7 -282.3171
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 2))  8 -284.7833
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 3))  9 -286.9875
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 0))  5 -273.6001
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 1))  6 -281.2913
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 2))  7 -286.6870
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 3))  8 -285.0091

Based on the AIC values, the best suggested ARIMA models are ARIMA(4,0,3) and ARIMA(3,0,2).

BIC(
  arima(Store_log_1d$Sale_log_1d,order=c(4,0,0)),
  arima(Store_log_1d$Sale_log_1d,order=c(4,0,1)),
  arima(Store_log_1d$Sale_log_1d,order=c(4,0,2)),
  arima(Store_log_1d$Sale_log_1d,order=c(4,0,3)),
  arima(Store_log_1d$Sale_log_1d,order=c(3,0,0)),
  arima(Store_log_1d$Sale_log_1d,order=c(3,0,1)),
  arima(Store_log_1d$Sale_log_1d,order=c(3,0,2)),
  arima(Store_log_1d$Sale_log_1d,order=c(3,0,3))
)
##                                                     df       BIC
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 0))  6 -266.5614
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 1))  7 -261.6263
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 2))  8 -261.1367
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 3))  9 -260.3851
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 0))  5 -258.8209
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 1))  6 -263.5564
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 2))  7 -265.9962
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 3))  8 -261.3624

Based on the BIC values, the best suggested ARIMA models are ARIMA(4,0,0) and ARIMA(3,0,2).

Also using auto.arima to get the ARIMA values.

model1 <- auto.arima(Store_log_1d$Sale_log_1d,
                     stationary=FALSE, allowdrift=FALSE, seasonal=FALSE,
                     stepwise=FALSE, approximation=FALSE)

Plotting the residuals for model 1 to see whether they are white noise.
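
The Ljung-Box summary below is in the format printed by forecast::checkresiduals(), so presumably:

checkresiduals(model1)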

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,0,2) with zero mean
## Q* = 2.0442, df = 5, p-value = 0.843
## 
## Model df: 5.   Total lags used: 10

auto.arima also suggests ARIMA(3,0,2).

# Checking residuals for 4,0,0
model2 <- arima(Store_log_1d$Sale_log_1d,order=c(4,0,0))

Plotting the residuals for model 2 to see whether they are white noise.
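
#Again, presumably:
checkresiduals(model2)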

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(4,0,0) with non-zero mean
## Q* = 5.1099, df = 5, p-value = 0.4026
## 
## Model df: 5.   Total lags used: 10

Analysing the residuals using the Ljung-Box test, neither model shows significant autocorrelation, but the ARIMA(3,0,2) model has the higher p-value (0.843 vs 0.403), i.e. it shows better white-noise characteristics.

# ACF and PACF Plots for the Residuals.
best_model <- model1
par(mfrow=c(1,2))
resid <- best_model$residuals
acf(resid, lag.max=10)
pacf(resid, lag.max=10)

The ACF and PACF plots of the residuals do not show any significant correlation, which is characteristic of white noise.

Conducting the Box-Ljung Test to verify that there is no autocorrelation.

Ljung-Box test

Lag = 1

Box.test(resid,type='Ljung-Box',lag=1)
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 0.0039211, df = 1, p-value = 0.9501

Lag = 5
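
#Call assumed, mirroring the lag-1 chunk
Box.test(resid,type='Ljung-Box',lag=5)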

## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 0.97357, df = 5, p-value = 0.9647

Lag = 10
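
#Call assumed
Box.test(resid,type='Ljung-Box',lag=10)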

## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 2.0442, df = 10, p-value = 0.996

Lag = 20
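
#Call assumed
Box.test(resid,type='Ljung-Box',lag=20)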

## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 9.7661, df = 20, p-value = 0.9722

Plotting the Predictions against the Actual Sales Data.

# Recovering the in-sample predictions: residual = actual - fitted, so fitted = actual - residuals
pred = Store_log_1d$Sale_log_1d - resid

ggplot()+
  geom_line(aes(Store_log_1d$Date,Store_log_1d$Sale_log_1d))+
  geom_line(aes(Store_log_1d$Date,pred),color='brown',alpha=0.6)+
  xlab("Weeks")+
  ylab("Weekly Sales (Log Transformed)") +
  theme_minimal() 

# The in-sample RMSE, computed after inverting the log1p transform (note: still on the differenced scale).
RMSE = sqrt(mean((expm1(pred) - expm1(Store_log_1d$Sale_log_1d))^2,na.rm=T))
RMSE
## [1] 0.08233186

Forecast for 5 Time-Periods.

best_model %>%
  forecast(h=5) %>% 
  autoplot()

As a 5-period forecast is pretty short, let us also forecast for 20 time-periods, i.e. 20 weeks ahead.

Forecast for 20 Time-Periods.
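
#Mirroring the 5-period chunk (call assumed)
best_model %>%
  forecast(h=20) %>% 
  autoplot()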

The forecast follows along with the actual data and the pattern is fairly consistent, although the forecast doesn't account for the seasonality that well, and it diminishes towards a log-sales difference of 0, which is expected as a stationary ARMA forecast reverts to its mean.