Introduction
A Kaggle competition dataset containing Walmart's weekly sales data was taken for analysis. It consists of historical sales data for 45 Walmart stores across the country, spanning almost three years, from 2010 to 2012. Each record contains the last date of the sales week (a Friday), the store number, the department number, the weekly sales, and whether the week is a holiday week or not.
Goal
As the dataset consists of multiple stores, each with multiple departments, our goal is to choose one store and one of its departments at random, and to analyze that series.
Note
While working on this retail dataset, a few things can be expected based on intuition. Seasonality is likely, driven by the yearly holidays. From the basic EDA, the dataset appears to contain some erroneous values, which must be cleaned before starting the time-series forecasting. The forecasting itself will be done on a single store and one of its top departments, chosen at random.
Importing useful libraries for EDA and forecasting:
1. dplyr
2. ggplot2
3. forecast
4. vtable
5. tidyverse
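A setup chunk loading these packages might look like the following (a sketch; the original chunk isn't shown):
# Loading the libraries used throughout the analysis
library(dplyr)      # data manipulation
library(ggplot2)    # plotting
library(forecast)   # time-series modeling and forecasting
library(vtable)     # summary-statistics tables
library(tidyverse)  # additional tidy tooling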
Reading the Input File
# Changing the working directory to the project data folder
setwd("C:/Users/MauP/OneDrive/Desktop/Spring Semester/Flex 1/Forecasting/Project Data - Walmart Sales")
# Reading the CSV file into a data frame
dataf = read.csv("Weekly_Sales.csv")
Let us see the number of records of the dataset:
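The counting chunk isn't shown; presumably a one-liner such as:
# Number of records in the dataset
nrow(dataf)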
## [1] 421570
Let us now see how the data looks, by viewing the first few rows.
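A head() call would produce the rows shown below (note that head() returns six rows by default):
# First few rows of the data
head(dataf)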
## Store Dept Date Weekly_Sales IsHoliday
## 1 1 1 2010-02-05 24924.50 FALSE
## 2 1 1 2010-02-12 46039.49 TRUE
## 3 1 1 2010-02-19 41595.55 FALSE
## 4 1 1 2010-02-26 19403.54 FALSE
## 5 1 1 2010-03-05 21827.90 FALSE
## 6 1 1 2010-03-12 21043.39 FALSE
Looking for the Structure of the Table
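Presumably via str():
# Structure of the data frame
str(dataf)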
## 'data.frame': 421570 obs. of 5 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Dept : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr "2010-02-05" "2010-02-12" "2010-02-19" "2010-02-26" ...
## $ Weekly_Sales: num 24925 46039 41596 19404 21828 ...
## $ IsHoliday : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
Converting the Date’s DataType to ‘Date’.
# Converting the DataType of the Date Column
dataf$Date <- as.Date(dataf$Date)
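The summary table below matches the output of vtable's st() (sumtable), which was presumably used here:
# Summary-statistics table for the full dataset (vtable)
st(dataf)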
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 50 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|---|
| Store | 421570 | 22.201 | 12.785 | 1 | 11 | 22 | 33 | 45 |
| Dept | 421570 | 44.26 | 30.492 | 1 | 18 | 37 | 74 | 99 |
| Weekly_Sales | 421570 | 15981.258 | 22711.184 | -4988.94 | 2079.65 | 7612.03 | 20205.853 | 693099.36 |
| IsHoliday | 421570 | | | | | | | |
| … No | 391909 | 93% | | | | | | |
| … Yes | 29661 | 7% | | | | | | |
Luckily, there aren't any null values in the dataset, but on a closer look, negative values are visible in Weekly_Sales.
Finding the Number of Negative Values.
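A simple count along these lines yields the figure below:
# Counting rows with negative weekly sales
sum(dataf$Weekly_Sales < 0)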
## [1] 1285
There are 1285 rows with negative values out of 421,570 rows, which is just about 0.3% of the entire dataset, so these rows can be dropped.
Removing the Negative Values.
dataf <- dataf[dataf$Weekly_Sales > 0, ]
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 50 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|---|
| Store | 420212 | 22.196 | 12.787 | 1 | 11 | 22 | 33 | 45 |
| Dept | 420212 | 44.241 | 30.509 | 1 | 18 | 37 | 74 | 99 |
| Weekly_Sales | 420212 | 16033.115 | 22729.492 | 0.01 | 2120.13 | 7661.7 | 20271.265 | 693099.36 |
| IsHoliday | 420212 | | | | | | | |
| … No | 390652 | 93% | | | | | | |
| … Yes | 29560 | 7% | | | | | | |
The dataset is now clean.
Let us now look at the effect of Holidays on the Store Sales
#Exploring the effect of Holiday
ggplot(dataf, aes(x = Store, y = Weekly_Sales))+
geom_jitter(aes(colour = IsHoliday)) +
xlab('Store') +
ylab('Sales Per Week') +
ggtitle("Store Weekly Sales - Comparing the Effect of Holiday") +
theme(
legend.position = c(.95, .95),
legend.justification = c("right", "top"),
legend.box.just = "right",
legend.margin = margin(6, 6, 6, 6)
)
Inferences from the plot:
1. As expected, whether a week is a holiday week affects the weekly sales across all the stores.
2. With all the stores plotted, some stores, roughly those numbered 10-20, appear to have higher weekly sales than the others.
Let us now look at the effect of Holidays on the Department Sales
ggplot(dataf, aes(x=Dept, y=Weekly_Sales)) +
geom_jitter(aes(colour = IsHoliday)) +
xlab('Department Number') +
ylab('Sales Per Week') +
ggtitle("Department Sales - Comparing the Effect of Holiday ") +
theme(
legend.position = c(.95, .95),
legend.justification = c("right", "top"),
legend.box.just = "right",
legend.margin = margin(6, 6, 6, 6)
)
Inferences from the plot:
1. A drastic difference in sales across different departments is visible.
2. Departments like 12 and 70 show huge spikes in sales; department 12 appears unaffected by holidays, while department 70 shows a strong holiday effect.
Using Boxplots to look for any potential Outliers.
boxplot(Weekly_Sales ~ IsHoliday, data = dataf,
        main = 'Weekly Sales Comparison - Holidays vs Normal Days',
        xlab = 'Holiday, True or False',
        ylab = 'Weekly Sales',
        frame = TRUE,
        horizontal = FALSE,
        col = "green")
Both boxplots show outliers, with no big difference in sales between holiday and normal weeks.
Let us plot the Average Store Sales, in order to Select the Top 5
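The aggregated data frame dum_dataf used in the next chunk is not constructed in this excerpt; a plausible sketch, assuming per-store averages and totals:
# Assumed construction of dum_dataf: average and total weekly sales per store
dum_dataf <- dataf %>%
  group_by(Store) %>%
  summarise(Avg_Store_Sales = mean(Weekly_Sales),
            Tot_Sales = sum(Weekly_Sales))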
#Plotting the Stores with the highest Average Store Sales, to select the one of the top 5 stores
ggplot(data=dum_dataf, aes(x=reorder(factor(Store), -Avg_Store_Sales), y=Avg_Store_Sales)) +
geom_bar(stat="identity", fill="brown") +
xlab('Store') +
ylab('Average Store Sales')
The top stores by average sales are: 20, 4, 14, 13, 2
Also Checking for Total Sales
dum_dataf <- dum_dataf %>% arrange(desc(Tot_Sales))
ggplot(data=dum_dataf, aes(x=reorder(factor(Store), -Tot_Sales), y=Tot_Sales)) +
geom_bar(stat="identity", fill="darkgreen") +
xlab('Store') +
ylab('Total Store Sales')
Even by total sales, the top stores remain: 20, 4, 14, 13, 2
Selecting a Random Store out of the top 5
set.seed(70500)
StoreID <- sample(c(20, 4, 14, 13, 2), 1)
StoreID
## [1] 20
# Choosing a random store for analysis.
Store_Analysis <- dataf %>%
  filter(Store == StoreID) %>%
  group_by(Store, Dept) %>%
  summarise(Avg_Store_Sales = mean(Weekly_Sales),
            Tot_Sales = sum(Weekly_Sales))
# Printing the head of Store_Analysis
head(Store_Analysis)
## # A tibble: 6 x 4
## # Groups: Store [1]
## Store Dept Avg_Store_Sales Tot_Sales
## <int> <int> <dbl> <dbl>
## 1 20 1 40545. 5798003.
## 2 20 2 78251. 11189929.
## 3 20 3 15491. 2215209.
## 4 20 4 51456. 7358262.
## 5 20 5 41648. 5955633.
## 6 20 6 8211. 1174137.
Summary Statistics of the Selected Store
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 50 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|---|
| Store | 78 | 20 | 0 | 20 | 20 | 20 | 20 | 20 |
| Dept | 78 | 45.949 | 29.864 | 1 | 21.25 | 41.5 | 73.5 | 99 |
| Avg_Store_Sales | 78 | 27071.845 | 33651.788 | 9.667 | 4387.255 | 13941.789 | 38508.85 | 164633.742 |
| Tot_Sales | 78 | 3864120.275 | 4816272.294 | 29 | 624238.695 | 1977692.54 | 5506765.608 | 23542625.04 |
Department-Wise Total Sales.
Selecting a Random Department of the Top 5 For Analysis.
Top_5_Dept <- Store_Analysis %>% top_n(5)
DeptID <- sample(Top_5_Dept$Dept, 1)
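The department-level data frame Store_dataf used from here on is not shown being built; a sketch, assuming it keeps the date and sales columns for the chosen store and department:
# Assumed construction of Store_dataf for the selected store and department
Store_dataf <- dataf %>%
  filter(Store == StoreID, Dept == DeptID) %>%
  select(Date, Weekly_Sales)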
Checking the New Dataset’s Top Rows.
## Date Weekly_Sales
## 1 2010-02-05 195223.8
## 2 2010-02-12 170043.5
## 3 2010-02-19 164314.3
## 4 2010-02-26 147699.7
## 5 2010-03-05 169171.2
## 6 2010-03-12 161433.3
Plotting the Time Series of the Weekly Sales
ggplot(Store_dataf, aes(x=Date, y=Weekly_Sales)) +
geom_line(color="brown") +
geom_smooth(method=lm) +
xlab("Weeks of the Year") +
ylab("Weekly Sales")
Fitting a Linear Regression of Weekly Sales on Date
# Date and Weekly_Sales from Store_dataf are assumed to be in scope here,
# matching the printed Call below
Model1 = lm(Weekly_Sales ~ Date)
summary(Model1)
##
## Call:
## lm(formula = Weekly_Sales ~ Date)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45977 -14521 328 11815 50841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.237e+05 8.152e+04 -2.744 0.00686 **
## Date 2.565e+01 5.383e+00 4.764 4.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18600 on 141 degrees of freedom
## Multiple R-squared: 0.1387, Adjusted R-squared: 0.1326
## F-statistic: 22.7 on 1 and 141 DF, p-value: 4.659e-06
Assessing the behavior/data-generating process of the time-series
Starting by checking whether the Series is Stationary:
ADF TEST:
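The output format below matches tseries::adf.test; the test was presumably run as:
# Augmented Dickey-Fuller test for stationarity (tseries package, assumed)
library(tseries)
adf.test(Store_dataf$Weekly_Sales)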
##
## Augmented Dickey-Fuller Test
##
## data: Store_dataf$Weekly_Sales
## Dickey-Fuller = -2.4672, Lag order = 5, p-value = 0.3819
## alternative hypothesis: stationary
As expected, the data seems to be a non-stationary series.
As the variance is non-stationary, we must first make the variance stationary.
Log-transforming the series to stabilize the variance:
plot1 <- ggplot(Store_dataf, aes(x=Date, y=log1p(Weekly_Sales))) +
  geom_line(color="brown") +
xlab("Weeks") +
ylab("Log Transformed Weekly Sales") +
theme_minimal()
# Now let us apply Box-Cox Transformation to it and check.
# Box Cox Transforming it to make it stationary
Store_dataf_box = Store_dataf %>%
  mutate(close_boxcox = forecast::BoxCox(Weekly_Sales, lambda = 'auto'))
plot2 <- ggplot(Store_dataf_box, aes(x=Date, y=close_boxcox)) +
  geom_line(color="brown") +
  xlab("Weeks") +
  ylab("Box-Cox Transformed Weekly Sales") +
  theme_minimal()
Comparing both the plots.
Both transformations result in a similar graph with stationary variance. Given its lower complexity, we proceed with the log transformation, as in the plot above.
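The side-by-side comparison presumably used the patchwork package's stacking operator, which also appears later in the report:
# Stacking the two transformation plots for comparison (assumes patchwork)
library(patchwork)
plot1 / plot2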
KPSS Test.
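The output format below matches tseries::kpss.test; presumably:
# KPSS test for level stationarity (tseries package, assumed)
kpss.test(Store_dataf$Weekly_Sales)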
##
## KPSS Test for Level Stationarity
##
## data: Store_dataf$Weekly_Sales
## KPSS Level = 1.2585, Truncation lag parameter = 4, p-value = 0.01
As the KPSS test indicates the mean isn't stationary either, we will try to make it stationary using differencing.
First Order Differencing
# First Order Differencing
Store_log_1d <- Store_dataf %>%
  mutate(Sale_log_1d = log1p(Weekly_Sales) - lag(log1p(Weekly_Sales)))
Let us now plot the new stationary series (after first-order differencing) in comparison to the previous plot.
plot1_1d <- ggplot(Store_log_1d, aes(x=Date, y=Sale_log_1d)) +
  geom_line(color="brown") +
  xlab("Weeks") +
  ylab("Differenced Log of Weekly Sales") +
  theme_minimal()
# Comparing the stationary plot with the previous plot
plot1 / plot1_1d
The plot looks pretty stationary now, in comparison.
# Dropping the first row (NA created by differencing) before re-running the stationarity tests
Store_log_1d <- Store_log_1d %>% filter(Date > '2010-02-05')
ADF Test
##
## Augmented Dickey-Fuller Test
##
## data: Store_log_1d$Sale_log_1d
## Dickey-Fuller = -6.1492, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
KPSS Test
##
## KPSS Test for Level Stationarity
##
## data: Store_log_1d$Sale_log_1d
## KPSS Level = 0.060025, Truncation lag parameter = 4, p-value = 0.1
The ADF and KPSS tests suggest that the data is now stationary.
Visually, a yearly seasonality has been observed in the data.
Analyzing the time series through ACF/PACF Plots
ACF PLOT
PACF PLOT
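The plots themselves are not reproduced in this excerpt; calls along these lines would generate them (a sketch):
# ACF and PACF of the differenced log-sales series
acf(Store_log_1d$Sale_log_1d)
pacf(Store_log_1d$Sale_log_1d)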
Looking at the lags, the series appears to follow an autoregressive process of order 4, i.e., AR(4). Let us analyze further.
SECTION - 2
Testing AIC And BIC for different ARIMA Models.
AIC(
arima(Store_log_1d$Sale_log_1d,order=c(4,0,0)),
arima(Store_log_1d$Sale_log_1d,order=c(4,0,1)),
arima(Store_log_1d$Sale_log_1d,order=c(4,0,2)),
arima(Store_log_1d$Sale_log_1d,order=c(4,0,3)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,0)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,1)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,2)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,3))
)
## df AIC
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 0)) 6 -284.2964
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 1)) 7 -282.3171
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 2)) 8 -284.7833
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 3)) 9 -286.9875
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 0)) 5 -273.6001
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 1)) 6 -281.2913
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 2)) 7 -286.6870
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 3)) 8 -285.0091
Based on the AIC values, the best suggested models are ARIMA(4,0,3) and ARIMA(3,0,2).
BIC(
arima(Store_log_1d$Sale_log_1d,order=c(4,0,0)),
arima(Store_log_1d$Sale_log_1d,order=c(4,0,1)),
arima(Store_log_1d$Sale_log_1d,order=c(4,0,2)),
arima(Store_log_1d$Sale_log_1d,order=c(4,0,3)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,0)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,1)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,2)),
arima(Store_log_1d$Sale_log_1d,order=c(3,0,3))
)
## df BIC
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 0)) 6 -266.5614
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 1)) 7 -261.6263
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 2)) 8 -261.1367
## arima(Store_log_1d$Sale_log_1d, order = c(4, 0, 3)) 9 -260.3851
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 0)) 5 -258.8209
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 1)) 6 -263.5564
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 2)) 7 -265.9962
## arima(Store_log_1d$Sale_log_1d, order = c(3, 0, 3)) 8 -261.3624
Based on the BIC values, the best suggested models are ARIMA(4,0,0) and ARIMA(3,0,2).
Also using auto.arima to get the ARIMA values.
model1 <- auto.arima(Store_log_1d$Sale_log_1d, stationary=FALSE, allowdrift=FALSE,
                     seasonal=FALSE, stepwise=FALSE, approximation=FALSE)
Plotting the residuals for model 1 to see whether they are white noise.
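The Ljung-Box summary below matches the output format of forecast::checkresiduals, so the residual diagnostics were presumably produced with:
# Residual diagnostics: plots the residuals and prints a Ljung-Box test
# (the same call with model2 produces the second set of diagnostics below)
checkresiduals(model1)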
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,2) with zero mean
## Q* = 2.0442, df = 5, p-value = 0.843
##
## Model df: 5. Total lags used: 10
auto.arima also suggests ARIMA(3,0,2).
# Checking residuals for ARIMA(4,0,0)
model2 <- arima(Store_log_1d$Sale_log_1d, order=c(4,0,0))
Plotting the residuals for model 2 to see whether they are white noise.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(4,0,0) with non-zero mean
## Q* = 5.1099, df = 5, p-value = 0.4026
##
## Model df: 5. Total lags used: 10
Analyzing the residuals using the Ljung-Box test, the ARIMA(3,0,2) model clearly shows the better characteristics, with almost no autocorrelation.
# ACF and PACF Plots for the Residuals.
best_model <- model1
par(mfrow=c(1,2))
resid = best_model$residuals
acf(resid, lag.max=10)
pacf(resid, lag.max=10)
The ACF and PACF plots of the residuals do not show any significant correlation and display the characteristics of white noise.
Conducting the Box-Ljung Test to verify that there is no autocorrelation.
Ljung-Box test
Lag = 1
Box.test(resid,type='Ljung-Box',lag=1)
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.0039211, df = 1, p-value = 0.9501
Lag = 5
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.97357, df = 5, p-value = 0.9647
Lag = 10
##
## Box-Ljung test
##
## data: resid
## X-squared = 2.0442, df = 10, p-value = 0.996
Lag = 20
##
## Box-Ljung test
##
## data: resid
## X-squared = 9.7661, df = 20, p-value = 0.9722
Plotting the Prediction to the Actual Sales Data.
# Computing the in-sample predictions: fitted values are the observed series minus the residuals
pred = Store_log_1d$Sale_log_1d - resid
ggplot()+
geom_line(aes(Store_log_1d$Date,Store_log_1d$Sale_log_1d))+
geom_line(aes(Store_log_1d$Date,pred),color='brown',alpha=0.6)+
xlab("Weeks")+
ylab("Weekly Sales (Log Transformed)") +
theme_minimal()
# The in-sample RMSE of the model based on the residuals.
RMSE = sqrt(mean((expm1(pred) - expm1(Store_log_1d$Sale_log_1d))^2, na.rm=T))
RMSE
## [1] 0.08233186
Forecast for 5 Time-Periods.
best_model %>%
  forecast(h=5) %>%
autoplot()
As a 5-period forecast horizon is quite short, we also forecast 20 time periods, i.e., 20 weeks ahead.
Forecast for 20 Time-Periods.
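The corresponding chunk isn't shown; a sketch mirroring the 5-period call above:
# Forecasting 20 weeks ahead with the chosen model
best_model %>%
  forecast(h=20) %>%
  autoplot()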
The forecast broadly follows the actual data, and the pattern is fairly consistent, although it does not account for the seasonality that well. It also gradually decays toward a log-sales difference of 0, as forecasts from a non-seasonal ARIMA converge to the series mean.