Time Series Model for Forecasting Retail Sales
1.Introduction
1.1. About the dataset
This dataset contains transactions from a retail shop between 2001-01-01 and 2012-12-01. Each row records information about a single sale, such as the order date, the ship date, the customer ID, etc. I collected this dataset from Algoritma Data Science School.
Columns of the datasets are:
- Row.ID : The unique id of each row of transaction
- Order.ID : Unique id of each order
- Order.Date : The day the order is made
- Ship.Date : The day the order is shipped
- Ship.Mode : Mode of shipping, chosen by customer
- Customer.ID : Customer’s unique id
- Segment : The type of customer (e.g. Consumer, Corporate)
- Product.ID : Each product’s id that is purchased
- Category : Product category
- Sub.Category : Subcategory of the products
- Product.Name : Product name
- Sales : Amount of sales of the item purchased
- Quantity : Quantity of the item purchased
- Discount : Discount entitled to the customer
- Profit : Gross profit made on the transaction
1.2. About time series analysis
Time series analysis processes data that are ordered in, and affected by, a sequence of time, and uses them to predict future values, a task called forecasting. The data first has to be converted into a time series object, which has 3 main requirements (see the sketch below):
a. No missing intervals
b. No missing values
c. Data has to be in the right sequence of time
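A minimal sketch of enforcing these three requirements on a toy monthly data frame (the toy column names date and value are hypothetical; pad() is from the padr package used later in this report):
library(dplyr)
library(tidyr)
library(padr)
# Toy monthly data with March missing
toy <- data.frame(
  date  = as.Date(c("2020-01-01", "2020-02-01", "2020-04-01")),
  value = c(10, 12, 15)
)
toy_clean <- toy %>%
  arrange(date) %>%           # c. put rows in the right time order
  pad() %>%                   # a. insert the missing March interval (value = NA)
  replace_na(list(value = 0)) # b. remove missing values (assuming NA = no sales)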
1.3. What we need to do
We need to build a model that forecasts the sales for the next 12 months of the data
2.Data Loading and Preprocessing
2.1. Load library
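The library-loading chunk is not echoed below; judging from the masking messages and the functions used later in this report, the loads were likely along these lines (a reconstruction, not the original chunk):
library(dplyr)     # data wrangling verbs
library(tidyr)     # drop_na()
library(lubridate) # ymd(), ym()
library(padr)      # pad()
library(forecast)  # auto.arima(), forecast(), autoplot()/autolayer()
library(tseries)   # adf.test()
library(MLmetrics) # evaluation metrics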
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
2.2. Load data
# Read data and remove NA
retail <- read.csv("retail.csv") %>%
  mutate(Order.Date = ymd(Order.Date)) %>%
  drop_na()
## Warning: 5952 failed to parse.
# Create a new column holding just the month and year of Order.Date
retail$month_year_order <- format(as.Date(retail$Order.Date), "%Y/%m")
retail$month_year_order <- ym(retail$month_year_order)
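An equivalent one-step alternative (a sketch using lubridate, not what the code above does) would be:
# Snap each order date to the first day of its month
retail$month_year_order <- floor_date(retail$Order.Date, unit = "month")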
head(retail)
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 1 1 CA-2016-152156 2011-08-16 11/11/16 Second Class CG-12520
## 2 2 CA-2016-152156 2011-08-16 11/11/16 Second Class CG-12520
## 3 3 CA-2016-138688 2006-12-16 6/16/16 Second Class DV-13045
## 4 4 US-2015-108966 2010-11-15 10/18/15 Standard Class SO-20335
## 5 5 US-2015-108966 2010-11-15 10/18/15 Standard Class SO-20335
## 6 6 CA-2014-115812 2006-09-14 6/14/14 Standard Class BH-11710
## Segment Product.ID Category Sub.Category
## 1 Consumer FUR-BO-10001798 Furniture Bookcases
## 2 Consumer FUR-CH-10000454 Furniture Chairs
## 3 Corporate OFF-LA-10000240 Office Supplies Labels
## 4 Consumer FUR-TA-10000577 Furniture Tables
## 5 Consumer OFF-ST-10000760 Office Supplies Storage
## 6 Consumer FUR-FU-10001487 Furniture Furnishings
## Product.Name Sales
## 1 Bush Somerset Collection Bookcase 261.9600
## 2 Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3 Self-Adhesive Address Labels for Typewriters by Universal 14.6200
## 4 Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5 Eldon Fold 'N Roll Cart System 22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.8600
## Quantity Discount Profit month_year_order
## 1 2 0.00 41.9136 2011-08-01
## 2 3 0.00 219.5820 2011-08-01
## 3 2 0.00 6.8714 2006-12-01
## 4 5 0.45 -383.0310 2010-11-01
## 5 2 0.20 2.5164 2010-11-01
## 6 7 0.00 14.1694 2006-09-01
2.3. Inspect and change data type (if necessary)
# Check data type
str(retail)
## 'data.frame': 4042 obs. of 16 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : Date, format: "2011-08-16" "2011-08-16" ...
## $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category : chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name : chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
## $ month_year_order: Date, format: "2011-08-01" "2011-08-01" ...
- We just need to change Order.Date to the Date type and make a new column month_year_order which takes the month and the year of Order.Date
2.4. Check for null values
# Check for null values
retail %>%
is.na() %>%
colSums()
## Row.ID Order.ID Order.Date Ship.Date
## 0 0 0 0
## Ship.Mode Customer.ID Segment Product.ID
## 0 0 0 0
## Category Sub.Category Product.Name Sales
## 0 0 0 0
## Quantity Discount Profit month_year_order
## 0 0 0 0
2.5. Data aggregation
# Take the sums of sales on each month to be analyzed
retail_clean <- retail %>%
group_by(month_year_order) %>%
summarise(Total = sum(Sales)) %>%
select(month_year_order, Total)
range(retail_clean$month_year_order)
## [1] "2001-01-01" "2012-12-01"
# Check and pad the clean data if necessary
retail_clean %>%
padr::pad() %>%
is.na() %>%
colSums()
## pad applied on the interval: month
## month_year_order Total
## 0 0
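No NAs were introduced by padding, so no filling is needed here. Had pad() created gaps, a sketch of the fix (assuming a missing month means zero sales) would be:
# Pad missing months and treat them as zero-sales months
retail_clean <- retail_clean %>%
  padr::pad() %>%
  tidyr::replace_na(list(Total = 0))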
3.Exploratory Data Analysis
# Make a TS object
retail_ts <- ts(retail_clean$Total, start = c(2001, 1), frequency = 12)
autoplot(retail_ts, series = "Actual Sales")
# Decompose
retail_dec <- decompose(retail_ts, type = "additive")
autoplot(retail_dec)
- The data here is seasonal
- There is an upward trend
- The trend shows only minimal ups and downs, which indicates that this is not a multi-seasonal time series
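To inspect the seasonal pattern more directly, a seasonal plot can help (a sketch; ggseasonplot() is from the forecast package):
# Overlay each year's monthly sales so the seasonal shape is visible
ggseasonplot(retail_ts, year.labels = TRUE)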
4.Cross Validation
retail_test <- tail(retail_ts, 12)
retail_train <- head(retail_ts, 132)
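A quick sanity check on the split (a sketch; the series spans 144 months, so 132 train + 12 test should cover it exactly):
# TRUE if the train and test sets together reproduce the full series
length(retail_train) + length(retail_test) == length(retail_ts)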
5.Forecast Model Building
5.1. SARIMA
Before building the model, we have to check whether the data is stationary or random (i.e. behaves like a random walk)
# Check if the data is stationary
retail_train %>%
adf.test()
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -3.4109, Lag order = 5, p-value = 0.05583
## alternative hypothesis: stationary
The data is not yet stationary, since the p-value (0.056) is slightly bigger than alpha (0.05); thus we have to apply differencing
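As a cross-check (a sketch; ndiffs() is from the forecast package), we can estimate how many differences are needed:
# Estimated number of first differences required for stationarity
ndiffs(retail_train)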
retail_train %>%
diff() %>%
adf.test()
## Warning in adf.test(.): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -8.0822, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
The p-value is now smaller than alpha; thus the differenced series is no longer random
# Inspect the ACF/PACF of the differenced series
retail_train %>%
diff() %>%
tsdisplay()
model_sarima <- auto.arima(y = retail_train, seasonal = TRUE)
autoplot(model_sarima$fitted, series = "Sarima") +
autolayer(retail_train, series = "Actual")
# Check the AIC value
model_sarima$aic
## [1] 2553.371
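To see which orders auto.arima actually selected (they are not echoed in this report), printing the model object is enough:
# Reveals the chosen (p,d,q)(P,D,Q)[12] specification
model_sarima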
5.2. Holt-Winters
model_holtz <- HoltWinters(x = retail_train, alpha = 0.9, beta = 0.1, gamma = FALSE)
autoplot(model_holtz$fitted[,1], series = "hw") +
autolayer(retail_train, series = "Actual")
6.Model Evaluation
model_hw_forecast <- forecast(model_holtz, h = 12)
model_sarima_forecast <- forecast(model_sarima, h = 12)
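Since MLmetrics is loaded, the two forecasts can also be compared numerically before plotting; a minimal sketch (lower MAPE is better):
# Test-set error of each model
MAPE(y_pred = model_sarima_forecast$mean, y_true = retail_test)
MAPE(y_pred = model_hw_forecast$mean, y_true = retail_test)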
# Plot the result of forecasting
retail_train %>%
autoplot() +
autolayer(model_sarima_forecast, series = "SARIMA Forecast") +
autolayer(retail_test, series = "data test")
retail_train %>%
autoplot() +
autolayer(model_hw_forecast$mean, series = "HW Forecast") +
autolayer(retail_test, series = "data test")
- From the plots, we can see that the SARIMA forecast tracks the test data better than the Holt-Winters forecast
7.Assumption Check
7.1. Normality
For this check, we need to see whether the residuals are normally distributed
- H0: residuals are normally distributed
- H1: residuals are not normally distributed
# Shapiro-Wilk test
shapiro.test(model_sarima_forecast$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_sarima_forecast$residuals
## W = 0.88825, p-value = 1.596e-08
hist(model_sarima_forecast$residuals, breaks = 20)
- Since the p-value < 0.05, we reject H0; thus the residuals are not normally distributed
- We can also look at the histogram to see whether the residuals appear normally distributed
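One common remedy for non-normal residuals in sales data (an untested suggestion, not part of the original analysis) is to refit on a log scale via auto.arima's Box-Cox argument:
# lambda = 0 log-transforms the series before fitting and back-transforms the forecasts
model_sarima_log <- auto.arima(retail_train, seasonal = TRUE, lambda = 0)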
7.2. Auto-correlations
For this check, we need to see if there is autocorrelation in the forecast error
- H0: No autocorrelation in the forecast error
- H1: There is autocorrelation in the forecast error
Box.test(model_sarima_forecast$residuals, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: model_sarima_forecast$residuals
## X-squared = 0.48013, df = 1, p-value = 0.4884
Since the p-value is > 0.05, we can say that there is no autocorrelation in the forecast error.
But we still have to inspect the residuals through a line plot.
plot(model_sarima_forecast$residuals)
Through the line plot, we can see that the residual variance is not even over time, so the Ljung-Box conclusion should be treated with caution. This usually happens due to unpredictable events such as natural disasters, pandemics, economic crashes, etc.
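The forecast package also bundles these residual diagnostics into a single call (a sketch; it produces the time plot, the ACF, and a Ljung-Box test in one step):
# Combined residual diagnostics for the fitted SARIMA model
checkresiduals(model_sarima)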
8.Conclusion
From the two models built above, we can see that SARIMA performs better than Holt-Winters, and we can reasonably assume that there is no autocorrelation in its residuals. However, this analysis could still be improved, for example by trying more models and fine-tuning each model's parameters, as sketched below.
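A minimal sketch of one such extension (an ETS model from the forecast package as a third candidate; not part of the original analysis):
# Fit an automatically selected ETS model and score it on the same test set
model_ets <- ets(retail_train)
model_ets_forecast <- forecast(model_ets, h = 12)
MAPE(y_pred = model_ets_forecast$mean, y_true = retail_test)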