Time Series Model for Forecasting Retail Sales
1.Introduction
1.1. About the dataset
This dataset contains transactions from a retail shop between 2001-01-01 and 2012-12-01. Each row records information about a single sale, such as the order date, the ship date, the customer ID, etc. I collected this dataset from Algoritma Data Science School.
Columns of the datasets are:
- Row.ID : The unique id of each row of transaction
- Order.ID : Unique id of each order
- Order.Date : The day the order is made
- Ship.Date : The day the order is shipped
- Ship.Mode : Mode of shipping, chosen by customer
- Customer.ID : Customer’s unique id
- Segment : The type of customer (e.g. Consumer, Corporate)
- Product.ID : Each product’s id that is purchased
- Category : Product category
- Sub.Category : Subcategory of the products
- Product.Name : Product name
- Sales : Amount of sales of the item purchased
- Quantity : Quantity of the item purchased
- Discount : Discount entitled to the customer
- Profit : Gross profit made on the transaction
1.2. About time series analysis
Time series analysis processes data that are ordered in, and affected by, a sequence of time, and uses them to predict future values, a task called forecasting. The data first has to be converted into a time series object, which has 3 main requirements (see the sketch below):
a. No missing intervals
b. No missing values
c. Data has to be in the right sequence of time
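A minimal sketch of enforcing these three requirements on a toy monthly data frame (the toy column names date and value are hypothetical; pad() is from the padr package used later in this report):
library(dplyr)
library(tidyr)
library(padr)
# Toy monthly data with March missing
toy <- data.frame(
  date  = as.Date(c("2020-01-01", "2020-02-01", "2020-04-01")),
  value = c(10, 12, 15)
)
toy_clean <- toy %>%
  arrange(date) %>%           # c. put rows in the right time order
  pad() %>%                   # a. insert the missing March interval (value = NA)
  replace_na(list(value = 0)) # b. remove missing values (assuming NA = no sales)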
1.3. What we need to do
We need to build a model that forecasts the sales for the next 12 months of the data
2.Data Loading and Preprocessing
2.1. Load library
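The library-loading chunk is not echoed below; judging from the masking messages and the functions used later in this report, the loads were likely along these lines (a reconstruction, not the original chunk):
library(dplyr)     # data wrangling verbs
library(tidyr)     # drop_na()
library(lubridate) # ymd(), ym()
library(padr)      # pad()
library(forecast)  # auto.arima(), forecast(), autoplot()/autolayer()
library(tseries)   # adf.test()
library(MLmetrics) # evaluation metrics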
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
2.2. Load data
# Read data and remove NA
retail <- read.csv("retail.csv") %>%
  mutate(Order.Date = ymd(Order.Date)) %>%
  drop_na()
## Warning: 5952 failed to parse.
# Create a new column holding just the month and year of Order.Date
retail$month_year_order <- format(as.Date(retail$Order.Date), "%Y/%m")
retail$month_year_order <- ym(retail$month_year_order)
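An equivalent one-step alternative (a sketch using lubridate, not what the code above does) would be:
# Snap each order date to the first day of its month
retail$month_year_order <- floor_date(retail$Order.Date, unit = "month")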
head(retail)
## Row.ID Order.ID Order.Date Ship.Date Ship.Mode Customer.ID
## 1 1 CA-2016-152156 2011-08-16 11/11/16 Second Class CG-12520
## 2 2 CA-2016-152156 2011-08-16 11/11/16 Second Class CG-12520
## 3 3 CA-2016-138688 2006-12-16 6/16/16 Second Class DV-13045
## 4 4 US-2015-108966 2010-11-15 10/18/15 Standard Class SO-20335
## 5 5 US-2015-108966 2010-11-15 10/18/15 Standard Class SO-20335
## 6 6 CA-2014-115812 2006-09-14 6/14/14 Standard Class BH-11710
## Segment Product.ID Category Sub.Category
## 1 Consumer FUR-BO-10001798 Furniture Bookcases
## 2 Consumer FUR-CH-10000454 Furniture Chairs
## 3 Corporate OFF-LA-10000240 Office Supplies Labels
## 4 Consumer FUR-TA-10000577 Furniture Tables
## 5 Consumer OFF-ST-10000760 Office Supplies Storage
## 6 Consumer FUR-FU-10001487 Furniture Furnishings
## Product.Name Sales
## 1 Bush Somerset Collection Bookcase 261.9600
## 2 Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back 731.9400
## 3 Self-Adhesive Address Labels for Typewriters by Universal 14.6200
## 4 Bretford CR4500 Series Slim Rectangular Table 957.5775
## 5 Eldon Fold 'N Roll Cart System 22.3680
## 6 Eldon Expressions Wood and Plastic Desk Accessories, Cherry Wood 48.8600
## Quantity Discount Profit month_year_order
## 1 2 0.00 41.9136 2011-08-01
## 2 3 0.00 219.5820 2011-08-01
## 3 2 0.00 6.8714 2006-12-01
## 4 5 0.45 -383.0310 2010-11-01
## 5 2 0.20 2.5164 2010-11-01
## 6 7 0.00 14.1694 2006-09-01
2.3. Inspect and change data type (if necessary)
# Check data type
str(retail)
## 'data.frame': 4042 obs. of 16 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : Date, format: "2011-08-16" "2011-08-16" ...
## $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category : chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name : chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
## $ month_year_order: Date, format: "2011-08-01" "2011-08-01" ...
- We just need to change Order.Date to the Date type and make a new column month_year_order which takes the month and the year of Order.Date
2.4. Check for null values
# Check for null values
retail %>%
is.na() %>%
colSums()
## Row.ID Order.ID Order.Date Ship.Date
## 0 0 0 0
## Ship.Mode Customer.ID Segment Product.ID
## 0 0 0 0
## Category Sub.Category Product.Name Sales
## 0 0 0 0
## Quantity Discount Profit month_year_order
## 0 0 0 0
2.5. Data aggregation
# Take the sums of sales on each month to be analyzed
retail_clean <- retail %>%
group_by(month_year_order) %>%
summarise(Total = sum(Sales)) %>%
select(month_year_order, Total)
range(retail_clean$month_year_order)
## [1] "2001-01-01" "2012-12-01"
# Check and pad the clean data if necessary
retail_clean %>%
padr::pad() %>%
is.na() %>%
colSums()
## pad applied on the interval: month
## month_year_order Total
## 0 0
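No NAs were introduced by padding, so no filling is needed here. Had pad() created gaps, a sketch of the fix (assuming a missing month means zero sales) would be:
# Pad missing months and treat them as zero-sales months
retail_clean <- retail_clean %>%
  padr::pad() %>%
  tidyr::replace_na(list(Total = 0))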
3.Exploratory Data Analysis
# Make a TS object
retail_ts <- ts(retail_clean$Total, start = c(2001, 1), frequency = 12)
autoplot(retail_ts, series = "Actual Sales")
# Decompose
retail_dec <- decompose(retail_ts, type = "additive")
autoplot(retail_dec)
- The data here is seasonal
- There is an upward trend
- The trend shows only minimal ups and downs, which indicates that this is not a multi-seasonal time series
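To inspect the seasonal pattern more directly, a seasonal plot can help (a sketch; ggseasonplot() is from the forecast package):
# Overlay each year's monthly sales so the seasonal shape is visible
ggseasonplot(retail_ts, year.labels = TRUE)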
4.Cross Validation
retail_test <- tail(retail_ts, 12)
retail_train <- head(retail_ts, 132)
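A quick sanity check on the split (a sketch; the series spans 144 months, so 132 train + 12 test should cover it exactly):
# TRUE if the train and test sets together reproduce the full series
length(retail_train) + length(retail_test) == length(retail_ts)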
5.Forecast Model Building
5.1. SARIMA
Before building the model, we have to check whether the data is stationary or random (i.e. behaves like a random walk)
# Check if the data is stationary
retail_train %>%
adf.test()
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -3.4109, Lag order = 5, p-value = 0.05583
## alternative hypothesis: stationary
The data is not yet stationary, since the p-value (0.056) is slightly bigger than alpha (0.05); thus we have to apply differencing
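As a cross-check (a sketch; ndiffs() is from the forecast package), we can estimate how many differences are needed:
# Estimated number of first differences required for stationarity
ndiffs(retail_train)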
retail_train %>%
diff() %>%
adf.test()
## Warning in adf.test(.): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: .
## Dickey-Fuller = -8.0822, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
The p-value is now smaller than alpha; thus the differenced series is no longer random
# Inspect the ACF/PACF of the differenced series
retail_train %>%
diff() %>%
tsdisplay()
model_sarima <- auto.arima(y = retail_train, seasonal = TRUE)
autoplot(model_sarima$fitted, series = "Sarima") +
autolayer(retail_train, series = "Actual")
# Check the AIC value
model_sarima$aic
## [1] 2553.371
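To see which orders auto.arima actually selected (they are not echoed in this report), printing the model object is enough:
# Reveals the chosen (p,d,q)(P,D,Q)[12] specification
model_sarima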
5.2. Holt-Winters
model_holtz <- HoltWinters(x = retail_train, alpha = 0.9, beta = 0.1, gamma = FALSE)
autoplot(model_holtz$fitted[,1], series = "hw") +
autolayer(retail_train, series = "Actual")
6.Model Evaluation
model_hw_forecast <- forecast(model_holtz, h = 12)
model_sarima_forecast <- forecast(model_sarima, h = 12)
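Since MLmetrics is loaded, the two forecasts can also be compared numerically before plotting; a minimal sketch (lower MAPE is better):
# Test-set error of each model
MAPE(y_pred = model_sarima_forecast$mean, y_true = retail_test)
MAPE(y_pred = model_hw_forecast$mean, y_true = retail_test)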
# Plot the result of forecasting
retail_train %>%
autoplot() +
autolayer(model_sarima_forecast, series = "SARIMA Forecast") +
autolayer(retail_test, series = "data test")
retail_train %>%
autoplot() +
autolayer(model_hw_forecast$mean, series = "HW Forecast") +
autolayer(retail_test, series = "data test")
- From the plots, we can see that the SARIMA forecast tracks the test data better than the Holt-Winters forecast
7.Assumption Check
7.1. Normality
For this check, we need to see whether the residuals are normally distributed
- H0: residuals are normally distributed
- H1: residuals are not normally distributed
# Shapiro-Wilk test
shapiro.test(model_sarima_forecast$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_sarima_forecast$residuals
## W = 0.88825, p-value = 1.596e-08
hist(model_sarima_forecast$residuals, breaks = 20)
- Since the p-value < 0.05, we reject H0; thus the residuals are not normally distributed
- We can also look at the histogram to see whether the residuals appear normally distributed
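One common remedy for non-normal residuals in sales data (an untested suggestion, not part of the original analysis) is to refit on a log scale via auto.arima's Box-Cox argument:
# lambda = 0 log-transforms the series before fitting and back-transforms the forecasts
model_sarima_log <- auto.arima(retail_train, seasonal = TRUE, lambda = 0)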
7.2. Auto-correlations
For this check, we need to see if there is autocorrelation in the forecast error
- H0: No autocorrelation in the forecast error
- H1: There is autocorrelation in the forecast error
Box.test(model_sarima_forecast$residuals, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: model_sarima_forecast$residuals
## X-squared = 0.48013, df = 1, p-value = 0.4884
Since the p-value is > 0.05, we can say that there is no autocorrelation in the forecast error.
But we still have to inspect the residuals through a line plot.
plot(model_sarima_forecast$residuals)
Through the line plot, we can see that the residual variance is not even over time, so the Ljung-Box conclusion should be treated with caution. This usually happens due to unpredictable events such as natural disasters, pandemics, economic crashes, etc.
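The forecast package also bundles these residual diagnostics into a single call (a sketch; it produces the time plot, the ACF, and a Ljung-Box test in one step):
# Combined residual diagnostics for the fitted SARIMA model
checkresiduals(model_sarima)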
8.Conclusion
From the two models built above, we can see that SARIMA performs better than Holt-Winters, and we can reasonably assume that there is no autocorrelation in its residuals. However, this analysis could still be improved, for example by trying more models and fine-tuning each model's parameters, as sketched below.
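A minimal sketch of one such extension (an ETS model from the forecast package as a third candidate; not part of the original analysis):
# Fit an automatically selected ETS model and score it on the same test set
model_ets <- ets(retail_train)
model_ets_forecast <- forecast(model_ets, h = 12)
MAPE(y_pred = model_ets_forecast$mean, y_true = retail_test)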