1 Background

Nowadays, people behavior on shopping are shifting to online shopping. Therefore, number of transaction in online shopping has been increasing and it is predicted to continue the positive trend in the following years according to statistica.

Therefore, it’is important to know the potential sales of their product in the future. In this chance I would like to make forecasting of data e-commerce sales of one of retailer located in United Kingdom. The retailer sells various products from stationary until home products. The data are from 12/1/2010 until 12/9/2011.

1.1 Business Objective

Based on that it is important to know what products ordered the most by customer and to predict how the demand of the products in the following week since december is coming, and the transaction made by this retailer are increasing.

2 Preprocessing Data

2.1 Import Library

library(readr) #to read data
library(DT) #to visualize data in table
library(lubridate) #to dea with data
library(tidyverse) #for data wrangling 
library(dplyr) #for data wrangling
library(ggplot2) #for basic EDA
library(TSstudio) #time series library
library(padr) # for padding
library(forecast) # for forecasting
library(tseries) # for adf.test
library(MLmetrics)#for calculating error
library(yardstick) #for measuing forecast performance
library(purrr) #for functional programming
library(tidyquant) #for some ggplot aesthetic
library(recipes) #for data preprocess
library(zoo) #for fill NA data
library(tidyr) #for function spread/nest
library(recipes) #for function ts and msts
library(tibble) #for function enframe
library(tidymodels) #for function rmse_vec
library(png) #for function readPNG
library(grid) #for function grid.raster

2.2 Read Data

The data was obtained from kaggle which consist of 8 variables; invoice number, Stock Code, Product Description, Quantity, Invoice Date, Unit Price, Customer ID, and Country. The data from 12/1/2010 until 12/9/2011.

sales <- read_csv("data.csv")
sales

Here I delve specific details of variables above :
* InvoiceNo : invoice number
* StockCode : code of product ordered by customer
* Description : description of the product
* Quantity : quantity ordered
* InvoiceDate : Date of order made
* UnitPrice : Price of the product
* CustomerID : ID of the customer
* Country : Country where the customer from / make the order

# check data structure
str(sales)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 541909 obs. of  8 variables:
##  $ InvoiceNo  : chr  "536365" "536365" "536365" "536365" ...
##  $ StockCode  : chr  "85123A" "71053" "84406B" "84029G" ...
##  $ Description: chr  "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity   : num  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: chr  "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" ...
##  $ UnitPrice  : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : num  17850 17850 17850 17850 17850 ...
##  $ Country    : chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   InvoiceNo = col_character(),
##   ..   StockCode = col_character(),
##   ..   Description = col_character(),
##   ..   Quantity = col_double(),
##   ..   InvoiceDate = col_character(),
##   ..   UnitPrice = col_double(),
##   ..   CustomerID = col_double(),
##   ..   Country = col_character()
##   .. )

The invoice date should be convert into Date format.

#change `invoiceDate' format to Date
sales <- sales %>% 
  mutate(InvoiceDate = mdy_hm(InvoiceDate))

str(sales)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 541909 obs. of  8 variables:
##  $ InvoiceNo  : chr  "536365" "536365" "536365" "536365" ...
##  $ StockCode  : chr  "85123A" "71053" "84406B" "84029G" ...
##  $ Description: chr  "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity   : num  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: POSIXct, format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
##  $ UnitPrice  : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : num  17850 17850 17850 17850 17850 ...
##  $ Country    : chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...

The invoice date is converted into date format.

# checking if there is missing data value
colSums(is.na(sales))
##   InvoiceNo   StockCode Description    Quantity InvoiceDate   UnitPrice 
##           0           0        1454           0           0           0 
##  CustomerID     Country 
##      135080           0

There are 135080 customers are unknown! In addition, we have 1454 of missing descriptions.

# checking summary of the data
summary(sales)
##   InvoiceNo          StockCode         Description           Quantity        
##  Length:541909      Length:541909      Length:541909      Min.   :-80995.00  
##  Class :character   Class :character   Class :character   1st Qu.:     1.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :     3.00  
##                                                           Mean   :     9.55  
##                                                           3rd Qu.:    10.00  
##                                                           Max.   : 80995.00  
##                                                                              
##   InvoiceDate                    UnitPrice           CustomerID    
##  Min.   :2010-12-01 08:26:00   Min.   :-11062.06   Min.   :12346   
##  1st Qu.:2011-03-28 11:34:00   1st Qu.:     1.25   1st Qu.:13953   
##  Median :2011-07-19 17:17:00   Median :     2.08   Median :15152   
##  Mean   :2011-07-04 13:34:57   Mean   :     4.61   Mean   :15288   
##  3rd Qu.:2011-10-19 11:27:00   3rd Qu.:     4.13   3rd Qu.:16791   
##  Max.   :2011-12-09 12:50:00   Max.   : 38970.00   Max.   :18287   
##                                                    NA's   :135080  
##    Country         
##  Length:541909     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

The UnitPrice and Quantity should not be minus value. Therefore we need to remove the minus data, since it shows that the goods sold were damaged or returned.

#drop minus data from variable Quantity and Unit Price
sales <- sales %>% 
  filter(Quantity >= 0,
         UnitPrice >= 1) 

#check the summary of the data
summary(sales)
##   InvoiceNo          StockCode         Description           Quantity       
##  Length:419211      Length:419211      Length:419211      Min.   :    1.00  
##  Class :character   Class :character   Class :character   1st Qu.:    1.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    3.00  
##                                                           Mean   :    8.28  
##                                                           3rd Qu.:    8.00  
##                                                           Max.   :80995.00  
##                                                                             
##   InvoiceDate                    UnitPrice           CustomerID    
##  Min.   :2010-12-01 08:26:00   Min.   :    1.000   Min.   :12346   
##  1st Qu.:2011-03-23 14:55:00   1st Qu.:    1.650   1st Qu.:13975   
##  Median :2011-07-17 15:47:00   Median :    2.890   Median :15153   
##  Mean   :2011-07-02 12:13:52   Mean   :    4.776   Mean   :15297   
##  3rd Qu.:2011-10-18 17:00:00   3rd Qu.:    4.950   3rd Qu.:16805   
##  Max.   :2011-12-09 12:50:00   Max.   :13541.330   Max.   :18287   
##                                                    NA's   :114300  
##    Country         
##  Length:419211     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

The data now is clean. so we can proceed to the next step.

3 Basic Exploratory Data Analysis

sales_agg <- sales %>% 
  mutate(Date = floor_date(InvoiceDate, unit = "month")) %>% 
  group_by(Date) %>% 
  summarise(order = n()) %>% 
  pad() %>% 
  fill_by_value(order, value = 0) %>% 
  ungroup()
  
  
sales_agg_fil <- sales_agg %>% 
  filter(Date < "2011-12-01")

#Visualize number of order made from Dec 2010 - Nov 2011

plot1 <- ggplot(data = sales_agg_fil, aes(x=Date, y=order, label = order))+
  geom_line()+
  theme_classic()+
  geom_label(aes(fill = order), colour = "white", fontface = "bold")+
  labs(title = "Total Order Made in a UK Online Retailer Platform",
       subtitle = "Data from Dec 2010 - Jan 2011",
       x = "Date",
       y = "Number of Order")

plot1

Based on the picture above, we know that transaction data of this retailer face increasing trend especially, for the end of 2011. However, if take a look back to the end of 2010, we know that the total order were 35,116 before it decreased in the following months. Therefore, it is important to know the total demand that could happen in Dec 2011, since it’s about to christmas and new year where people are tend to increase their shopping. Furthermore, the retailer should get prepared for this. In this chance, I would like to forecast the demand for top 3 products sold by this retailer along that period.

Besides, sell in the United Kingdom, this retailer also sell its product abroad. Here is the top 10 of country that ordered to this retailer.

# Top 10 Country Made Order
sales_country <- sales %>% 
  group_by(Country) %>% 
  summarise(order = n()) %>% 
  ungroup() %>% 
  arrange(desc(order)) %>% 
  head(10)

plot_top_country <- ggplot(data = sales_country, aes(x=reorder(Country, order),
                                                     y = order, label = order))+
  geom_col(aes(fill = order), show.legend = T)+
  coord_flip()+
  theme_bw()+
  theme(axis.text = element_text(size = 12), axis.title = element_text(size = 18, colour = "black"))+
  geom_label(aes(fill = order),
             colour = "white",
             fontface = "bold",
             size = 5,
             position = position_stack(1))+
  labs(title = "Number of Order made Within Country",
       subtitle = "Top 10 Country Made order at a UK Online Retailer Platform from Dec 2010 - Dec 2011",
       x = "Country",
       y = "Number of Order")

plot_top_country

Based on the picture above, we can see that majority products of this UK online retailer were sold to people in the UK. Several products were exported to Germany (6.923), France (6.382) and EIRE/ Ireland (6.183).

# Top 10 category product Made at E-commerce 
sales_product <- sales %>% 
  group_by(StockCode) %>% 
  summarise(order = n()) %>% 
  ungroup() %>% 
  arrange(desc(order)) %>% 
  head(10)

sales_product <- sales_product %>% 
  mutate(StockCode = case_when(StockCode == "85123A" ~ "White Hanging",
                               StockCode == "85099B" ~ "Jumbo Bag Red",
                               StockCode == "22423" ~ "Regency CakeStand",
                               StockCode == "47566" ~ "Tea Time Pantry Bunting",
                               StockCode == "20725" ~ "Lunch Bag Retrospot",
                               StockCode == "84879" ~ "Assorted Colour bird ornament",
                               StockCode == "22720" ~ "Set of Cake Tins Party Design",
                               StockCode == "20727" ~ "Lunch Bag Black Skull",
                               StockCode == "22383" ~ "Lunch Bag Suki Design",
                               StockCode == "22457" ~ "Natural Slate Heart Chalkboard",
                               StockCode == "23843" ~ "Paper Craft"))

plot_product <- ggplot(data = sales_product, aes(x=reorder(StockCode, order),
                                                     y = order, label = order))+
  geom_col(aes(fill = order), show.legend = T)+
  coord_flip()+
  theme_bw()+
  theme(axis.text = element_text(size = 12), axis.title = element_text(size = 18, colour = "black"))+
  geom_label(aes(fill = order),
             colour = "white",
             fontface = "bold",
             size = 5,
             position = position_stack(0.8))+
  labs(title = "Number of Order Based on Category Product",
       subtitle = "Top 10 Product Ordered at E-commerce Platform from Dec 2010 - Dec 2011",
       x = "Product Category",
       y = "Number of Order")

plot_product

White hanging (85123A), Jumbo Bag Red (85099B) and Regency CakeStand (22423) is top 3 product ordered in this retailer platform. Therefore for the forecasting we focus on this three products.

4 Make Time Series Format

ecom <- sales %>%
  filter(StockCode %in% c("85123A","85099B","22423")) %>% 
  mutate(Date = floor_date(InvoiceDate, unit = "day"))

ecom <- ecom %>% 
  group_by(StockCode, Date) %>% 
  summarise(order = n()) %>% 
  pad() %>% 
  fill_by_value(order, value = 0) %>% 
  ungroup()

4.0.0.1 To visualize total order based on product :

plot2 <- ggplot(data = ecom, aes(x=Date, y=order))+
  geom_line()+
  labs(x = NULL, Y = NULL)+
  facet_wrap(~ StockCode, scale ="free", ncol=1)+
  tidyquant::theme_tq()

plot2

#check the range of the data
range(ecom$Date)
## [1] "2010-12-01 UTC" "2011-12-09 UTC"

The series data start from 2010-12-01 UTC until 2011-12-09 UTC

4.1 Cross Validation Scheme

In this step we need to the following steps as followed :
1. Splitting data into data train and data test.

4.1.1 Splitting Data

In this splitting data, we do not use rolling origin method, we use latest 7 days data series for data validation by the product category. It is because this analysis’ purpose is to forecast total demand for 7 days.

# train-test-size
test_size <- 7 # the total of data validation is one week (7 days) 
train_size <- nrow(ecom)/3 - test_size

# get the min-max of the time index for each sample
test_end <- max(ecom$Date)
test_start <- test_end - days(test_size) + days(1)

train_end <- test_start - days(1)
train_start <- train_end - days(train_size) + days(1)
# get the interval of each samples
intrain <- interval(train_start, train_end)
intest <- interval(test_start, test_end)

intrain
## [1] 2010-12-01 UTC--2011-12-02 UTC
intest
## [1] 2011-12-03 UTC--2011-12-09 UTC

Visualize the train and test series using interval

# plot the train and test
ecom %>% 
  mutate(sample = case_when(
    Date %within% intrain ~ "train",
    Date %within% intest ~ "test"
  )) %>% 
  drop_na() %>% 
  mutate(sample = factor(sample, levels = c("train","test"))) %>% 
  ggplot(aes(x=Date, y = order, color = sample)) +
    geom_line()+
    labs(x=NULL, y=NULL, color = NULL)+
    facet_wrap(~ StockCode, scale = "free", ncol =1)+
    tidyquant::theme_tq()+
    tidyquant::scale_colour_tq()