Nowadays, people behavior on shopping are shifting to online shopping. Therefore, number of transaction in online shopping has been increasing and it is predicted to continue the positive trend in the following years according to statistica.
Therefore, it’is important to know the potential sales of their product in the future. In this chance I would like to make forecasting of data e-commerce sales of one of retailer located in United Kingdom. The retailer sells various products from stationary until home products. The data are from 12/1/2010 until 12/9/2011.
Based on that it is important to know what products ordered the most by customer and to predict how the demand of the products in the following week since december is coming, and the transaction made by this retailer are increasing.
library(readr) #to read data
library(DT) #to visualize data in table
library(lubridate) #to dea with data
library(tidyverse) #for data wrangling
library(dplyr) #for data wrangling
library(ggplot2) #for basic EDA
library(TSstudio) #time series library
library(padr) # for padding
library(forecast) # for forecasting
library(tseries) # for adf.test
library(MLmetrics)#for calculating error
library(yardstick) #for measuing forecast performance
library(purrr) #for functional programming
library(tidyquant) #for some ggplot aesthetic
library(recipes) #for data preprocess
library(zoo) #for fill NA data
library(tidyr) #for function spread/nest
library(recipes) #for function ts and msts
library(tibble) #for function enframe
library(tidymodels) #for function rmse_vec
library(png) #for function readPNG
library(grid) #for function grid.raster
The data was obtained from kaggle which consist of 8 variables; invoice number, Stock Code, Product Description, Quantity, Invoice Date, Unit Price, Customer ID, and Country. The data from 12/1/2010 until 12/9/2011.
sales <- read_csv("data.csv")
sales
Here I delve specific details of variables above :
* InvoiceNo : invoice number
* StockCode : code of product ordered by customer
* Description : description of the product
* Quantity : quantity ordered
* InvoiceDate : Date of order made
* UnitPrice : Price of the product
* CustomerID : ID of the customer
* Country : Country where the customer from / make the order
# check data structure
str(sales)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 541909 obs. of 8 variables:
## $ InvoiceNo : chr "536365" "536365" "536365" "536365" ...
## $ StockCode : chr "85123A" "71053" "84406B" "84029G" ...
## $ Description: chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : num 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: chr "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : num 17850 17850 17850 17850 17850 ...
## $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
## - attr(*, "spec")=
## .. cols(
## .. InvoiceNo = col_character(),
## .. StockCode = col_character(),
## .. Description = col_character(),
## .. Quantity = col_double(),
## .. InvoiceDate = col_character(),
## .. UnitPrice = col_double(),
## .. CustomerID = col_double(),
## .. Country = col_character()
## .. )
The invoice date should be convert into Date format.
#change `invoiceDate' format to Date
sales <- sales %>%
mutate(InvoiceDate = mdy_hm(InvoiceDate))
str(sales)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 541909 obs. of 8 variables:
## $ InvoiceNo : chr "536365" "536365" "536365" "536365" ...
## $ StockCode : chr "85123A" "71053" "84406B" "84029G" ...
## $ Description: chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : num 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: POSIXct, format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : num 17850 17850 17850 17850 17850 ...
## $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
The invoice date is converted into date format.
# checking if there is missing data value
colSums(is.na(sales))
## InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
## 0 0 1454 0 0 0
## CustomerID Country
## 135080 0
There are 135080 customers are unknown! In addition, we have 1454 of missing descriptions.
# checking summary of the data
summary(sales)
## InvoiceNo StockCode Description Quantity
## Length:541909 Length:541909 Length:541909 Min. :-80995.00
## Class :character Class :character Class :character 1st Qu.: 1.00
## Mode :character Mode :character Mode :character Median : 3.00
## Mean : 9.55
## 3rd Qu.: 10.00
## Max. : 80995.00
##
## InvoiceDate UnitPrice CustomerID
## Min. :2010-12-01 08:26:00 Min. :-11062.06 Min. :12346
## 1st Qu.:2011-03-28 11:34:00 1st Qu.: 1.25 1st Qu.:13953
## Median :2011-07-19 17:17:00 Median : 2.08 Median :15152
## Mean :2011-07-04 13:34:57 Mean : 4.61 Mean :15288
## 3rd Qu.:2011-10-19 11:27:00 3rd Qu.: 4.13 3rd Qu.:16791
## Max. :2011-12-09 12:50:00 Max. : 38970.00 Max. :18287
## NA's :135080
## Country
## Length:541909
## Class :character
## Mode :character
##
##
##
##
The UnitPrice and Quantity should not be minus value. Therefore we need to remove the minus data, since it shows that the goods sold were damaged or returned.
#drop minus data from variable Quantity and Unit Price
sales <- sales %>%
filter(Quantity >= 0,
UnitPrice >= 1)
#check the summary of the data
summary(sales)
## InvoiceNo StockCode Description Quantity
## Length:419211 Length:419211 Length:419211 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.: 1.00
## Mode :character Mode :character Mode :character Median : 3.00
## Mean : 8.28
## 3rd Qu.: 8.00
## Max. :80995.00
##
## InvoiceDate UnitPrice CustomerID
## Min. :2010-12-01 08:26:00 Min. : 1.000 Min. :12346
## 1st Qu.:2011-03-23 14:55:00 1st Qu.: 1.650 1st Qu.:13975
## Median :2011-07-17 15:47:00 Median : 2.890 Median :15153
## Mean :2011-07-02 12:13:52 Mean : 4.776 Mean :15297
## 3rd Qu.:2011-10-18 17:00:00 3rd Qu.: 4.950 3rd Qu.:16805
## Max. :2011-12-09 12:50:00 Max. :13541.330 Max. :18287
## NA's :114300
## Country
## Length:419211
## Class :character
## Mode :character
##
##
##
##
The data now is clean. so we can proceed to the next step.
sales_agg <- sales %>%
mutate(Date = floor_date(InvoiceDate, unit = "month")) %>%
group_by(Date) %>%
summarise(order = n()) %>%
pad() %>%
fill_by_value(order, value = 0) %>%
ungroup()
sales_agg_fil <- sales_agg %>%
filter(Date < "2011-12-01")
#Visualize number of order made from Dec 2010 - Nov 2011
plot1 <- ggplot(data = sales_agg_fil, aes(x=Date, y=order, label = order))+
geom_line()+
theme_classic()+
geom_label(aes(fill = order), colour = "white", fontface = "bold")+
labs(title = "Total Order Made in a UK Online Retailer Platform",
subtitle = "Data from Dec 2010 - Jan 2011",
x = "Date",
y = "Number of Order")
plot1
Based on the picture above, we know that transaction data of this retailer face increasing trend especially, for the end of 2011. However, if take a look back to the end of 2010, we know that the total order were 35,116 before it decreased in the following months. Therefore, it is important to know the total demand that could happen in Dec 2011, since it’s about to christmas and new year where people are tend to increase their shopping. Furthermore, the retailer should get prepared for this. In this chance, I would like to forecast the demand for top 3 products sold by this retailer along that period.
Besides, sell in the United Kingdom, this retailer also sell its product abroad. Here is the top 10 of country that ordered to this retailer.
# Top 10 Country Made Order
sales_country <- sales %>%
group_by(Country) %>%
summarise(order = n()) %>%
ungroup() %>%
arrange(desc(order)) %>%
head(10)
plot_top_country <- ggplot(data = sales_country, aes(x=reorder(Country, order),
y = order, label = order))+
geom_col(aes(fill = order), show.legend = T)+
coord_flip()+
theme_bw()+
theme(axis.text = element_text(size = 12), axis.title = element_text(size = 18, colour = "black"))+
geom_label(aes(fill = order),
colour = "white",
fontface = "bold",
size = 5,
position = position_stack(1))+
labs(title = "Number of Order made Within Country",
subtitle = "Top 10 Country Made order at a UK Online Retailer Platform from Dec 2010 - Dec 2011",
x = "Country",
y = "Number of Order")
plot_top_country
Based on the picture above, we can see that majority products of this UK online retailer were sold to people in the UK. Several products were exported to Germany (6.923), France (6.382) and EIRE/ Ireland (6.183).
# Top 10 category product Made at E-commerce
sales_product <- sales %>%
group_by(StockCode) %>%
summarise(order = n()) %>%
ungroup() %>%
arrange(desc(order)) %>%
head(10)
sales_product <- sales_product %>%
mutate(StockCode = case_when(StockCode == "85123A" ~ "White Hanging",
StockCode == "85099B" ~ "Jumbo Bag Red",
StockCode == "22423" ~ "Regency CakeStand",
StockCode == "47566" ~ "Tea Time Pantry Bunting",
StockCode == "20725" ~ "Lunch Bag Retrospot",
StockCode == "84879" ~ "Assorted Colour bird ornament",
StockCode == "22720" ~ "Set of Cake Tins Party Design",
StockCode == "20727" ~ "Lunch Bag Black Skull",
StockCode == "22383" ~ "Lunch Bag Suki Design",
StockCode == "22457" ~ "Natural Slate Heart Chalkboard",
StockCode == "23843" ~ "Paper Craft"))
plot_product <- ggplot(data = sales_product, aes(x=reorder(StockCode, order),
y = order, label = order))+
geom_col(aes(fill = order), show.legend = T)+
coord_flip()+
theme_bw()+
theme(axis.text = element_text(size = 12), axis.title = element_text(size = 18, colour = "black"))+
geom_label(aes(fill = order),
colour = "white",
fontface = "bold",
size = 5,
position = position_stack(0.8))+
labs(title = "Number of Order Based on Category Product",
subtitle = "Top 10 Product Ordered at E-commerce Platform from Dec 2010 - Dec 2011",
x = "Product Category",
y = "Number of Order")
plot_product
White hanging (85123A), Jumbo Bag Red (85099B) and Regency CakeStand (22423) is top 3 product ordered in this retailer platform. Therefore for the forecasting we focus on this three products.
ecom <- sales %>%
filter(StockCode %in% c("85123A","85099B","22423")) %>%
mutate(Date = floor_date(InvoiceDate, unit = "day"))
ecom <- ecom %>%
group_by(StockCode, Date) %>%
summarise(order = n()) %>%
pad() %>%
fill_by_value(order, value = 0) %>%
ungroup()
plot2 <- ggplot(data = ecom, aes(x=Date, y=order))+
geom_line()+
labs(x = NULL, Y = NULL)+
facet_wrap(~ StockCode, scale ="free", ncol=1)+
tidyquant::theme_tq()
plot2
#check the range of the data
range(ecom$Date)
## [1] "2010-12-01 UTC" "2011-12-09 UTC"
The series data start from 2010-12-01 UTC until 2011-12-09 UTC
In this step we need to the following steps as followed :
1. Splitting data into data train and data test.
In this splitting data, we do not use rolling origin method, we use latest 7 days data series for data validation by the product category. It is because this analysis’ purpose is to forecast total demand for 7 days.
# train-test-size
test_size <- 7 # the total of data validation is one week (7 days)
train_size <- nrow(ecom)/3 - test_size
# get the min-max of the time index for each sample
test_end <- max(ecom$Date)
test_start <- test_end - days(test_size) + days(1)
train_end <- test_start - days(1)
train_start <- train_end - days(train_size) + days(1)
# get the interval of each samples
intrain <- interval(train_start, train_end)
intest <- interval(test_start, test_end)
intrain
## [1] 2010-12-01 UTC--2011-12-02 UTC
intest
## [1] 2011-12-03 UTC--2011-12-09 UTC
Visualize the train and test series using interval
# plot the train and test
ecom %>%
mutate(sample = case_when(
Date %within% intrain ~ "train",
Date %within% intest ~ "test"
)) %>%
drop_na() %>%
mutate(sample = factor(sample, levels = c("train","test"))) %>%
ggplot(aes(x=Date, y = order, color = sample)) +
geom_line()+
labs(x=NULL, y=NULL, color = NULL)+
facet_wrap(~ StockCode, scale = "free", ncol =1)+
tidyquant::theme_tq()+
tidyquant::scale_colour_tq()