Online E-commerce Data: A Time Series Analysis Using the Prophet Model

About the Data:

Company: A UK-based online retail store that mainly deals with the wholesale market. These wholesalers are entities that buy products in bulk, generally at a low price per unit.

Data: More customers mean more sales and more business. Intuitively, it is safe to assume that a company's ability to attract an increasing number of customers is directly related to its sales growth. With the help of our data, we try to forecast future sales and generate insights about the company's overall performance, in terms of both wholesaler and customer interaction.

The online retail dataset and its related information can be found at https://archive.ics.uci.edu/ml/datasets/online+retail.

Variation in the data set: The data-generating process of most sales variables is a time series. That said, not all of the variation in the data set need be attributable to its time series nature; other variables may also affect it (the cross-sectional aspect).

To begin the project, we load some libraries for data wrangling and import the data set.

Importing packages

options(warn=-1)
library(astsa, quietly=TRUE, warn.conflicts=FALSE)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## 
## Attaching package: 'forecast'
## The following object is masked from 'package:astsa':
## 
##     gas
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v stringr 1.4.0
## v tidyr   1.1.3     v forcats 0.5.1
## v readr   2.0.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

Import the data

#Import data
setwd("C:\\Users\\shree\\Downloads")
retail_df <- read.csv("Online Retail.csv")

The dataset has 8 original attributes: InvoiceNo (invoice number), StockCode (product item code), Description (product name), Quantity (number of items bought), InvoiceDate (time of purchase), UnitPrice (price per unit, in pounds sterling), CustomerID (unique customer identifier), and Country (place of transaction). The file loaded here contains 541,909 instances, as shown below.

We have also extracted the date and the hour of the day from the datetime variable, creating the InvoiceDate and InvoiceHour columns.
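These two columns are already present in the CSV used here; purely for illustration, here is a minimal lubridate-based sketch (with a made-up example timestamp) of how they could be derived from the raw datetime string:

# Hypothetical illustration only -- not run on retail_df above
example_dt <- mdy_hm("12/1/2010 8:26")   # parse "m/d/Y H:M" text to POSIXct
format(example_dt, "%m/%d/%Y")           # "12/01/2010"  -> InvoiceDate
format(example_dt, "%I:%M:%S %p")        # "08:26:00 AM" -> InvoiceHour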

str(retail_df)
## 'data.frame':    541909 obs. of  10 variables:
##  $ InvoiceNo      : chr  "536365" "536365" "536365" "536365" ...
##  $ StockCode      : chr  "85123A" "71053" "84406B" "84029G" ...
##  $ Description    : chr  "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity       : int  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDateTime: chr  "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" ...
##  $ UnitPrice      : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID     : int  17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
##  $ Country        : chr  "India" "India" "India" "India" ...
##  $ InvoiceDate    : chr  "12/1/2010" "12/1/2010" "12/1/2010" "12/1/2010" ...
##  $ InvoiceHour    : chr  "8:26:00 AM" "8:26:00 AM" "8:26:00 AM" "8:26:00 AM" ...

A glimpse of the data

head(retail_df,5)
##   InvoiceNo StockCode                         Description Quantity
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6
## 2    536365     71053                 WHITE METAL LANTERN        6
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6
##   InvoiceDateTime UnitPrice CustomerID Country InvoiceDate InvoiceHour
## 1  12/1/2010 8:26      2.55      17850   India   12/1/2010  8:26:00 AM
## 2  12/1/2010 8:26      3.39      17850   India   12/1/2010  8:26:00 AM
## 3  12/1/2010 8:26      2.75      17850   India   12/1/2010  8:26:00 AM
## 4  12/1/2010 8:26      3.39      17850   India   12/1/2010  8:26:00 AM
## 5  12/1/2010 8:26      3.39      17850   India   12/1/2010  8:26:00 AM

Data pre-processing

Extracting Date and Hour Columns: We extracted the date and the hour-of-day columns from the datetime field.

Feature Engineering: We created a new feature, Total Order Price = number of units sold × price per unit, and wish to study the behavior of this variable over time.

# Drop rows with missing values and keep only positive quantities
# (negative quantities typically indicate cancellations/returns)
retail_df <- na.omit(retail_df)
retail_df <- retail_df %>% filter(Quantity > 0)
# Total order price = number of units sold * price per unit
retail_df <- retail_df %>% mutate(price = Quantity * UnitPrice)
# Aggregate to daily total order price
DataTS <- retail_df %>%
  group_by(InvoiceDate) %>%
  summarize(sum(price))
names(DataTS) <- c('date', 'TotalOrderPrice')
str(DataTS)
## tibble [305 x 2] (S3: tbl_df/tbl/data.frame)
##  $ date           : chr [1:305] "1/10/2011" "1/11/2011" "1/12/2011" "1/13/2011" ...
##  $ TotalOrderPrice: num [1:305] 15347 60269 17154 15171 39532 ...
# Parse the character date and keep only the columns needed for the time series
DataTS <- DataTS %>% mutate(Date = as.Date(date, format = "%m/%d/%Y")) %>% select(TotalOrderPrice, Date)

Time Series Plot at a Glance

DataTS %>%
  ggplot() +
  geom_line(aes(Date, TotalOrderPrice)) +
  theme_bw() +
  ggtitle("Total Order Price Over time") +
  theme(panel.grid=element_blank())+
  ylab("WholeSale Prices") +
  xlab("Date")

Insights:

  1. The data shows large fluctuations whose spread changes over time: it is variance non-stationary.
  2. There appears to be an increasing trend in the data: it is mean non-stationary (see the quick check below).
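
As an informal check of the mean non-stationarity claim, ndiffs() from the already-loaded forecast package estimates how many differences the series would need to become stationary (a value of 1 or more is consistent with a non-stationary mean; output not shown here):

# Estimated number of differences required for stationarity (KPSS test by default)
ndiffs(DataTS$TotalOrderPrice)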

We will dive deeper into this with the help of the Facebook Prophet model.
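
Prophet decomposes the series as y(t) = g(t) + s(t) + h(t) + ε(t), where g(t) is a piecewise trend, s(t) captures periodic (e.g. weekly and yearly) seasonality, h(t) captures holiday effects, and ε(t) is the error term.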

Fitting the Prophet model

options(warn=-1)
library(prophet)
## Loading required package: Rcpp
## Loading required package: rlang
## 
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
## 
##     %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
##     flatten_lgl, flatten_raw, invoke, list_along, modify, prepend,
##     splice
# Prophet expects a data frame with columns ds (date) and y (value to forecast)
prophet_data <- DataTS %>%
  mutate(ds = Date, y = TotalOrderPrice)
model <- prophet(prophet_data)
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.

Forecasting the next 3 months (90 days).

# Extend the history 90 days into the future and generate the forecast
future = make_future_dataframe(model, periods = 90)
forecast = predict(model, future)

plot(model,forecast)+
ylab("Total Order Price Forecast")+xlab("Date")+theme_bw()

options(warn=-1)
prophet_plot_components(model,forecast)

Insights: a) We observe an increasing trend with time: more wholesalers are transacting on the platform. b) It is no surprise that Sundays have lower sales; since we are dealing with wholesale market data, wholesalers may not be operational on Sundays.

options(warn=-1)
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Total Order Price")

Insights: a) From the changepoints, we observe a sudden increase from August 2011 onwards. This might be because more wholesalers started transacting on our platform.
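
If the fitted trend looks too rigid or too jumpy around such shifts, the changepoint flexibility can be tuned through changepoint.prior.scale (default 0.05). A minimal sketch with an illustrative, untuned value and a hypothetical flexible_model object:

# Illustrative only: a larger changepoint.prior.scale lets the trend react faster
flexible_model <- prophet(prophet_data, changepoint.prior.scale = 0.5)
flexible_fcst  <- predict(flexible_model, make_future_dataframe(flexible_model, periods = 90))
plot(flexible_model, flexible_fcst) + add_changepoints_to_plot(flexible_model)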

We also fit a saturating-growth model: Prophet's logistic growth requires a carrying capacity (cap), and optionally a floor, on both the training data and the future dataframe.

# Saturating (logistic) growth needs cap and floor columns in both data frames
prophet_data$floor = 0
prophet_data$cap = 100000
future$floor = 0
future$cap = 100000
sat_model <- prophet(prophet_data, growth = 'logistic')
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
sat_forecast = predict(sat_model, future)

plot(sat_model, sat_forecast) + ylim(0, 100000) +
  theme_bw() + xlab("Date") + ylab("Total Order Price")

prophet_plot_components(sat_model, sat_forecast)

# Refit with multiplicative seasonality so that seasonal effects scale with the trend
multi = prophet(prophet_data, seasonality.mode = 'multiplicative', growth = 'logistic')
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
multi_fcst = predict(multi,future)

plot(multi,multi_fcst)+ylim(0,100000)+
  theme_bw()+xlab("Date")+ylab("Total Order Price")

prophet_plot_components(multi, multi_fcst)

Decision: We go ahead with the multiplicative model over the additive model, as we observe that the seasonal variations grow with the level of the series over time.
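
With seasonality.mode = 'multiplicative', the seasonal and holiday components scale with the trend, roughly y(t) = g(t) * (1 + s(t) + h(t)) + ε(t), rather than being added as a level-independent amount.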

Final Model

# Build the final model without fitting, add UK holidays, then fit
model = prophet(prophet_data, seasonality.mode = 'multiplicative', growth = 'logistic', fit = FALSE)
model = add_country_holidays(model, country_name = 'UK')
model = fit.prophet(model, prophet_data)
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
forecast = predict(model,future)
prophet_plot_components(model,forecast)

Insights: a) Saturdays show peaks while Sundays show a decline. This can happen because wholesalers stock up on Saturdays. b) We also observe dips during holidays: wholesalers do not work on holidays. c) An overall increasing trend is observed over time.
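
As a quick cross-check of this weekday pattern directly from the aggregated data (a sketch; output not shown here, and wday() comes from the already-loaded lubridate package):

# Average daily total order price by day of the week
DataTS %>%
  mutate(weekday = wday(Date, label = TRUE)) %>%
  group_by(weekday) %>%
  summarize(avg_total_order_price = mean(TotalOrderPrice))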

Model Performance

options(warn =-1)
forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  # keep only in-sample dates and align actuals (y) with fitted values (yhat) by date;
  # the forecast also contains the 90 future days, which have no actuals to compare against
  inner_join(prophet_data %>% select(ds, y), by = "ds")


RMSE = sqrt(mean((forecast_metric_data$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(forecast_metric_data$y - forecast_metric_data$yhat))
MAPE = mean(abs((forecast_metric_data$y - forecast_metric_data$yhat)/forecast_metric_data$y))
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 60967.81"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 25204.31"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 1.06"

Cross-Validation on Time Series Data

We start by training on the first 100 days and predicting the next 30 days. After every step, we add 30 more days to the training data set.

options(warn = -1)
# initial = size of the first training window, period = spacing between cutoffs, horizon = forecast length
df.cv <- cross_validation(model, initial = 100, period = 30, horizon = 30, units = 'days')
## Making 9 forecasts with cutoffs between 2011-03-14 and 2011-11-09
df.cv %>% 
  ggplot()+
  geom_point(aes(ds,y)) +
  geom_point(aes(ds,yhat,color=factor(cutoff)))+
  theme_bw()+
  xlab("Date")+
  ylab("TOtal Order Price")+
  scale_color_discrete(name = 'Cutoff')

p1<-plot_cross_validation_metric(df.cv, metric = 'rmse')
p2<-plot_cross_validation_metric(df.cv, metric = 'mae')
p3<-plot_cross_validation_metric(df.cv, metric = 'mape')
gridExtra::grid.arrange(p1, p2,p3, nrow = 1)
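
The same cross-validation errors can also be summarized in a table by forecast horizon using performance_metrics() (a sketch; output not shown here):

# Tabulate cross-validation error metrics (rmse, mae, mape, coverage) by horizon
cv_metrics <- performance_metrics(df.cv)
head(cv_metrics)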

Insights: