Company: UK’s based online retail store mainly dealing with wholesale market.These wholesalers are entities buying products in bulk, generally with a low price tag.
Data More customers mean more sales and more business. Intuitively, it is safe to assume that a company’s ability to attract increasing number of customers has a direct relationship to its increasing sales. With the help of our data, we try to forecast and predict the future sales to understand and generate insights about the company’s overal performance- in terms of both wholesaler and customer interaction,
The online retail dataset and its related information can be found on https://archive.ics.uci.edu/ml/datasets/online+retail.
Variation in the data set The data generating process of most sales variables, is a time series function. That being said, the entire variation in the data set might not be attributed only due to the time series nature of it, but there might also be some other variables that affect it (the cross sectional aspect of it).
To begin the project we will install some libraries for data wrangling and we will import the training data set.
options(warn=-1)
library(astsa, quietly=TRUE, warn.conflicts=FALSE)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'forecast'
## The following object is masked from 'package:astsa':
##
## gas
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v stringr 1.4.0
## v tidyr 1.1.3 v forcats 0.5.1
## v readr 2.0.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x lubridate::setdiff() masks base::setdiff()
## x lubridate::union() masks base::union()
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
#Import data
setwd("C:\\Users\\shree\\Downloads")
retail_df <- read.csv("Online Retail.csv")
The dataset has 8 attributes ranging from Invoice (Invoice Number), StockCode (Product Item Code), Description (Product Name), Quantity (Number of items bought), InvoiceDate (Time of Purchase), Customer ID (Unique Customer Identification), Price (Price in Pounds), and Country (Place of Transaction). Additionally, the dataset is supposedly large with 1033036 instances.
We have extracted the date and the hour from the datetime variable and created Invoice Date and Invoice Hour.
str(retail_df)
## 'data.frame': 541909 obs. of 10 variables:
## $ InvoiceNo : chr "536365" "536365" "536365" "536365" ...
## $ StockCode : chr "85123A" "71053" "84406B" "84029G" ...
## $ Description : chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : int 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDateTime: chr "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" "12/1/2010 8:26" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : int 17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
## $ Country : chr "India" "India" "India" "India" ...
## $ InvoiceDate : chr "12/1/2010" "12/1/2010" "12/1/2010" "12/1/2010" ...
## $ InvoiceHour : chr "8:26:00 AM" "8:26:00 AM" "8:26:00 AM" "8:26:00 AM" ...
head(retail_df,5)
## InvoiceNo StockCode Description Quantity
## 1 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
## 2 536365 71053 WHITE METAL LANTERN 6
## 3 536365 84406B CREAM CUPID HEARTS COAT HANGER 8
## 4 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6
## 5 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6
## InvoiceDateTime UnitPrice CustomerID Country InvoiceDate InvoiceHour
## 1 12/1/2010 8:26 2.55 17850 India 12/1/2010 8:26:00 AM
## 2 12/1/2010 8:26 3.39 17850 India 12/1/2010 8:26:00 AM
## 3 12/1/2010 8:26 2.75 17850 India 12/1/2010 8:26:00 AM
## 4 12/1/2010 8:26 3.39 17850 India 12/1/2010 8:26:00 AM
## 5 12/1/2010 8:26 3.39 17850 India 12/1/2010 8:26:00 AM
Extracting Date and Hour Columns: We have extracted the date and the hour of the day columns from the datetime field.
Feature Engineering: We created a new feature: Total Order price= No of Units sold* Price Per Unit and wish to study the behavior of this variable across time.
retail_df<-na.omit(retail_df)
retail_df<-retail_df%>%filter(Quantity>0)
retail_df<-retail_df%>%mutate(price=Quantity*UnitPrice)
DataTS<-retail_df%>%mutate(price=Quantity*UnitPrice)%>%
group_by(InvoiceDate)%>%summarize(sum(price))
names(DataTS)<-c('date','TotalOrderPrice')
str(DataTS)
## tibble [305 x 2] (S3: tbl_df/tbl/data.frame)
## $ date : chr [1:305] "1/10/2011" "1/11/2011" "1/12/2011" "1/13/2011" ...
## $ TotalOrderPrice: num [1:305] 15347 60269 17154 15171 39532 ...
DataTS<-DataTS%>%mutate(Date=as.Date(date, format = "%m/%d/%Y"))%>%select(c(TotalOrderPrice,Date))
Time series Plot At A Glance
DataTS %>%
ggplot() +
geom_line(aes(Date, TotalOrderPrice)) +
theme_bw() +
ggtitle("Total Order Price Over time") +
theme(panel.grid=element_blank())+
ylab("WholeSale Prices") +
xlab("Date")
Insights:
We will be diving deep into it with the help of Facebook Prophet Model.
options(warn=-1)
library(prophet)
## Loading required package: Rcpp
## Loading required package: rlang
##
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
##
## %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
## flatten_lgl, flatten_raw, invoke, list_along, modify, prepend,
## splice
prophet_data <- DataTS %>%
mutate(ds= Date ,y = TotalOrderPrice)
model<-prophet(prophet_data)
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
Forecasting for the next 3 months data.
future = make_future_dataframe(model,periods = 90)
forecast = predict(model,future)
plot(model,forecast)+
ylab("Total Order Price Forecast")+xlab("Date")+theme_bw()
options(warn=-1)
prophet_plot_components(model,forecast)
Insights: a) We observe an increading trend with time- more wholesalers transacting on the platform. b) It is no surprise that on Sundays have lesser sales because we are dealing with wholesale market data. These might not be operational on Sundays.
options(warn=-1)
plot(model,forecast)+add_changepoints_to_plot(model)+theme_bw()+xlab("Date")+ylab("Total Order Price")
Insights: a) From the change points we observe that there has been a sudden increase from August 2011 onwards. These might be because more wholesalers started their transactions from our platform.
prophet_data$floor = 0
prophet_data$cap = 100000
future$floor = 0
future$cap = 100000
sat_model <- prophet(prophet_data,growth='logistic')
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
sat_2months_forecast = predict(sat_model,future)
plot(sat_model,sat_2months_forecast)+ylim(0,100000)+
theme_bw()+xlab("Date")+ylab("Total Order Price")
prophet_plot_components(sat_model, sat_2months_forecast)
multi = prophet(prophet_data,seasonality.mode = 'multiplicative',growth='logistic')
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
multi_fcst = predict(multi,future)
plot(multi,multi_fcst)+ylim(0,100000)+
theme_bw()+xlab("Date")+ylab("Total Order Price")
prophet_plot_components(multi, multi_fcst)
Decision: We go ahead with a multiplicative model over an additive model, as we observe that the variations keep increasing over time.
model = prophet(prophet_data,seasonality.mode = 'multiplicative',growth='logistic',fit=FALSE)
model = add_country_holidays(model,country_name = 'UK')
model = fit.prophet(model,prophet_data)
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
forecast = predict(model,future)
prophet_plot_components(model,forecast)
Insights: a) Saturdays are having peaks while Sundays having a decline.This can happen because the wholesalers stock up on Saturdays. b) We also observe dips during holidays- wholesalers do not work on holidays. C) An overall increasing trend is observed with time.
options(warn =-1)
forecast_metric_data = forecast %>%
as_tibble() %>%
mutate(ds = as.Date(ds))
RMSE = sqrt(mean((prophet_data$y - forecast_metric_data$yhat)^2))
MAE = mean(abs(prophet_data$y - forecast_metric_data$yhat))
MAPE = mean(abs((prophet_data$y - forecast_metric_data$yhat)/prophet_data$y))
print(paste("RMSE:",round(RMSE,2)))
## [1] "RMSE: 60967.81"
print(paste("MAE:",round(MAE,2)))
## [1] "MAE: 25204.31"
print(paste("MAPE:",round(MAPE,2)))
## [1] "MAPE: 1.06"
I am starting with training on the first 100 days, and then predicting for the next 30days. After every step, we keep adding 30more days to the training dataset.
options(warn = -1)
df.cv <- cross_validation(model, initial = 100, period = 30, horizon = 30, units = 'days')
## Making 9 forecasts with cutoffs between 2011-03-14 and 2011-11-09
df.cv %>%
ggplot()+
geom_point(aes(ds,y)) +
geom_point(aes(ds,yhat,color=factor(cutoff)))+
theme_bw()+
xlab("Date")+
ylab("TOtal Order Price")+
scale_color_discrete(name = 'Cutoff')
p1<-plot_cross_validation_metric(df.cv, metric = 'rmse')
p2<-plot_cross_validation_metric(df.cv, metric = 'mae')
p3<-plot_cross_validation_metric(df.cv, metric = 'mape')
gridExtra::grid.arrange(p1, p2,p3, nrow = 1)
Insights: