For many years, a conservative retail sales promo’s decision depends on what has competitors been doing. Some brands can follow their competitor’s trail, but many has other considerations to down-scale the support budget, optimized it according to their particular audience. With many internal and external factors influence the sales like price MarkDown or Consumer Price Index, there should be a model to forecast a sales gain so in the future stakeholders can be more careful in supporting future’s Sales Promo.
Capstone Metric
I’m interested to answer the question if Machine Learning could help a real case of Sales Promo based on Sales data by a store in the USA, to see if the model can be applied to Indonesia’s Retail as well. This can be a great consideration so any brand or retail can use a pretty wise budget support fort them confidently running the sales promo, or, to hold any promo if there is a predictable variable that giving a bad influence to the model, making a bad sales.
In this project I will try to predict on how effective a price markdown to increase sales using Retail Dataset from Kaggle here https://www.kaggle.com/manjeetsingh/retaildataset. The dataset consists of historical sales data of 45 stores with each one containing a number of departments. The company also runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. The dataset is ordinarily an excel sheet, with 3 tabs but Manjeet Singh has changed it into three csv format: Stores, Features, Sales. I will evaluate further whether or not to use whicch variables that is very significant in the Explorating Data Analysis chapter.
But first, let us peek a little about the data here:
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)
library(plotly)
library(TSstudio)
library(forecast)
library(xts)
library(corrplot)# importing data
stores <- read_csv("retaildataset/stores data-set.csv")
features <- read_csv("retaildataset/Features data set.csv")
sales <- read_csv("retaildataset/sales data-set.csv")Here are what is in each of the column:
STORES
- Anonymized information about the 45 stores, indicating the type and size of store
FEATURES
- Store - the store number
- Date - the week
- MarkDown1-5 - anonymized data related to promotional markdowns. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA
- CPI - the consumer price index
- Unemployment - the unemployment rate
- IsHoliday -whether the week is a special holiday week
SALES
Historical sales data, which covers to 2010-02-05 to 2012-11-01. Within this tab you will find the following fields:
- Store - the store number
- Dept - the department number
- Date - the week
- Weekly_Sales - sales for the given department in the given store
- IsHoliday - whether the week is a special holiday week
Since the Storesdataset didn’t giving meaningful information about the seasonal sales, I’m going to take it down. Also, the Features and Sales dataset has different row, so I will pre-processing the data later and combine them all.
FEATURES Dataset
Temperature, Fuel_Price, Unemployment and IsHoliday, I wil take them down and then will change the data class on several columnsfeatures_clean <- features %>%
select(-Temperature, -Fuel_Price, -Unemployment, -IsHoliday) %>%
mutate(Store = as.factor(Store),
Date = as_date(dmy(Date)))
features_clean <- features_clean[order(features_clean$Date),]
str(features_clean)## Classes 'tbl_df', 'tbl' and 'data.frame': 8190 obs. of 8 variables:
## $ Store : Factor w/ 45 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Date : Date, format: "2010-02-05" "2010-02-05" ...
## $ MarkDown1: num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown2: num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown3: num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown4: num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown5: num NA NA NA NA NA NA NA NA NA NA ...
## $ CPI : num 211 211 214 126 212 ...
## Store Date MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI
## 0 0 4158 5269 4577 4726 4140 585
For missing value on each of MarkDown and CPI, I will replace them with 0.
features_zero <- features_clean %>%
mutate(MarkDown1 = replace_na(MarkDown1,0),
MarkDown2 = replace_na(MarkDown2,0),
MarkDown3 = replace_na(MarkDown3,0),
MarkDown4 = replace_na(MarkDown4,0),
MarkDown5 = replace_na(MarkDown5,0),
CPI = replace_na(CPI,0))
head(features_zero)features_o <- features_zero %>%
group_by(Store, Date) %>%
summarise(MarkDown1 = mean(MarkDown1),
MarkDown2 = mean(MarkDown2),
MarkDown3 = mean(MarkDown3),
MarkDown4 = mean(MarkDown4),
MarkDown5 = mean(MarkDown5),
CPI = mean(CPI))SALES dataset
Dept since it doesn’t define what kind of each Dept is, and IsHoliday, then change the class on several columnssales_clean <- sales %>%
select(-IsHoliday) %>%
mutate(Store = as.factor(Store),
Dept = as.factor(Dept),
Date = as_date(dmy(Date)))
head(sales_clean)On sales_clean dataset, the Weekly_Sales filled in based on each department. Since there is no definition on what kind of department they are, each Dept means the same. So Next, I will combine (SUM) the Weekly_Sales based on the same Date.
Weekly_Sales by the same DateSALES dataset - See how the Weekly_Sales going in Store 1
sales_store1 <- sales_fin %>% filter(Store == 1)
ggplot(data = sales_store1, aes(x = Date, y = Weekly_Sales, colour = Store)) +
geom_line(show.legend = F) +
ggtitle("Weekly Sales on Store 1")
During end of year, there are peaks on sales from each of the sample Store 1 and Store 2, indicates the Christmas season is the highest sales in each year of 2010 and 2011.
features_store1 <- features_o %>% filter(Store == c(1,2))
ggplot(data = features_store1, aes(x = Date, y = MarkDown1, colour = Store)) +
geom_line()
There is no Weekly_sales information in the year of 2010 and 2011, but the MarkDown1 recorded from last quarter of 2012 and we can see similar path with the sales_fin data that during Christmas, Store 1 and Store 2 create a MarkDown (or price down) to induce sales. I had try all the 5 MarkDown and the MarkDown1 is the most suitable causal for induce sales. I will later use the MarkDown1 to be see if this factor can be a good predictor to induce Sales during Holiday like Christmas.
Let see if the period between several week between the MarkDown data reflecting positively with the Weekly Sales data.
# subset feature_o into Oct 2011 - Oct 2012 to make the same row length with sales_store1
features_store1 <- features_o %>% filter(Store == 1) %>% filter(Date <= "2012-10-26")marketing_campaign <- cbind.data.frame(features_store1, sales_store1$Weekly_Sales)
mkt_plot <- ggplot(marketing_campaign, aes(x = Date, y=sales_store1$Weekly_Sales)) +
geom_col(aes(fill=MarkDown1)) +
geom_line(color="orange") +
ylab("Weekly Sales")
ggplotly(mkt_plot)Let’s see the data distribution on marketing_campaign
# hHistogram of Weekly Sales data
ggplot(marketing_campaign, aes(x=`sales_store1$Weekly_Sales`, fill="count")) + geom_histogram(color="black", show.legend = F) + xlab("Weekly Sales")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Time Series Plot
mkt_ts <- ts(data = marketing_campaign$`sales_store1$Weekly_Sales`, frequency= 4, deltat = 1/52)
ts_info(mkt_ts)## The mkt_ts series is a ts object with 1 variable and 143 observations
## Frequency: 4
## Start time: 1 1
## End time: 36 3
Insights from these plots are:
1. The histogram resume that the data is skewed to 1,600,000 USD
2. From the boxplot, there are outliers valued more that 1,900,000 USD
3. From the Time Series Plot, the trend within 3 years is additive.
4. The decompose function shows there is seasonality annualy (around Christmas)
Let’s see if other variable predictors have correlation between each of them to avoid multicolinearity. I’ll use corrplot().
All the numerical variables has low correlation with
Weekly Sales data, it is safe to say that we can use the CPI or Markdown price to be the predictor variables in the future model.