Sales Promo

Background

For many years, a conservative retail sales promo’s decision depends on what has competitors been doing. Some brands can follow their competitor’s trail, but many has other considerations to down-scale the support budget, optimized it according to their particular audience. With many internal and external factors influence the sales like price MarkDown or Consumer Price Index, there should be a model to forecast a sales gain so in the future stakeholders can be more careful in supporting future’s Sales Promo.

Capstone Metric

I’m interested to answer the question if Machine Learning could help a real case of Sales Promo based on Sales data by a store in the USA, to see if the model can be applied to Indonesia’s Retail as well. This can be a great consideration so any brand or retail can use a pretty wise budget support fort them confidently running the sales promo, or, to hold any promo if there is a predictable variable that giving a bad influence to the model, making a bad sales.

In this project I will try to predict on how effective a price markdown to increase sales using Retail Dataset from Kaggle here https://www.kaggle.com/manjeetsingh/retaildataset. The dataset consists of historical sales data of 45 stores with each one containing a number of departments. The company also runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. The dataset is ordinarily an excel sheet, with 3 tabs but Manjeet Singh has changed it into three csv format: Stores, Features, Sales. I will evaluate further whether or not to use whicch variables that is very significant in the Explorating Data Analysis chapter.

But first, let us peek a little about the data here:

Libraries

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)
library(plotly)
library(TSstudio)
library(forecast)
library(xts)
library(corrplot)

Importing Data

# importing data
stores <- read_csv("retaildataset/stores data-set.csv")
features <- read_csv("retaildataset/Features data set.csv")
sales <-  read_csv("retaildataset/sales data-set.csv")

Here are what is in each of the column:

STORES
- Anonymized information about the 45 stores, indicating the type and size of store
FEATURES
- Store - the store number
- Date - the week
- MarkDown1-5 - anonymized data related to promotional markdowns. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA
- CPI - the consumer price index
- Unemployment - the unemployment rate
- IsHoliday -whether the week is a special holiday week

SALES
Historical sales data, which covers to 2010-02-05 to 2012-11-01. Within this tab you will find the following fields:
- Store - the store number
- Dept - the department number
- Date - the week
- Weekly_Sales - sales for the given department in the given store
- IsHoliday - whether the week is a special holiday week

Since the Storesdataset didn’t giving meaningful information about the seasonal sales, I’m going to take it down. Also, the Features and Sales dataset has different row, so I will pre-processing the data later and combine them all.

Preprocessing Data

FEATURES Dataset

Since we don’t need Temperature, Fuel_Price, Unemployment and IsHoliday, I wil take them down and then will change the data class on several columns

features_clean <- features %>% 
  select(-Temperature, -Fuel_Price, -Unemployment, -IsHoliday) %>% 
  mutate(Store = as.factor(Store),
         Date = as_date(dmy(Date)))
features_clean <- features_clean[order(features_clean$Date),]
str(features_clean)

## Classes 'tbl_df', 'tbl' and 'data.frame':    8190 obs. of  8 variables:
##  $ Store    : Factor w/ 45 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Date     : Date, format: "2010-02-05" "2010-02-05" ...
##  $ MarkDown1: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown2: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown3: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown4: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown5: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CPI      : num  211 211 214 126 212 ...

colSums((is.na(features_clean)))

##     Store      Date MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5       CPI 
##         0         0      4158      5269      4577      4726      4140       585

For missing value on each of MarkDown and CPI, I will replace them with 0.

Replace missing value with 0 because we need to know the seasonality in the Markdown. Seasonality will deifne when is a sales promo is taken place and so it will give us a meaningful insight which season is the most effective.

features_zero <- features_clean %>% 
  mutate(MarkDown1 = replace_na(MarkDown1,0),
         MarkDown2 = replace_na(MarkDown2,0),
         MarkDown3 = replace_na(MarkDown3,0),
         MarkDown4 = replace_na(MarkDown4,0),
         MarkDown5 = replace_na(MarkDown5,0),
         CPI = replace_na(CPI,0))
head(features_zero)

And now order the column based on the date

features_o <- features_zero %>%
  group_by(Store, Date) %>%
  summarise(MarkDown1 = mean(MarkDown1),
            MarkDown2 = mean(MarkDown2),
            MarkDown3 = mean(MarkDown3),
            MarkDown4 = mean(MarkDown4),
            MarkDown5 = mean(MarkDown5),
            CPI = mean(CPI))

SALES dataset

Take down Dept since it doesn’t define what kind of each Dept is, and IsHoliday, then change the class on several columns

sales_clean <- sales %>% 
  select(-IsHoliday) %>% 
  mutate(Store = as.factor(Store),
         Dept = as.factor(Dept),
         Date = as_date(dmy(Date)))
head(sales_clean)

On sales_clean dataset, the Weekly_Sales filled in based on each department. Since there is no definition on what kind of department they are, each Dept means the same. So Next, I will combine (SUM) the Weekly_Sales based on the same Date.

Sum the Weekly_Sales by the same Date

sales_fin <- sales_clean %>% 
  group_by(Store,Date) %>% 
  summarise(Weekly_Sales = sum(Weekly_Sales)) %>% 
  as.data.frame()

Data Exploration

SALES dataset - See how the Weekly_Sales going in Store 1

sales_store1 <- sales_fin %>% filter(Store == 1)
ggplot(data = sales_store1, aes(x = Date, y = Weekly_Sales, colour = Store)) +
  geom_line(show.legend = F) +
  ggtitle("Weekly Sales on Store 1")

During end of year, there are peaks on sales from each of the sample Store 1 and Store 2, indicates the Christmas season is the highest sales in each year of 2010 and 2011.

Let’s see if the Markdown in each Store 1 & Store 2 behave the same, I choose MarkDown1 for example

features_store1 <- features_o %>% filter(Store == c(1,2))
ggplot(data = features_store1, aes(x = Date, y = MarkDown1, colour = Store)) +
  geom_line()

There is no Weekly_sales information in the year of 2010 and 2011, but the MarkDown1 recorded from last quarter of 2012 and we can see similar path with the sales_fin data that during Christmas, Store 1 and Store 2 create a MarkDown (or price down) to induce sales. I had try all the 5 MarkDown and the MarkDown1 is the most suitable causal for induce sales. I will later use the MarkDown1 to be see if this factor can be a good predictor to induce Sales during Holiday like Christmas.

Let see if the period between several week between the MarkDown data reflecting positively with the Weekly Sales data.

# subset feature_o into Oct 2011 - Oct 2012 to make the same row length with sales_store1
features_store1 <- features_o %>% filter(Store == 1)  %>%  filter(Date <= "2012-10-26")

marketing_campaign <- cbind.data.frame(features_store1, sales_store1$Weekly_Sales)
mkt_plot <- ggplot(marketing_campaign, aes(x = Date, y=sales_store1$Weekly_Sales)) +
  geom_col(aes(fill=MarkDown1)) +
  geom_line(color="orange") +
  ylab("Weekly Sales")
ggplotly(mkt_plot)

After I tried to apply all the MarkDown, I chose MarkDown1 that has a high variation in this period of time, where we get the ThanksGiving, Christmas and Valentine’s Day moment are happening. These are the insights from the plot above:
1. During October until the first week of November, there is no MarkDown data
2. In MarkDown1, I get the insight where starting a month or 3-weeks before the Holiday is due, the MarkDown price is applied.
3. When the Holiday week is due, the sales were increasing but with little MarkDown Price.

Let’s see the data distribution on marketing_campaign

# hHistogram of Weekly Sales data
ggplot(marketing_campaign, aes(x=`sales_store1$Weekly_Sales`, fill="count")) + geom_histogram(color="black", show.legend = F) + xlab("Weekly Sales")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplot(marketing_campaign$`sales_store1$Weekly_Sales`)

# Time Series Plot
mkt_ts <- ts(data = marketing_campaign$`sales_store1$Weekly_Sales`, frequency= 4, deltat = 1/52)
ts_info(mkt_ts)

##  The mkt_ts series is a ts object with 1 variable and 143 observations
##  Frequency: 4 
##  Start time: 1 1 
##  End time: 36 3

autoplot(mkt_ts, main = "Time Series on Weekly Sales")

# See deeper on its Seasonality
mkt_dc <- decompose(mkt_ts)
autoplot(mkt_dc)

Insights from these plots are:
1. The histogram resume that the data is skewed to 1,600,000 USD
2. From the boxplot, there are outliers valued more that 1,900,000 USD
3. From the Time Series Plot, the trend within 3 years is additive.
4. The decompose function shows there is seasonality annualy (around Christmas)

Let’s see if other variable predictors have correlation between each of them to avoid multicolinearity. I’ll use corrplot().

corrplot(cor(marketing_campaign[,c(3:9)]), type="upper", method="ellipse", tl.cex=0.9, )

All the numerical variables has low correlation with Weekly Sales data, it is safe to say that we can use the CPI or Markdown price to be the predictor variables in the future model.

DCD 8 Retail Sales Promo

Andina Septia

3/20/2020