For many years, a conservative retailer's decision to run a sales promotion has depended on what competitors are doing. Some brands simply follow their competitors' lead, but many have other reasons to scale the support budget down and optimize it for their particular audience. With so many internal and external factors influencing sales, such as price markdowns or the Consumer Price Index, a model that forecasts sales gains would help stakeholders support future sales promotions more carefully.
# read US department-store sales and keep the first 36 monthly observations (2010-2012)
us_retail <- read.csv("usdeptstoresales.csv")
us_retail <- as.data.frame(us_retail[1:36, ])
colnames(us_retail) <- "sales"
# strip thousands separators and convert to numeric
us_retail$sales <- as.numeric(gsub("\\,", "", us_retail$sales))
# attach a monthly date index plus month/year helpers for plotting
date <- seq(from = as.Date("2010-01-01"), to = as.Date("2012-12-31"), by = "month")
us_retail <- cbind(us_retail, date)
us_retail$month <- lubridate::month(us_retail$date, label = TRUE)
us_retail$year <- as.character(lubridate::year(us_retail$date))
usretail_plot <- us_retail %>%
  group_by(year, month) %>%
  summarise(sales) %>%
  ungroup()
plot_us <- ggplot(usretail_plot, aes(month, sales, group = year, # using as.factor(year) here caused an error
                  color = year, text = glue("Year: {year}<br>Sales: {sales}"))) +
  geom_line() + geom_point() +
  scale_color_manual(values = c("#cc0099", "blue", "black", "grey")) + # alternative: scale_color_gradient(low = "#cc0099", high = "black")
  labs(title = "Total Sales US Retail - Dept Store") +
  theme_light()
ggplotly(plot_us, tooltip = "text") %>% config(displayModeBar = F)
I'm interested in whether machine learning can help a real sales-promo case, based on sales data from a US retailer (Walmart), and whether the model could be applied to Indonesian retail as well. This can help any brand or retailer set a sensible support budget and run a sales promo confidently, or hold a promo back when a predictable variable is expected to hurt sales.
In this project I will try to predict how effective a price markdown is at increasing sales, using the Retail Dataset from Kaggle: https://www.kaggle.com/manjeetsingh/retaildataset. The dataset consists of historical sales data for 45 stores, each containing a number of departments. The company also runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. The dataset was originally an Excel sheet with three tabs, but Manjeet Singh converted it into three csv files: Stores, Features, Sales. I will evaluate which variables are significant in the Exploratory Data Analysis chapter.
But first, let's take a quick look at the data.
Here is what each column contains:
FEATURES
- Store - the store number
- Date - the week
- Temperature - the average temperature in the region
- Fuel_Price - the cost of fuel in the region
- MarkDown1-5 - anonymized data related to promotional markdowns. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA
- CPI - the consumer price index
- Unemployment - the unemployment rate
- IsHoliday - whether the week is a special holiday week
SALES
Historical sales data, which covers 2010-02-05 to 2012-11-01. Within this tab you will find the following fields:
- Store - the store number
- Dept - the department number
- Date - the week
- Weekly_Sales - sales for the given department in the given store
- IsHoliday - whether the week is a special holiday week
Since the Stores dataset doesn't give meaningful information about seasonal sales, I'm going to drop it. Also, the Features and Sales datasets have different numbers of rows, so I will pre-process the data later and combine them.
SALES dataset
I mutate the Store, Dept, and Date variables into suitable data types and check the structure of the data, which returns 421,570 observations of 4 variables:
sales_ <- sales %>%
dplyr::select(-IsHoliday) %>%
mutate(Store = as.factor(Store),
Dept = as.factor(Dept),
Date = as_date(dmy(Date)))
str(sales_)
## Classes 'tbl_df', 'tbl' and 'data.frame': 421570 obs. of 4 variables:
## $ Store : Factor w/ 45 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Dept : Factor w/ 81 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2010-02-05" "2010-02-12" ...
## $ Weekly_Sales: num 24924 46039 41596 19404 21828 ...
## Store Dept Date Weekly_Sales
## 0 0 0 0
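The zero counts above come from a missing-value check. The exact call is not shown in this report; a minimal sketch, assuming the usual colSums(is.na()) pattern, would be:
colSums(is.na(sales_)) # count missing values per column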
sales_ <- sales_ %>%
  group_by(Store, Date) %>%
  summarise(Weekly_Sales = sum(Weekly_Sales)) %>%
  ungroup()
Since we will forecast the target variable weekly_sales, it is natural to work out which store has the highest sales so we can train the model on the most informative series. Let's see which store has the highest sales (a quick check is sketched below):
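The report does not show the code that identifies the top-selling store; a minimal sketch of that check, assuming a simple total-sales ranking, could look like this:
sales_ %>%
  group_by(Store) %>%
  summarise(total_sales = sum(Weekly_Sales)) %>% # total sales per store over the whole period
  arrange(desc(total_sales)) %>%
  head(5)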
It turns out that Store 20 has the highest total sales, so I will use only Store 20's data. I will use other stores' weekly_sales data after we have found the best forecasting model.
Since I need some of the data for testing after I get the best model, I will subset the last weeks of the series and save them for further exploration.
salesX <- sales_ %>% filter(Store == 20) %>% filter(Date <= "2012-09-28")
data_test <- sales_ %>% filter(Store == 20) %>% filter(Date > "2012-09-28") %>% dplyr::select(-Store)
FEATURES Dataset
Since the retail company keeps this information in a separate Features dataset, I will check whether it differs greatly from the Sales dataset. If it does, we need further data manipulation based on the business objectives. The Features dataset consists of only 8,190 observations, far fewer than the Sales dataset, so to conduct a meaningful data exploration I will synchronize the two datasets later.
features_ <- features %>% dplyr::select(-IsHoliday) %>%
mutate(Store = as.factor(Store),
Date = as_date(dmy(Date)))
str(features_)
## Classes 'tbl_df', 'tbl' and 'data.frame': 8190 obs. of 11 variables:
## $ Store : Factor w/ 45 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2010-02-05" "2010-02-12" ...
## $ Temperature : num 42.3 38.5 39.9 46.6 46.5 ...
## $ Fuel_Price : num 2.57 2.55 2.51 2.56 2.62 ...
## $ MarkDown1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown2 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown4 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CPI : num 211 211 211 211 211 ...
## $ Unemployment: num 8.11 8.11 8.11 8.11 8.11 ...
## Store Date Temperature Fuel_Price MarkDown1 MarkDown2
## 0 0 0 0 4158 5269
## MarkDown3 MarkDown4 MarkDown5 CPI Unemployment
## 4577 4726 4140 585 585
There is a huge amount of missing data in MarkDown1-5 because the host company didn't provide complete records. Since markdowns happen regularly and are important variables, I will replace the missing values with zero, meaning no markdown was applied at that time.
features_clean <- features_ %>%
mutate(MarkDown1 = replace_na(MarkDown1, 0),
MarkDown2 = replace_na(MarkDown2, 0),
MarkDown3 = replace_na(MarkDown3, 0),
MarkDown4 = replace_na(MarkDown4, 0),
MarkDown5 = replace_na(MarkDown5, 0),
CPI = replace_na(CPI, 0),
Unemployment = replace_na(Unemployment, 0)
)
head(features_clean, 10)
Since we know Store 20 has the highest sales, I will subset the data to Store 20 only and synchronize the Sales and Features datasets by their time index.
featuresX <- features_clean %>%
group_by(Store, Date) %>% filter(Store == 20) %>% filter(Date <= "2012-09-28") %>% ungroup()
feature_fin <- featuresX %>% dplyr::select(-Store) %>% column_to_rownames(., var = "Date")
data_test_feat <- features_clean %>%
group_by(Store, Date) %>% filter(Store == 20) %>% filter(Date > "2012-09-28" & Date <= "2012-10-26") %>% ungroup() %>% dplyr::select(-Store)features__ <- features_clean %>%
group_by(Store, Date) %>% filter(Store == 20) %>% filter(Date <= "2012-09-28") %>% ungroup() %>% dplyr::select(-Store)
sales_date <- cbind.data.frame(features__, salesX$Weekly_Sales)
colnames(sales_date)[colnames(sales_date) == 'salesX$Weekly_Sales'] <- 'weekly_sales'
write.csv(sales_date, "salesdate.csv", row.names = FALSE)
The combined data is saved as a csv file. Next, I combine the features with the weekly sales into the modelling data frame:
sales_campaign <- cbind.data.frame(feature_fin, salesX$Weekly_Sales)
colnames(sales_campaign)[colnames(sales_campaign) == 'salesX$Weekly_Sales'] <- 'weekly_sales'
sales_campaign
Let's see what insights we can get from the data. First of all, since I will try to forecast the weekly_sales target variable, I need to know which variables correlate positively with it. The most correlated variables score 1 (towards the dark colour) and the least correlated -1 (towards pink).
corr_plot <- ggcorr(sales_campaign, label = TRUE, label_size = 3, hjust = 1, layout.exp = 3, low = "#cc0099", high = "black")
corr_plot
On this density plot, I check some of the variables to see whether the data are normally distributed. If they are, the curve will be bell-shaped.
discount_plot <- ggplot(sales_campaign) +
geom_density(aes(MarkDown1),fill='pink') +
labs(title = "Data Distribution", x = NULL, y = "Density") +
theme_minimal()
ggplotly(discount_plot)
The MarkDown1 variable is zero for many observations. That is expected, since a sales campaign (a special discount) is never run every day.
The CPI (Consumer Price Index) varies over the years 2010 to 2012, ranging from 203.6 to 215.73. The Consumer Price Index has a valuable influence on consumer behaviour and reflects the country's inflation.
Create Time Series Object
Since the data is ordered by date, it is important to convert it into a time series object so the model can recognize the trend, seasonality, and random components.
## Parsed with column specification:
## cols(
## Temperature = col_double(),
## Fuel_Price = col_double(),
## MarkDown1 = col_double(),
## MarkDown2 = col_double(),
## MarkDown3 = col_double(),
## MarkDown4 = col_double(),
## MarkDown5 = col_double(),
## CPI = col_double(),
## Unemployment = col_double(),
## weekly_sales = col_double()
## )
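The code that builds the time series object is not shown above; a minimal sketch, assuming the series is built from the Store 20 weekly sales and named mkt_ts as referenced later in this report, might be:
# assumption: treat the data as weekly observations with an annual cycle (frequency = 52);
# the start week (~week 6 of 2010, i.e. 2010-02-05) is approximate
mkt_ts <- ts(sales_campaign$weekly_sales, start = c(2010, 6), frequency = 52)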
To see what's inside the time series object, let's decompose it.
Decomposition lets us see whether there is a trend or seasonality in the data's behaviour. It is also important that the company provides enough data (more observations over a longer duration) so the decompose() function can capture seasonality. Seasonality is the most valuable component of a time series because it helps the model train well and keep the error as low as possible.
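A minimal sketch of the decomposition step (the exact call is not shown in the report; decompose() with its default additive model is an assumption):
mkt_decomp <- decompose(mkt_ts) # split into trend, seasonal and random components
plot(mkt_decomp)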
Cross Validation
To review our model, I will split the data into two groups: most of it will be used for training, and the rest to test the model. This practice saves time and energy and lets us choose the best model for forecasting our target variable with confidence.
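The split itself is not shown; a minimal sketch that holds out the last 4 weeks as a test window (the names train_ts and test_ts are my own) could be:
n_obs    <- length(mkt_ts)
train_ts <- window(mkt_ts, end = time(mkt_ts)[n_obs - 4])   # everything except the last 4 weeks
test_ts  <- window(mkt_ts, start = time(mkt_ts)[n_obs - 3]) # the last 4 weeks, for evaluation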
From the mkt_ts plot above, we can see that the trend is additive, and since we have both trend and seasonality, one of the most suitable models is Holt-Winters.
Forecasting the hw_model
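The fitting and forecasting code is not shown; a minimal sketch, assuming an additive Holt-Winters fit on the training window and reusing the object names hw_ts_model and hw_forecast that appear later in this report:
# requires the forecast package for forecast()
hw_ts_model <- HoltWinters(train_ts, seasonal = "additive")
hw_forecast <- forecast(hw_ts_model, h = 4) # forecast the 4 held-out weeks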
Model Evaluation
## [1] 2.417218
## [1] 49252.9
## [1] 59746.31
For the Holt-Winters model, the evaluation returns a MAPE of 2.417218, which is great! It means the mean absolute percentage error, i.e. the total error relative to the actual values, is about 2.4%.
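The three numbers above look like MAPE (in percent), MAE and RMSE on the 4 held-out weeks. A minimal sketch of how they might be reproduced, assuming the MLmetrics helpers used later in this report:
MLmetrics::MAPE(hw_forecast$mean, test_ts) * 100 # mean absolute percentage error
MLmetrics::MAE(hw_forecast$mean, test_ts)        # mean absolute error
MLmetrics::RMSE(hw_forecast$mean, test_ts)       # root mean squared error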
R-squared calculation
## [1] -404.4674
The R-squared is a very large negative number, so our model isn't reliable enough to forecast weekly_sales. This calculation compares the forecast errors with the variation in the real data; an R-squared far below zero means the forecasts miss by much more than a naive mean would.
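The R-squared calculation is not shown; one plausible way it could have been computed on the held-out weeks (an assumption on my part) is:
ss_res <- sum((as.numeric(test_ts) - as.numeric(hw_forecast$mean))^2) # squared forecast errors
ss_tot <- sum((as.numeric(test_ts) - mean(as.numeric(test_ts)))^2)    # variation in the test data
1 - ss_res / ss_tot # far below zero when the forecasts miss badly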
Forecasted weekly_sales numbers are:
## [1] 2125908 2059240 1924922 2045028
The plot above shows the forecasted weekly_sales within a 95% confidence interval. But since we need to attach the other variables that correlate with the target variable, a model that only uses weekly_sales itself is not reliable enough for us.
Assumption Test
Normality Residuals Check
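The call itself is not printed, but it can be read from the test output below:
shapiro.test(residuals(hw_ts_model))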
##
## Shapiro-Wilk normality test
##
## data: residuals(hw_ts_model)
## W = 0.82982, p-value = 0.00000002339
H0: the residuals are normally distributed (p-value > 0.05)
H1: the residuals are not normally distributed (p-value < 0.05)
Since the p-value is < 0.05, the residuals are not normally distributed.
No Autocorrelation
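Likewise, the autocorrelation check below corresponds to a Box-Pierce test on the model residuals:
Box.test(residuals(hw_ts_model), type = "Box-Pierce")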
##
## Box-Pierce test
##
## data: residuals(hw_ts_model)
## X-squared = 1.1038, df = 1, p-value = 0.2934
H0: no autocorrelation in the residuals (p-value > 0.05)
H1: autocorrelation is present (p-value < 0.05)
The p-value is 0.2934 > 0.05, so there is no autocorrelation in the residuals.
Since a time series can have hierarchical seasonality (annual, quarterly, monthly, weekly, down to daily, depending on the data available), I will try another method to better capture the trend and seasonality.
multi <- msts(sales_campaign %>% dplyr::select(weekly_sales, MarkDown1, MarkDown2, MarkDown3, MarkDown4, MarkDown5, CPI),
seasonal.periods = c(4, 52))
mstl(multi[,"weekly_sales"]) %>% autoplot() This plot shows a two distinctive Seasonality of 4 and 52. Seasonality4 means the data has Quarterly pattern, and Seasonality52 has weekly pattern. With us catching better seasonlaity, with addition Regressors of our valuable variables, I’m sure the next Machine Laerning model will learn better and hopefully return minimum errors.
Cross Validation
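The train/test split for the multi-seasonal series is not shown; a minimal sketch that holds out the last 4 weeks and rebuilds the msts object (the construction is an assumption, but the names train_msts and test_msts match the code below):
n_total    <- nrow(multi)
train_msts <- msts(multi[1:(n_total - 4), ], seasonal.periods = c(4, 52)) # all but the last 4 weeks
test_msts  <- multi[(n_total - 3):n_total, ]                              # last 4 weeks, used as newxreg for evaluation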
model_arima <- stlm(train_msts[ , "weekly_sales"], method = "arima",
xreg = train_msts[ , c("MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5", "CPI")])
write_rds(model_arima, path = "shinyandinadcd8/andina_dcd8/www/modelarima.rds")
forecast_arima <- forecast(model_arima, 4, newxreg = test_msts[ , c("MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5", "CPI")])
Model Evaluation
## [1] 2.576265
## [1] 52818
## [1] 63194.73
The MAPE is 2.576265, so the hw_model scores slightly better on this metric, but model_arima captures multiple regressors and the multiple (monthly and annual) seasonality.
Plotting the forecast ARIMA with External Regressors
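The plotting call is not shown; it was presumably the same plot_forecast helper used at the end of this report, something like:
plot_forecast(forecast_obj = forecast_arima, title = "Weekly Sales Forecast", color = "#cc0099")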
As a comparison to the ARIMA model, I tried another approach that can forecast the target variable together with regressors: a VAR model. But since its results over-fit, I will share the process only as a comparison and leave the code deactivated.
Before we process this further, we need to convert the data into a time series object using as.ts(). We then determine the lag length using VARselect(), which reports the lag orders suggested by several information criteria.
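Those two steps are not shown in the report; a hedged sketch, kept commented out like the rest of the VAR code and assuming the VAR is built on the combined Store 20 data in sales_campaign, could look like:
#library(vars)
#sales_ts_var <- as.ts(sales_campaign)                        # multivariate time series for the VAR
#VARselect(sales_ts_var[, 1:10], lag.max = 8, type = "const") # lag-order suggestions from AIC/HQ/SC/FPE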
We also need to check whether there is serial correlation between the variables; we want the test's p-value to be > 0.05.
#var <- VAR(sales_ts_var[,1:10], p=8, type="const", season = NULL,
# exog = NULL)
#serial.test(var, lags.pt=10, type="PT.asymptotic")
According to the hypotheses of the Portmanteau test:
H0: no serial correlation in the residuals (p-value > 0.05)
H1: serial correlation is present (p-value < 0.05)
With lag p = 8, we got a p-value < 0.05, so there is autocorrelation among the variables.
And now, time to forecast the variables
#forecast(var) %>%
#autoplot() + xlab("Year")
#pred_var <- predict(var, n.ahead = 8, ci= 0.95)
#fanchart(pred_var, names = "weekly_sales")
#fanchart(pred_var, names = "CPI")
#fanchart(pred_var, names = "MarkDown3")
#fanchart(pred_var, names = "MarkDown5")
#fanchart(pred_var, names = "Unemployment")The VAR model summarise resulting many equations I will only take one equation of weekly_sales here:
The estimation results for the weekly_sales equation are as follows:
weekly_sales = Temperature.l1 + Fuel_Price.l1 + MarkDown1.l1 + MarkDown2.l1 + MarkDown3.l1 + MarkDown4.l1 + MarkDown5.l1 + CPI.l1 + Unemployment.l1 + weekly_sales.l1 + Temperature.l2 + Fuel_Price.l2 + MarkDown1.l2 + MarkDown2.l2 + MarkDown3.l2 + MarkDown4.l2 + MarkDown5.l2 + CPI.l2 + Unemployment.l2 + weekly_sales.l2 + Temperature.l3 + Fuel_Price.l3 + MarkDown1.l3 + MarkDown2.l3 + MarkDown3.l3 + MarkDown4.l3 + MarkDown5.l3 + CPI.l3 + Unemployment.l3 + weekly_sales.l3 + Temperature.l4 + Fuel_Price.l4 + MarkDown1.l4 + MarkDown2.l4 + MarkDown3.l4 + MarkDown4.l4 + MarkDown5.l4 + CPI.l4 + Unemployment.l4 + weekly_sales.l4 + Temperature.l5 + Fuel_Price.l5 + MarkDown1.l5 + MarkDown2.l5 + MarkDown3.l5 + MarkDown4.l5 + MarkDown5.l5 + CPI.l5 + Unemployment.l5 + weekly_sales.l5 + Temperature.l6 + Fuel_Price.l6 + MarkDown1.l6 + MarkDown2.l6 + MarkDown3.l6 + MarkDown4.l6 + MarkDown5.l6 + CPI.l6 + Unemployment.l6 + weekly_sales.l6 + Temperature.l7 + Fuel_Price.l7 + MarkDown1.l7 + MarkDown2.l7 + MarkDown3.l7 + MarkDown4.l7 + MarkDown5.l7 + CPI.l7 + Unemployment.l7 + weekly_sales.l7 + Temperature.l8 + Fuel_Price.l8 + MarkDown1.l8 + MarkDown2.l8 + MarkDown3.l8 + MarkDown4.l8 + MarkDown5.l8 + CPI.l8 + Unemployment.l8 + weekly_sales.l8 + const
The summary suggests that weekly_sales is mostly influenced by its own past values at lags 1 and 5. The Adjusted R-squared of 0.1538 is a concern: either there are too few observations to forecast from, or not all of these variables are good predictors.
VAR with test data
#pred_var2 <- predict(var, data_test_feat, n.ahead = 4, ci = 0.95)
#fanchart(pred_var2, names = "weekly_sales")
#fanchart(pred_var2, names = "CPI")
#fanchart(pred_var2, names = "MarkDown3")
#fanchart(pred_var2, names = "MarkDown5")
#fanchart(pred_var2, names = "Unemployment")Making the weekly_sales forecast as data frame
#fc_var <- as.matrix(c(2499682, 2509975, 2475853, 2766539))
#MLmetrics::MAPE(fc_var, data_test$Weekly_Sales)*100
The MAPE of the VAR model is 21.83062 (previously 41.85591).
A result like this would jeopardize many aspects of sales planning, since it does not generalize to new data. But with this insight, we can select a better model for scoring new data in the Shiny app I will build.
Finally, we have reached the last chapter of this report. Let's see whether our chosen model also performs well on unseen data, using other stores' datasets. I take the next-best stores, fit them with our chosen model, and check whether the forecasts hold up.
## Parsed with column specification:
## cols(
## Store = col_double(),
## Date = col_date(format = ""),
## Temperature = col_double(),
## Fuel_Price = col_double(),
## MarkDown1 = col_double(),
## MarkDown2 = col_double(),
## MarkDown3 = col_double(),
## MarkDown4 = col_double(),
## MarkDown5 = col_double(),
## CPI = col_double(),
## Unemployment = col_double(),
## weekly_sales = col_double()
## )
allstores <- allstores %>% dplyr::select(-c("Temperature", "Fuel_Price", "Unemployment"))
store_A <- allstores %>% filter(Store == 4) %>% dplyr::select(-Store)
store_B <- allstores %>% filter(Store == 14) %>% dplyr::select(-Store)
store_C <- allstores %>% filter(Store == 13) %>% dplyr::select(-Store)
store_D <- allstores %>% filter(Store == 2) %>% dplyr::select(-Store)
store_E <- allstores %>% filter(Store == 10) %>% dplyr::select(-Store)
store_F <- allstores %>% filter(Store == 27) %>% dplyr::select(-Store)
store_G <- allstores %>% filter(Store == 6) %>% dplyr::select(-Store)
store_H <- allstores %>% filter(Store == 1) %>% dplyr::select(-Store)
store_I <- allstores %>% filter(Store == 39) %>% dplyr::select(-Store)
stores <- c("store_A", "store_B", "store_C", "store_D", "store_E", "store_F", "store_G", "store_H", "store_I")store_ts <- msts(store_A %>% dplyr::select(weekly_sales, MarkDown1, MarkDown2, MarkDown3, MarkDown4, MarkDown5, CPI),
seasonal.periods = c(4, 52))
train_store <- head(store_ts, n = length(store_ts) - 4)
test_store <- tail(store_ts, n = 4)
model_store <- stlm(train_store[ , "weekly_sales"], method = "arima",
xreg = train_store[ , c("MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5", "CPI")])
forecast_store <- forecast(model_store, 4, newxreg = test_store[ , c("MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5", "CPI")])
plot_forecast(forecast_obj = forecast_store, title = "Store Forecast", color = "#cc0099")
Forecasting sales is now more reliable with external variables that are highly correlated with our target. If your company has more valuable data to pair with the sales figures, this model can be applied with those extra variables as regressors. We can also test whether a marketing campaign has a significant impact on forecasted sales, or whether external factors such as a competitor's campaign affect your product sales. The key is a better understanding of the data's behaviour through its trend, its seasonalities, and the regressors. And the more data you provide, the better the model learns and the more accurate its results. You can use this model to help decide what's next: spend more on the marketing budget, or hold off for a while and allocate it to other business priorities.
The decision is now within your reach.