For many years, a conservative retailer's decision to run a sales promotion has depended on what competitors are doing. Some brands simply follow their competitors' lead, but many have other reasons to scale the support budget down and optimize it for their particular audience. With so many internal and external factors influencing sales, such as price markdowns or the Consumer Price Index, a model that forecasts sales gains would help stakeholders support future sales promotions more carefully.
# read US department-store sales and keep the first 36 monthly observations (2010-2012)
us_retail <- read.csv("usdeptstoresales.csv")
us_retail <- as.data.frame(us_retail[1:36, ])
colnames(us_retail) <- "sales"
# strip thousands separators and convert to numeric
us_retail$sales <- as.numeric(gsub("\\,", "", us_retail$sales))
# attach a monthly date index plus month/year helpers for plotting
date <- seq(from = as.Date("2010-01-01"), to = as.Date("2012-12-31"), by = "month")
us_retail <- cbind(us_retail, date)
us_retail$month <- lubridate::month(us_retail$date, label = TRUE)
us_retail$year <- as.character(lubridate::year(us_retail$date))
usretail_plot <- us_retail %>%
  group_by(year, month) %>%
  summarise(sales) %>%
  ungroup()
plot_us <- ggplot(usretail_plot, aes(month, sales, group = year, # using as.factor(year) here caused an error
                  color = year, text = glue("Year: {year}<br>Sales: {sales}"))) +
  geom_line() + geom_point() +
  scale_color_manual(values = c("#cc0099", "blue", "black", "grey")) + # alternative: scale_color_gradient(low = "#cc0099", high = "black")
  labs(title = "Total Sales US Retail - Dept Store") +
  theme_light()
ggplotly(plot_us, tooltip = "text") %>% config(displayModeBar = F)
I'm interested in whether machine learning can help a real sales-promo case, based on sales data from a US retailer (Walmart), and whether the model could be applied to Indonesian retail as well. This can help any brand or retailer set a sensible support budget and run a sales promo confidently, or hold a promo back when a predictable variable is expected to hurt sales.
In this project I will try to predict how effective a price markdown is at increasing sales, using the Retail Dataset from Kaggle: https://www.kaggle.com/manjeetsingh/retaildataset. The dataset consists of historical sales data for 45 stores, each containing a number of departments. The company also runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. The dataset was originally an Excel sheet with three tabs, but Manjeet Singh converted it into three csv files: Stores, Features, Sales. I will evaluate which variables are significant in the Exploratory Data Analysis chapter.
But first, let's take a quick look at the data.
Here is what each column contains:
FEATURES
- Store - the store number
- Date - the week
- Temperature - the average temperature in the region
- Fuel_Price - the cost of fuel in the region
- MarkDown1-5 - anonymized data related to promotional markdowns. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA
- CPI - the consumer price index
- Unemployment - the unemployment rate
- IsHoliday - whether the week is a special holiday week
SALES
Historical sales data, which covers 2010-02-05 to 2012-11-01. Within this tab you will find the following fields:
- Store - the store number
- Dept - the department number
- Date - the week
- Weekly_Sales - sales for the given department in the given store
- IsHoliday - whether the week is a special holiday week
Since the Stores dataset doesn't give meaningful information about seasonal sales, I'm going to drop it. Also, the Features and Sales datasets have different numbers of rows, so I will pre-process the data later and combine them.
SALES dataset
I mutate the Store, Dept, and Date variables into suitable data types and check the structure of the data, which returns 421,570 observations of 4 variables:
sales_ <- sales %>%
dplyr::select(-IsHoliday) %>%
mutate(Store = as.factor(Store),
Dept = as.factor(Dept),
Date = as_date(dmy(Date)))
str(sales_)
## Classes 'tbl_df', 'tbl' and 'data.frame': 421570 obs. of 4 variables:
## $ Store : Factor w/ 45 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Dept : Factor w/ 81 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2010-02-05" "2010-02-12" ...
## $ Weekly_Sales: num 24924 46039 41596 19404 21828 ...
## Store Dept Date Weekly_Sales
## 0 0 0 0
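The zero counts above come from a missing-value check. The exact call is not shown in this report; a minimal sketch, assuming the usual colSums(is.na()) pattern, would be:
colSums(is.na(sales_)) # count missing values per column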
sales_ <- sales_ %>%
  group_by(Store, Date) %>%
  summarise(Weekly_Sales = sum(Weekly_Sales)) %>%
  ungroup()
Since we will forecast the target variable weekly_sales, it is natural to work out which store has the highest sales so we can train the model on the most informative series. Let's see which store has the highest sales (a quick check is sketched below):
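The report does not show the code that identifies the top-selling store; a minimal sketch of that check, assuming a simple total-sales ranking, could look like this:
sales_ %>%
  group_by(Store) %>%
  summarise(total_sales = sum(Weekly_Sales)) %>% # total sales per store over the whole period
  arrange(desc(total_sales)) %>%
  head(5)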
It turns out that Store 20 has the highest total sales, so I will use only Store 20's data. I will use other stores' weekly_sales data after we have found the best forecasting model.
Since I need some of the data for testing after I get the best model, I will subset the last weeks of the series and save them for further exploration.
salesX <- sales_ %>% filter(Store == 20) %>% filter(Date <= "2012-09-28")
data_test <- sales_ %>% filter(Store == 20) %>% filter(Date > "2012-09-28") %>% dplyr::select(-Store)
FEATURES Dataset
Since the retail company keeps this information in a separate Features dataset, I will check whether it differs greatly from the Sales dataset. If it does, we need further data manipulation based on the business objectives. The Features dataset consists of only 8,190 observations, far fewer than the Sales dataset, so to conduct a meaningful data exploration I will synchronize the two datasets later.
features_ <- features %>% dplyr::select(-IsHoliday) %>%
mutate(Store = as.factor(Store),
Date = as_date(dmy(Date)))
str(features_)
## Classes 'tbl_df', 'tbl' and 'data.frame': 8190 obs. of 11 variables:
## $ Store : Factor w/ 45 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2010-02-05" "2010-02-12" ...
## $ Temperature : num 42.3 38.5 39.9 46.6 46.5 ...
## $ Fuel_Price : num 2.57 2.55 2.51 2.56 2.62 ...
## $ MarkDown1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown2 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown4 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CPI : num 211 211 211 211 211 ...
## $ Unemployment: num 8.11 8.11 8.11 8.11 8.11 ...
## Store Date Temperature Fuel_Price MarkDown1 MarkDown2
## 0 0 0 0 4158 5269
## MarkDown3 MarkDown4 MarkDown5 CPI Unemployment
## 4577 4726 4140 585 585
There is a huge amount of missing data in MarkDown1-5 because the host company didn't provide complete records. Since markdowns happen regularly and are important variables, I will replace the missing values with zero, meaning no markdown was applied at that time.
features_clean <- features_ %>%
mutate(MarkDown1 = replace_na(MarkDown1, 0),
MarkDown2 = replace_na(MarkDown2, 0),
MarkDown3 = replace_na(MarkDown3, 0),
MarkDown4 = replace_na(MarkDown4, 0),
MarkDown5 = replace_na(MarkDown5, 0),
CPI = replace_na(CPI, 0),
Unemployment = replace_na(Unemployment, 0)
)
head(features_clean, 10)
Since we know Store 20 has the highest sales, I will subset the data to Store 20 only and synchronize the Sales and Features datasets by their time index.
featuresX <- features_clean %>%
group_by(Store, Date) %>% filter(Store == 20) %>% filter(Date <= "2012-09-28") %>% ungroup()
feature_fin <- featuresX %>% dplyr::select(-Store) %>% column_to_rownames(., var = "Date")
data_test_feat <- features_clean %>%
group_by(Store, Date) %>% filter(Store == 20) %>% filter(Date > "2012-09-28" & Date <= "2012-10-26") %>% ungroup() %>% dplyr::select(-Store)features__ <- features_clean %>%
group_by(Store, Date) %>% filter(Store == 20) %>% filter(Date <= "2012-09-28") %>% ungroup() %>% dplyr::select(-Store)
sales_date <- cbind.data.frame(features__, salesX$Weekly_Sales)
colnames(sales_date)[colnames(sales_date) == 'salesX$Weekly_Sales'] <- 'weekly_sales'
write.csv(sales_date, "salesdate.csv", row.names = FALSE)
The combined data is saved as a csv file. Next, I combine the features with the weekly sales into the modelling data frame:
sales_campaign <- cbind.data.frame(feature_fin, salesX$Weekly_Sales)
colnames(sales_campaign)[colnames(sales_campaign) == 'salesX$Weekly_Sales'] <- 'weekly_sales'
sales_campaign
Let's see what insights we can get from the data. First of all, since I will try to forecast the weekly_sales target variable, I need to know which variables correlate positively with it. The most correlated variables score 1 (towards the dark colour) and the least correlated -1 (towards pink).
corr_plot <- ggcorr(sales_campaign, label = TRUE, label_size = 3, hjust = 1, layout.exp = 3, low = "#cc0099", high = "black")
corr_plot
On this density plot, I check some of the variables to see whether the data are normally distributed. If they are, the curve will be bell-shaped.
discount_plot <- ggplot(sales_campaign) +
geom_density(aes(MarkDown1),fill='pink') +
labs(title = "Data Distribution", x = NULL, y = "Density") +
theme_minimal()
ggplotly(discount_plot)
The MarkDown1 variable is zero for many observations. That is expected, since a sales campaign (a special discount) is never run every day.
The CPI (Consumer Price Index) varies over the years 2010 to 2012, ranging from 203.6 to 215.73. The Consumer Price Index has a valuable influence on consumer behaviour and reflects the country's inflation.
Create Time Series Object
Since the data is ordered by date, it is important to convert it into a time series object so the model can recognize the trend, seasonality, and random components.
## Parsed with column specification:
## cols(
## Temperature = col_double(),
## Fuel_Price = col_double(),
## MarkDown1 = col_double(),
## MarkDown2 = col_double(),
## MarkDown3 = col_double(),
## MarkDown4 = col_double(),
## MarkDown5 = col_double(),
## CPI = col_double(),
## Unemployment = col_double(),
## weekly_sales = col_double()
## )
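The code that builds the time series object is not shown above; a minimal sketch, assuming the series is built from the Store 20 weekly sales and named mkt_ts as referenced later in this report, might be:
# assumption: treat the data as weekly observations with an annual cycle (frequency = 52);
# the start week (~week 6 of 2010, i.e. 2010-02-05) is approximate
mkt_ts <- ts(sales_campaign$weekly_sales, start = c(2010, 6), frequency = 52)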
To see what's inside the time series object, let's decompose it.
Decomposition lets us see whether there is a trend or seasonality in the data's behaviour. It is also important that the company provides enough data (more observations over a longer duration) so the decompose() function can capture seasonality. Seasonality is the most valuable component of a time series because it helps the model train well and keep the error as low as possible.
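A minimal sketch of the decomposition step (the exact call is not shown in the report; decompose() with its default additive model is an assumption):
mkt_decomp <- decompose(mkt_ts) # split into trend, seasonal and random components
plot(mkt_decomp)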
Cross Validation
To review our model, I will split the data into two groups: most of it will be used for training, and the rest to test the model. This practice saves time and energy and lets us choose the best model for forecasting our target variable with confidence.
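The split itself is not shown; a minimal sketch that holds out the last 4 weeks as a test window (the names train_ts and test_ts are my own) could be:
n_obs    <- length(mkt_ts)
train_ts <- window(mkt_ts, end = time(mkt_ts)[n_obs - 4])   # everything except the last 4 weeks
test_ts  <- window(mkt_ts, start = time(mkt_ts)[n_obs - 3]) # the last 4 weeks, for evaluation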
From the mkt_ts plot above, we can see that the trend is additive, and since we have both trend and seasonality, one of the most suitable models is Holt-Winters.
Forecasting the hw_model
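The fitting and forecasting code is not shown; a minimal sketch, assuming an additive Holt-Winters fit on the training window and reusing the object names hw_ts_model and hw_forecast that appear later in this report:
# requires the forecast package for forecast()
hw_ts_model <- HoltWinters(train_ts, seasonal = "additive")
hw_forecast <- forecast(hw_ts_model, h = 4) # forecast the 4 held-out weeks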
Model Evaluation
## [1] 2.417218
## [1] 49252.9
## [1] 59746.31
For the Holt-Winters model, the evaluation returns a MAPE of 2.417218, which is great! It means the mean absolute percentage error, i.e. the total error relative to the actual values, is about 2.4%.
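The three numbers above look like MAPE (in percent), MAE and RMSE on the 4 held-out weeks. A minimal sketch of how they might be reproduced, assuming the MLmetrics helpers used later in this report:
MLmetrics::MAPE(hw_forecast$mean, test_ts) * 100 # mean absolute percentage error
MLmetrics::MAE(hw_forecast$mean, test_ts)        # mean absolute error
MLmetrics::RMSE(hw_forecast$mean, test_ts)       # root mean squared error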
R-squared calculation
## [1] -404.4674
The R-squared is a very large negative number, so our model isn't reliable enough to forecast weekly_sales. This calculation compares the forecast errors with the variation in the real data; an R-squared far below zero means the forecasts miss by much more than a naive mean would.
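The R-squared calculation is not shown; one plausible way it could have been computed on the held-out weeks (an assumption on my part) is:
ss_res <- sum((as.numeric(test_ts) - as.numeric(hw_forecast$mean))^2) # squared forecast errors
ss_tot <- sum((as.numeric(test_ts) - mean(as.numeric(test_ts)))^2)    # variation in the test data
1 - ss_res / ss_tot # far below zero when the forecasts miss badly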
Forecasted weekly_sales numbers are:
## [1] 2125908 2059240 1924922 2045028
The plot above shows the forecasted weekly_sales within a 95% confidence interval. But since we need to attach the other variables that correlate with the target variable, a model that only uses weekly_sales itself is not reliable enough for us.
Assumption Test
Normality Residuals Check
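The call itself is not printed, but it can be read from the test output below:
shapiro.test(residuals(hw_ts_model))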
##
## Shapiro-Wilk normality test
##
## data: residuals(hw_ts_model)
## W = 0.82982, p-value = 0.00000002339
H0: the residuals are normally distributed (p-value > 0.05)
H1: the residuals are not normally distributed (p-value < 0.05)
Since the p-value is < 0.05, the residuals are not normally distributed.
No Autocorrelation
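Likewise, the autocorrelation check below corresponds to a Box-Pierce test on the model residuals:
Box.test(residuals(hw_ts_model), type = "Box-Pierce")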
##
## Box-Pierce test
##
## data: residuals(hw_ts_model)
## X-squared = 1.1038, df = 1, p-value = 0.2934
H0: no autocorrelation in the residuals (p-value > 0.05)
H1: autocorrelation is present (p-value < 0.05)
The p-value is 0.2934 > 0.05, so there is no autocorrelation in the residuals.
Since a time series can have hierarchical seasonality (annual, quarterly, monthly, weekly, down to daily, depending on the data available), I will try another method to better capture the trend and seasonality.
multi <- msts(sales_campaign %>% dplyr::select(weekly_sales, MarkDown1, MarkDown2, MarkDown3, MarkDown4, MarkDown5, CPI),
seasonal.periods = c(4, 52))
mstl(multi[,"weekly_sales"]) %>% autoplot() This plot shows a two distinctive Seasonality of 4 and 52. Seasonality4 means the data has Quarterly pattern, and Seasonality52 has weekly pattern. With us catching better seasonlaity, with addition Regressors of our valuable variables, I’m sure the next Machine Laerning model will learn better and hopefully return minimum errors.
Cross Validation
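The train/test split for the multi-seasonal series is not shown; a minimal sketch that holds out the last 4 weeks and rebuilds the msts object (the construction is an assumption, but the names train_msts and test_msts match the code below):
n_total    <- nrow(multi)
train_msts <- msts(multi[1:(n_total - 4), ], seasonal.periods = c(4, 52)) # all but the last 4 weeks
test_msts  <- multi[(n_total - 3):n_total, ]                              # last 4 weeks, used as newxreg for evaluation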
model_arima <- stlm(train_msts[ , "weekly_sales"], method = "arima",
xreg = train_msts[ , c("MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5", "CPI")])
write_rds(model_arima, path = "shinyandinadcd8/andina_dcd8/www/modelarima.rds")
forecast_arima <- forecast(model_arima, 4, newxreg = test_msts[ , c("MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5", "CPI")])
Model Evaluation
## [1] 2.576265
## [1] 52818
## [1] 63194.73
The MAPE is 2.576265, so the hw_model scores slightly better on this metric, but model_arima captures multiple regressors and the multiple (monthly and annual) seasonality.
Plotting the forecast ARIMA with External Regressors
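The plotting call is not shown; it was presumably the same plot_forecast helper used at the end of this report, something like:
plot_forecast(forecast_obj = forecast_arima, title = "Weekly Sales Forecast", color = "#cc0099")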
As a comparison to the ARIMA model, I tried another approach that can forecast the target variable together with regressors: a VAR model. But since its results over-fit, I will share the process only as a comparison and leave the code deactivated.
Before we process this further, we need to convert the data into a time series object using as.ts(). We then determine the lag length using VARselect(), which reports the lag orders suggested by several information criteria.
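Those two steps are not shown in the report; a hedged sketch, kept commented out like the rest of the VAR code and assuming the VAR is built on the combined Store 20 data in sales_campaign, could look like:
#library(vars)
#sales_ts_var <- as.ts(sales_campaign)                        # multivariate time series for the VAR
#VARselect(sales_ts_var[, 1:10], lag.max = 8, type = "const") # lag-order suggestions from AIC/HQ/SC/FPE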
We also need to check whether there is serial correlation between the variables; we want the test's p-value to be > 0.05.
#var <- VAR(sales_ts_var[,1:10], p=8, type="const", season = NULL,
# exog = NULL)
#serial.test(var, lags.pt=10, type="PT.asymptotic")
According to the hypotheses of the Portmanteau test:
H0: no serial correlation in the residuals (p-value > 0.05)
H1: serial correlation is present (p-value < 0.05)
With lag p = 8, we got a p-value < 0.05, so there is autocorrelation among the variables.
And now, time to forecast the variables
#forecast(var) %>%
#autoplot() + xlab("Year")
#pred_var <- predict(var, n.ahead = 8, ci= 0.95)
#fanchart(pred_var, names = "weekly_sales")
#fanchart(pred_var, names = "CPI")
#fanchart(pred_var, names = "MarkDown3")
#fanchart(pred_var, names = "MarkDown5")
#fanchart(pred_var, names = "Unemployment")The VAR model summarise resulting many equations I will only take one equation of weekly_sales here:
The estimation results for the weekly_sales equation are as follows:
weekly_sales = Temperature.l1 + Fuel_Price.l1 + MarkDown1.l1 + MarkDown2.l1 + MarkDown3.l1 + MarkDown4.l1 + MarkDown5.l1 + CPI.l1 + Unemployment.l1 + weekly_sales.l1 + Temperature.l2 + Fuel_Price.l2 + MarkDown1.l2 + MarkDown2.l2 + MarkDown3.l2 + MarkDown4.l2 + MarkDown5.l2 + CPI.l2 + Unemployment.l2 + weekly_sales.l2 + Temperature.l3 + Fuel_Price.l3 + MarkDown1.l3 + MarkDown2.l3 + MarkDown3.l3 + MarkDown4.l3 + MarkDown5.l3 + CPI.l3 + Unemployment.l3 + weekly_sales.l3 + Temperature.l4 + Fuel_Price.l4 + MarkDown1.l4 + MarkDown2.l4 + MarkDown3.l4 + MarkDown4.l4 + MarkDown5.l4 + CPI.l4 + Unemployment.l4 + weekly_sales.l4 + Temperature.l5 + Fuel_Price.l5 + MarkDown1.l5 + MarkDown2.l5 + MarkDown3.l5 + MarkDown4.l5 + MarkDown5.l5 + CPI.l5 + Unemployment.l5 + weekly_sales.l5 + Temperature.l6 + Fuel_Price.l6 + MarkDown1.l6 + MarkDown2.l6 + MarkDown3.l6 + MarkDown4.l6 + MarkDown5.l6 + CPI.l6 + Unemployment.l6 + weekly_sales.l6 + Temperature.l7 + Fuel_Price.l7 + MarkDown1.l7 + MarkDown2.l7 + MarkDown3.l7 + MarkDown4.l7 + MarkDown5.l7 + CPI.l7 + Unemployment.l7 + weekly_sales.l7 + Temperature.l8 + Fuel_Price.l8 + MarkDown1.l8 + MarkDown2.l8 + MarkDown3.l8 + MarkDown4.l8 + MarkDown5.l8 + CPI.l8 + Unemployment.l8 + weekly_sales.l8 + const
The summary suggests that weekly_sales is mostly influenced by its own past values at lags 1 and 5. The Adjusted R-squared of 0.1538 is a concern: either there are too few observations to forecast from, or not all of these variables are good predictors.
VAR with test data
#pred_var2 <- predict(var, data_test_feat, n.ahead = 4, ci = 0.95)
#fanchart(pred_var2, names = "weekly_sales")
#fanchart(pred_var2, names = "CPI")
#fanchart(pred_var2, names = "MarkDown3")
#fanchart(pred_var2, names = "MarkDown5")
#fanchart(pred_var2, names = "Unemployment")Making the weekly_sales forecast as data frame
#fc_var <- as.matrix(c(2499682, 2509975, 2475853, 2766539))
#MLmetrics::MAPE(fc_var, data_test$Weekly_Sales)*100
The MAPE of the VAR model is 21.83062 (previously 41.85591).
A result like this would jeopardize many aspects of sales planning, since it does not generalize to new data. But with this insight, we can select a better model for scoring new data in the Shiny app I will build.
Finally, we have reached the last chapter of this report. Let's see whether our chosen model also performs well on unseen data, using other stores' datasets. I take the next-best stores, fit them with our chosen model, and check whether the forecasts hold up.
## Parsed with column specification:
## cols(
## Store = col_double(),
## Date = col_date(format = ""),
## Temperature = col_double(),
## Fuel_Price = col_double(),
## MarkDown1 = col_double(),
## MarkDown2 = col_double(),
## MarkDown3 = col_double(),
## MarkDown4 = col_double(),
## MarkDown5 = col_double(),
## CPI = col_double(),
## Unemployment = col_double(),
## weekly_sales = col_double()
## )
allstores <- allstores %>% dplyr::select(-c("Temperature", "Fuel_Price", "Unemployment"))
store_A <- allstores %>% filter(Store == 4) %>% dplyr::select(-Store)
store_B <- allstores %>% filter(Store == 14) %>% dplyr::select(-Store)
store_C <- allstores %>% filter(Store == 13) %>% dplyr::select(-Store)
store_D <- allstores %>% filter(Store == 2) %>% dplyr::select(-Store)
store_E <- allstores %>% filter(Store == 10) %>% dplyr::select(-Store)
store_F <- allstores %>% filter(Store == 27) %>% dplyr::select(-Store)
store_G <- allstores %>% filter(Store == 6) %>% dplyr::select(-Store)
store_H <- allstores %>% filter(Store == 1) %>% dplyr::select(-Store)
store_I <- allstores %>% filter(Store == 39) %>% dplyr::select(-Store)
stores <- c("store_A", "store_B", "store_C", "store_D", "store_E", "store_F", "store_G", "store_H", "store_I")store_ts <- msts(store_A %>% dplyr::select(weekly_sales, MarkDown1, MarkDown2, MarkDown3, MarkDown4, MarkDown5, CPI),
seasonal.periods = c(4, 52))
train_store <- head(store_ts, n = length(store_ts) - 4)
test_store <- tail(store_ts, n = 4)
model_store <- stlm(train_store[ , "weekly_sales"], method = "arima",
xreg = train_store[ , c("MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5", "CPI")])
forecast_store <- forecast(model_store, 4, newxreg = test_store[ , c("MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5", "CPI")])
plot_forecast(forecast_obj = forecast_store, title = "Store Forecast", color = "#cc0099")
Forecasting sales is now more reliable with external variables that are highly correlated with our target. If your company has more valuable data to pair with the sales figures, this model can be applied with those extra variables as regressors. We can also test whether a marketing campaign has a significant impact on forecasted sales, or whether external factors such as a competitor's campaign affect your product sales. The key is a better understanding of the data's behaviour through its trend, its seasonalities, and the regressors. And the more data you provide, the better the model learns and the more accurate its results. You can use this model to help decide what's next: spend more on the marketing budget, or hold off for a while and allocate it to other business priorities.
The decision is now within your reach.