#Analysis of Superstore Orders ##by Matthew Luckey

BACKGROUND The purpose of this file is to analyze a dataset, determine trends, seasonality, and derive a forecast. Skills learned are from Diogo Alves de Resende. The below analysis is only a small portion of what I learned, and if you are interested in learning more, his class is highly recommended. To find his course, go to Udemy.com and search for “Forecasting Models and Time Series for Business in R”. Additional skills and understanding were gathered from Factory Physics For Managers by Ed Pound, Jeffrey Bell, and Mark Spearman. Statistics were refreshed by utilizing Khan Academy. The dataset was derived from Kaggle.com Sample_Superstore

In business we already know that the forecast is always going to be wrong. This does not mean that we can not improve the forecast. The three rules of forecasting are:

  1. The forecast is always going to be wrong
  2. Aggregate forecasts are more accurate than specific forecasts
  3. The further into the future you forecast, the larger your error will become

EXERCISE Sample Superstore ships office supplies and miscellaneous equipment across the lower 48 states. Recently they have missed their Perfect Order Fulfillment promise to their customers, by not being able to deliver at the right time. This was caused by Sample Superstore not having a good understanding of the trends and seasonality that occurs weekly and monthly at their locations.

They have provided data from 03-Jan-2014 to 30-Dec-2017. With this data they have asked to determine trends, seasonality, and build a better forecast that their business can leverage in the future.

ASK Understanding what Sample Superstore needs, we will need to develop a list of questions to ask, in order to meet the companies request.

  1. What does the trend look like for Sample Superstore?
  2. Is there seasonality, and if yes, is this daily, weekly, monthly?
  3. What does the forecast look like at an aggregate over the next 6 months?
  4. In general, are there any other quick interesting facts the data can tell us?

PREPARE Load the packages which will help with the modeling. Note if you do not have the packages loaded into your system, you will need to install.packages().

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(prophet)
## Warning: package 'prophet' was built under R version 4.1.3
## Loading required package: Rcpp
## Loading required package: rlang
## 
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
## 
##     %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
##     flatten_lgl, flatten_raw, invoke, splice
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Bring in the data that the customer provided. They provided a CSV file with ten thousand observations from 2014-2017.

raw_df <- read.csv("Data/sample_superstore_data.csv")

#We will want to explore how the data is formatted
str(raw_df)
## 'data.frame':    9994 obs. of  24 variables:
##  $ Row.ID       : int  7981 740 741 742 1760 5328 7181 7475 7476 7477 ...
##  $ Order.ID     : chr  "CA-2014-103800" "CA-2014-112326" "CA-2014-112326" "CA-2014-112326" ...
##  $ Order.Date   : chr  "1/3/2014" "1/4/2014" "1/4/2014" "1/4/2014" ...
##  $ Ship.Date    : chr  "1/7/2014" "1/8/2014" "1/8/2014" "1/8/2014" ...
##  $ Ship.Mode    : chr  "Standard Class" "Standard Class" "Standard Class" "Standard Class" ...
##  $ Customer.ID  : chr  "DP-13000" "PO-19195" "PO-19195" "PO-19195" ...
##  $ Customer.Name: chr  "Darren Powers" "Phillina Ober" "Phillina Ober" "Phillina Ober" ...
##  $ Segment      : chr  "Consumer" "Home Office" "Home Office" "Home Office" ...
##  $ Country      : chr  "United States" "United States" "United States" "United States" ...
##  $ City         : chr  "Houston" "Naperville" "Naperville" "Naperville" ...
##  $ State        : chr  "Texas" "Illinois" "Illinois" "Illinois" ...
##  $ Postal.Code  : int  77095 60540 60540 60540 19143 90049 30605 42420 42420 42420 ...
##  $ Region       : chr  "Central" "Central" "Central" "Central" ...
##  $ Product.ID   : chr  "OFF-PA-10000174" "OFF-LA-10003223" "OFF-ST-10002743" "OFF-BI-10004094" ...
##  $ Category     : chr  "Office Supplies" "Office Supplies" "Office Supplies" "Office Supplies" ...
##  $ Sub.Category : chr  "Paper" "Labels" "Storage" "Binders" ...
##  $ Product.Name : chr  "Message Book, Wirebound, Four 5 1/2\" X 4\" Forms/Pg., 200 Dupl. Sets/Book" "Avery 508" "SAFCO Boltless Steel Shelving" "GBC Standard Plastic Binding Systems Combs" ...
##  $ Sales        : num  16.45 11.78 272.74 3.54 19.54 ...
##  $ Quantity     : int  2 3 3 2 3 3 3 9 2 2 ...
##  $ Discount     : num  0.2 0.2 0.2 0.8 0.2 0 0 0 0 0 ...
##  $ Profit       : num  5.55 4.27 -64.77 -5.49 4.88 ...
##  $ MarchSale    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NovemberSale : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DecemberSale : int  0 0 0 0 0 0 0 0 0 0 ...

PROCESS Now that we have brought the data in, and began viewing it, we need to clean it up, so we can build our model off of it. This data set will allow us to do a variety of analysis, but the focus will be on the forecasting model for this project. Avoiding confusion, and keeping focus on the task.

#Transform the Date Variables
raw_df$Order.Date <- as.character(raw_df$Order.Date)
raw_df$Order.Date <- mdy(raw_df$Order.Date)

raw_df$Ship.Date <- as.character(raw_df$Ship.Date)
raw_df$Ship.Date <- mdy(raw_df$Ship.Date)

#Select the columns to analyze
df <- select(raw_df,"Order.Date", "Segment", "State", "Region", "Sales", "Quantity",
             "Category","Sub.Category")

Now do some simple exploration to determine which categories land where, and from which region.

ggplot(df, aes(x=Region, y=Sales, color=Category))+
  geom_point(position="jitter")

Since we are looking at quantity of items being sold, we will want to narrow our scope to the item being observed. Primarily, because you cant put quantity of office supplies with furniture. The low office supplies cost may throw the larger furniture cost off, and vice versa.

df <- filter(df, Category=="Office Supplies")

Now that we have the columns we want and the date in the right format, we can do some basic summary actions before plotting, and transforming into a time series format.

df <- df %>% 
  group_by(Order.Date) %>% 
  summarise(Sales=sum(Sales))

Not all of the dates are consecutive, and when we transform the data into a time series, it is thrown off by this. Lets fill in the missing dates with values.

df <- df %>%
  mutate(Date = as.Date(Order.Date)) %>%
  complete(Order.Date = seq.Date(min(Order.Date), max(Order.Date), by="day"))
df$Date=NULL

The Facebook Prophet Forecasting Model requires data to be entered in a very specific format. This means that our column names need to change in order to fit this format

df <- rename(df, y=Sales)
df <- rename(df, ds=Order.Date)

When doing some exploratory analysis early on, the data from 2014, was not as accurate as the most recent data. For this reason, we will filter it out

df <- filter(df, ds>="2014-12-31")

If there are NA values for a date, we take the mean of the dataset
and fill that into the missing areas, so as to not throw off the average. A better way to do this may exist, but for the purpose of the exercise, its okay.

df[is.na(df)] <- 1

ANALYZE This next section will be analyzing the initial data, and doing some further processing to build the forecasting model we want to use.

ggplot(df, aes(x=ds, y=y))+
  geom_line()+
  xlab("Time")+
  ylab("Sales")+
  theme(text=element_text(size=12))+
  scale_x_date(date_labels="%Y %b")

We can see a few things while looking at this chart: 1. Data is being plotted from January 2015 - January 2018 2. The largest Orders occurred around September of 2016 3. It appears that the volume is increasing, but difficult to really tell

Next we will transform the data into a time series data and plot.

y = ts(data=df$y,
       start = c(2014,1),
       frequency = 365.25)
plot.ts(y)

This is a replotting of the data in a time series format to ensure that everything looks correct to the non-time series data.

The next chart plotted will break down the time series data into a 365 day clock. This will allow patterns to be seen in the data quickly. If we wanted to add holidays or regressors to our model, this would be one way to determine where these need to be.

ggseasonplot(x=y,
             main="Seasonality Graph",
             polar=TRUE)

While this becomes a bit busy using daily data, we can determine a few things: 1. There is a spike in the June timeframe 2. Overall sales seem relatively consistent 3. We have spikes at periodic times throughout the year

The next graph will break down the time series into a the data, trend, seasonality, and error.

decomposition_multiplicative <- decompose(x=y,
                                          type="additive")

plot (decomposition_multiplicative)

This tells us the following:

Now it is time to begin building our forecast training and test sets. This will allow us to build our model, test against historical data, and then run against the future value.

training <- df %>% filter(ds <"2017-01-01")
test <- df %>% filter(ds>="2017-01-01")

At the time, I will not include holidays, but did want to still build out the model

m <-  prophet(yearly.seasonality = TRUE,
            weekly.seasonality = TRUE,
            daily.seasonality = FALSE,
            seasonality.mode = 'additive',
            seasonality.prior.scale = 10,
            changepoint.prior.scale = 0.01)
m <-  fit.prophet(m, training)

The power of the Prophet model, is the customization available. The daily seasonality is FALSE because we do not have hourly data. The seasonality numbers are telling prophet how to measure seasonality, and how many change points should be detected.

The next thing to look at is building the future dataset based off of the number of rows from the test dataframe. Then do a bit of processing.

future = make_future_dataframe(m,
                               periods = nrow(test))

#Change date type of future
future$ds <- ymd(future$ds)

#Remove NA Values from future
future <- future %>% drop_na

#Forecasting
forecast = predict(m, future)

#visualization
plot(m, forecast)

This has broken our forecast model down. The black dots are the actual data points. The blue line is our point forecast, and the light blue shaded area are the confidence intervals. This is not perfect, and we will try and improve.

Break down the components of this model:

prophet_plot_components(m, forecast)

predictions = tail(forecast$yhat, nrow(test))
accuracy(predictions, test$y)
##                ME     RMSE      MAE       MPE     MAPE
## Test set 90.93778 962.8743 563.0724 -3801.103 4265.139

This tells us the following: - The trend is increasing - Weekly sales dip on Wednesday and Saturday - We have some wild monthly seasonality

Now I want to run a loop on this same data to determine what the best option is for the parameters we plug into our model. (WARNING: Time Intense)

First build out a grid which will house all of the options

prophetGrid <- expand.grid(changepoint_prior_scale = c(0.01, 0.05, 0.1, 0.15),
                        seasonality_prior_scale = c(5,10,15,20,25,30),
                        seasonality.mode = c("multiplicative", "additive"))

This has given us 48 different scenarios to run through. Now we will build a vector in which our data results can be input into

results <-  vector(mode = "numeric",
                 length = nrow(prophetGrid))

Now build the loop to run through all 48 of the options

for (i in 1:nrow(prophetGrid)) {
  
  #Fetch Parameters
  parameters = prophetGrid[i, ]
  
  #Build the model
  m = prophet(yearly.seasonality = TRUE,
              weekly.seasonality = TRUE,
              daily.seasonality = FALSE,
              seasonality.mode = parameters$seasonality.mode,
              seasonality.prior.scale = parameters$seasonality_prior_scale,
              holidays.prior.scale = parameters$holidays_prior_scale,
              changepoint.prior.scale = parameters$changepoint_prior_scale)
  m=fit.prophet(m,df)
  
  #Cross-Validation
  df.cv <- cross_validation(model=m,
                         horizon=7,
                         units= "days",
                         period=14,
                         initial = 731)
    #Store the results
  results[i] <- accuracy(df.cv$yhat, df.cv$y)[1, 2]
  print(i)
}
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 1
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 2
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 3
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 4
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 5
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 6
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 7
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 8
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 9
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 10
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 11
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 12
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 13
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 14
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 15
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 16
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 17
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 18
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 19
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 20
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 21
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 22
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 23
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 24
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 25
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 26
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 27
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 28
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 29
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 30
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 31
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 32
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 33
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 34
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 35
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 36
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 37
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 38
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 39
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 40
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 41
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 42
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 43
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 44
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 45
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 46
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 47
## Making 26 forecasts with cutoffs between 2017-01-07 and 2017-12-23
## [1] 48

Now that we have this data, we need to combine with our results and determine which parameters are best to use for our future model.

prophetGrid <- cbind(prophetGrid, results)
best_parameters <- prophetGrid[prophetGrid$results==min(results), ] 

The best parameter yielding the lowest RMSE was the following: Changepoint Prior Scale: .01 Seasonality Prior Scale: 15 Seasonality Mode: Additive RMSE: 2227.741

Now lets plug in those parameters, and graph the results

m <-  prophet(yearly.seasonality = TRUE,
            weekly.seasonality = TRUE,
            daily.seasonality = FALSE,
            seasonality.mode = 'additive',
            seasonality.prior.scale = 15,
            changepoint.prior.scale = 0.01)
m <-  fit.prophet(m, training)

future = make_future_dataframe(m,
                               periods = nrow(test))

#Change date type of future
future$ds <- ymd(future$ds)

#Remove NA Values from future
future <- future %>% drop_na

#Forecasting
forecast <-  predict(m, future)

#visualization
plot(m, forecast)

Now we will pull the accuracy of this plot so we can see all measurements

predictions = tail(forecast$yhat, nrow(test))
accuracy(predictions, test$y)
##                ME     RMSE      MAE       MPE    MAPE
## Test set 88.35236 962.5934 563.7344 -3824.005 4287.66

Now we want to view the forecast we created

prophet <- as.data.frame(predictions)
View (prophet)