1 Introduction

Time series analysis extracts patterns from time-indexed data in order to forecast future values and describe other characteristics of the data. A time series model forecasts values over a short horizon based on previous observations. It is widely used for non-stationary data, such as economic data, weather data, stock prices, and retail sales.

We will use the retail dataset of a global superstore, which spans 4 years of data.

Load the Libraries

library(tidyverse)
library(lubridate) # date manipulation
library(padr) # TS padding
library(zoo) # handling missing values in TS
library(fpp) # TS dataset
library(TSstudio) # TS interactive viz

library(forecast) # forecasting algorithm
library(tseries) # adf.test
theme_set(theme_minimal())

2 Inspecting The Data

2.1 Load The Data

superstore <- readxl::read_excel("Sample - Superstore.xls")
head(superstore, 5)

We will take a glimpse of the yearly sales per category in the superstore dataset.

sales_year <-
  superstore %>% 
  select(Order_Date = `Order Date`, 
         Category, Sales) %>% 
  mutate(Order_Date = as.Date(Order_Date),
         Year = year(Order_Date)) %>% 
  group_by(Category, Year) %>% 
  summarise(Sales = sum(Sales)) 
ggplot(sales_year, 
       aes(x=Year, y = Sales)) + 
  geom_point(aes(color = Category), size = 3.5) + 
  geom_line(aes(color = Category)) +
  scale_color_manual(values = c("#1d91c0", "#7fcdbb","#253494"))  +
  labs(subtitle = "Superstore Sales Category Over The Years",
     x = "", y = "Total Sales") +
  theme(legend.position = "bottom",
        legend.title = element_blank())

2.2 Select The Furniture Data

There are several product categories in the Superstore sales data; we will start with time series analysis and forecasting for the Furniture category. The time series models we build here apply specifically to univariate data, therefore we choose only one product category.

furniture <- 
superstore %>% 
  filter(Category == "Furniture")

2.3 Evaluate The Time Series Data Characteristics

Inspecting the range of the Order Date of Furniture:

range(furniture$`Order Date`)
## [1] "2014-01-06 UTC" "2017-12-30 UTC"

So we indeed have almost 4 years of Furniture sales data.

Now, selecting only the crucial columns for the time series.

furni_clean <- 
furniture %>% 
  select(order_date = `Order Date`, 
         sales = Sales) %>% 
  mutate(order_date = ymd(order_date))

Checking for NA values:

colSums(is.na(furni_clean))
## order_date      sales 
##          0          0

One of the requirements for time series analysis is that the data needs to be ordered by date, so we will arrange the Furniture sales data by date for later. But first, let's inspect our data:

furni_clean %>% 
  arrange(order_date) %>% 
  head(10)

From the first 10 rows of the dataset, we can see that there are gaps of several days; for example, the data starts with 2014-01-06, 2014-01-07, then jumps to 2014-01-10. On top of that, we can also see that several dates have multiple observations, e.g. 2014-01-13.
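We can verify this directly. Below is a quick sketch (using dplyr's count(), loaded via tidyverse) that lists the dates carrying more than one observation:

# count how many observations share the same order date
furni_clean %>% 
  count(order_date, name = "n_obs") %>% 
  filter(n_obs > 1) %>% 
  head()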

This can be tricky to work with, so we will use the average daily sales value for each month instead, using the start of the month as the time stamp. In other words, we are reshaping the data into a monthly interval by taking the average daily Furniture sales.

Arranging the data by date and handling multiple observations.

furniture_clean <- 
furni_clean %>% 
  # arranging the data by date
  arrange(order_date) %>% 
  
  # adding up all observations that share a date
  group_by(order_date) %>% 
  summarise(sales = sum(sales))
head(furniture_clean,20)

Summarizing the average daily sales for each month, then using the start date of each month as the time stamp.

furni_clean <- 
furniture_clean %>%  
  # making the monthly time stamp
  mutate(month_year = format(order_date, "%Y-%m"),
         month_year = as.Date(paste(month_year, "-01", sep = ""))) %>% 
  
  # computing the average daily sales per month
  group_by(month_year) %>% 
  summarise(sales = mean(sales))

head(furni_clean,12)

2.4 Visualizing Furniture Sales Time Series Data

ggplot(data = furni_clean, aes(x = month_year, y = sales)) +
  geom_point() +
  geom_line()

We can see a repeated pattern when we plot the data: the time series has a seasonal pattern every 12 months (annual pattern). Sales are usually low at the beginning of the year, especially every February, while at the end of the year sales tend to be high.

3 Time Series Object

3.1 Making the Time Series Object

furniture_ts <- ts(data = furni_clean$sales,
   frequency = 12, # monthly data, yearly pattern
   start = c(2014,1))
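Before decomposing, we can double-check the suspected annual pattern with a seasonal plot. This is an optional extra step, sketched here with forecast::ggseasonplot() (the forecast package is loaded above):

# one line per year; similar shapes across years indicate a stable annual pattern
ggseasonplot(furniture_ts, year.labels = TRUE) +
  labs(title = "Seasonal plot of monthly average Furniture sales")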

3.2 Decomposition

furniture_dc <- 
furniture_ts %>% 
  decompose() 

furniture_dc %>% 
  autoplot()

The result of the decomposition above validates that the frequency we set in the time series object is correct. Moreover, the plot shows that furniture sales fluctuate, along with an obvious seasonality.

3.3 Seasonality Analysis

Seasonality analysis helps us find out at which times the data value is high or low within the seasonal period we observe. We can use the $seasonal component of the decomposition to create a seasonal bar plot.

furni_clean %>% 
  mutate(month = month(month_year, label = T),
         seasonal = furniture_dc$seasonal) %>% 
  
  # making the visualization
  ggplot(aes(x = month, y = seasonal)) + 
  geom_col()
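To read the extremes off numerically, here is a short sketch; it relies on the fact that the seasonal component from decompose() repeats identically every 12 months, so one cycle is enough:

# extract a single 12-month cycle of the seasonal component
one_cycle <- window(furniture_dc$seasonal, start = c(2014, 1), end = c(2014, 12))
month.abb[which.max(one_cycle)]  # month with the highest seasonal effect
month.abb[which.min(one_cycle)]  # month with the lowest seasonal effect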

4 Forecasting with Seasonal ARIMA (SARIMA)

4.1 Cross Validation

# the last 12 months as the test data
furni_test <- tail(furniture_ts, 1*12)

# the rest of the data as the training data
furni_train <- head(furniture_ts, -1*12)
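A quick sanity check on the split sizes (the counts below assume all 48 months from 2014-01 to 2017-12 are present):

# expect 36 training months and 12 test months
length(furni_train)
length(furni_test)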

4.2 ADF test

In ARIMA models, the data we use needs to be stationary. So, right now we will use the ADF test to see whether our data is already stationary or not. The ADF test belongs to a category of tests called 'Unit Root Tests', which are the proper method for testing the stationarity of a time series. A unit root is a characteristic of a time series that makes it non-stationary.

adf.test(furni_train)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  furni_train
## Dickey-Fuller = -2.5061, Lag order = 3, p-value = 0.3762
## alternative hypothesis: stationary

We see that the p-value > alpha (0.05), which means the data is not stationary.

The data are clearly non-stationary, with some seasonality, so we will first take a seasonal difference.

furni_train %>% 
  diff(lag=12) %>% 
  ggtsdisplay()

These also appear to be non-stationary, so we take an additional first difference.

furni_train %>% 
  diff(lag=12) %>% 
  diff()  %>% 
  ggtsdisplay()
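As a cross-check (an optional step, not strictly required), the forecast package provides heuristics for the number of differences needed:

# estimated number of seasonal (D) and non-seasonal (d) differences
nsdiffs(furni_train)                 # seasonal differences
ndiffs(diff(furni_train, lag = 12))  # first differences after seasonal differencing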

Our aim now is to find an appropriate ARIMA model based on the ACF and PACF shown above.

4.3 Seasonal ARIMA Models

Let's take a closer look at the ACF and PACF.

The significant spike at lag 1 in the ACF suggests a non-seasonal MA(1) component and a seasonal MA(0) component. Analogous logic applied to the PACF suggests a non-seasonal AR(1) component and a seasonal AR(0) component. Consequently, we can begin with an ARIMA(1,1,1)(0,1,0)[12].

furni_sarima <- Arima(furni_train, 
                       order = c(1,1,1), #p,d,q
                       seasonal = c(0,1,0)) #P,D,Q
summary(furni_sarima)
## Series: furni_train 
## ARIMA(1,1,1)(0,1,0)[12] 
## 
## Coefficients:
##          ar1      ma1
##       0.2106  -1.0000
## s.e.  0.2355   0.2499
## 
## sigma^2 estimated as 74861:  log likelihood=-162.04
## AIC=330.09   AICc=331.35   BIC=333.5
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set -47.38189 208.9705 130.6237 -9.751958 18.06846 0.6044197
##                     ACF1
## Training set -0.04944231

We see that the ARIMA(1,1,1)(0,1,0)[12] model's training MAPE is 18.07.

Now we can compare this manual SARIMA model to an automated one.

Auto SARIMA

furni_auto <- auto.arima(y = furni_train, seasonal = T)
summary(furni_auto)
## Series: furni_train 
## ARIMA(0,0,0)(1,1,0)[12] 
## 
## Coefficients:
##          sar1
##       -0.7524
## s.e.   0.1254
## 
## sigma^2 estimated as 30799:  log likelihood=-162.58
## AIC=329.15   AICc=329.72   BIC=331.51
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE    MAPE      MASE        ACF1
## Training set 37.26026 140.2746 99.44913 3.059762 13.0486 0.4601694 0.007705514

Auto SARIMA produced the ARIMA(0,0,0)(1,1,0)[12] model, with a training MAPE of 13.05.

library(gridExtra)

a <- furni_train %>% 
  autoplot(series = "Actual") +
  autolayer(furni_auto$fitted, series = "ARIMA(0,0,0)(1,1,0)[12]") 

b <- furni_train %>% 
  autoplot(series = "Actual") +
  autolayer(furni_sarima$fitted, series = "ARIMA(1,1,1)(0,1,0)[12]")

grid.arrange(b,a)

The graphs above show that we can try fitting models with other AR and MA terms, as summarized in the comparison below, and see which one gives a smaller AICc value.

AICc Values for Seasonal ARIMA Models Comparison

SARIMA Model              AICc
ARIMA(0,0,0)(1,1,0)[12]   329.72
ARIMA(1,1,1)(0,1,0)[12]   331.35
ARIMA(1,1,1)(0,1,1)[12]   326.3
ARIMA(1,1,1)(1,1,0)[12]   324.13
ARIMA(1,1,1)(1,1,1)[12]   327.43
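A table like this can be generated with a short loop. The sketch below refits the candidate models (the orders are taken from the rows above) and extracts their AICc values:

# fit each candidate SARIMA on the training data and collect its AICc
candidates <- list(c(0,0,0, 1,1,0),
                   c(1,1,1, 0,1,0),
                   c(1,1,1, 0,1,1),
                   c(1,1,1, 1,1,0),
                   c(1,1,1, 1,1,1))
sapply(candidates, function(o) {
  Arima(furni_train, order = o[1:3], seasonal = o[4:6])$aicc
})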

Based on these explorations, ARIMA(1,1,1)(1,1,0)[12] gives the smallest AICc value. So, we will evaluate the residuals of the fitted ARIMA(1,1,1)(1,1,0)[12] model for the Furniture sales data.

furni_sarima2 <- Arima(furni_train, 
                       order = c(1,1,1), #p,d,q
                       seasonal = c(1,1,0)) #P,D,Q

checkresiduals(furni_sarima2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(1,1,0)[12]
## Q* = 5.4221, df = 4, p-value = 0.2467
## 
## Model df: 3.   Total lags used: 7

Though it is not perfect, our model diagnostics suggest that the residuals are close to normally distributed. Thus, we now have a seasonal ARIMA model that passes the required checks and is ready for forecasting.

4.4 Forecasting

We will forecast the sales of Furniture for the next year (12 months), using our selected model.

furni_sarima2_f <- forecast(furni_sarima2, h = 1*12)
autoplot(furni_sarima2_f, series = "ARIMA", fcol = "blue") +
  autolayer(furniture_ts, series = "Actual", color = "black")

The black line shows the observed values against the forecast. Overall, the result of this model aligns with the true values quite well: the trend is upwards, and the seasonality is captured.

print(paste("RMSE model Seasonal ARIMA:", round(accuracy(furni_sarima2_f, furni_test)["Test set", "RMSE"], 2)))
## [1] "RMSE model Seasonal ARIMA: 133.84"

The Root Mean Squared Error (RMSE) of the forecast is 133.84.

RMSE tells us that the model was able to forecast the average daily furniture sales in the test set within 133.84 of the real sales. The furniture daily sales range from 400 to over 1200, so this is a pretty good model so far.
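To put that number in context, we can express the error relative to the scale of the test set. A small sketch, reusing the accuracy matrix from above:

# forecast error as a fraction of the mean monthly-average sales in the test set
rmse <- accuracy(furni_sarima2_f, furni_test)["Test set", "RMSE"]
rmse / mean(furni_test)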

5 Model Evaluation

5.1 No Auto-Correlation Residual

Box.test(furni_sarima2$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  furni_sarima2$residuals
## X-squared = 0.1323, df = 1, p-value = 0.7161

Since the p-value > alpha (0.05), there is no autocorrelation in the residuals.

5.2 Normality Residuals

hist(furni_sarima2$residuals, breaks = 20)

Based on the histogram above, it is still ambiguous whether the distribution of the residuals is normal or not. To be more certain, we will perform the Shapiro-Wilk test.

shapiro.test(furni_sarima2$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  furni_sarima2$residuals
## W = 0.91868, p-value = 0.01152

Since the p-value < alpha (0.05), the residuals are not normally distributed.

Based on the assumption checks, there is no autocorrelation in our forecast residuals (p-value > 0.05). However, our forecast residuals are not normally distributed, so they may not be tightly centered around their mean, as seen in the histogram. But if we inspect the residuals through a line plot, they actually resemble the error plot from our time series decomposition.
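That last comparison can be drawn explicitly. A sketch using autoplot() and gridExtra (both loaded earlier); note the random component has NA values at its ends by construction:

# residuals over time next to the random component of the decomposition
p1 <- autoplot(furni_sarima2$residuals) + labs(title = "SARIMA residuals")
p2 <- autoplot(furniture_dc$random) + labs(title = "Decomposition random component")
gridExtra::grid.arrange(p1, p2)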