Time series analysis extracts patterns from time-indexed data in order to understand its characteristics and forecast future values over a short horizon based on previous observations. It is widely used for non-stationary data such as economic indicators, weather records, stock prices, and retail sales.
We will use the retail dataset of a global superstore, which spans 4 years of data.
Load the Libraries
library(tidyverse)
library(lubridate) # date manipulation
library(padr) # TS padding
library(zoo) # for missing value TS
library(fpp) # TS dataset
library(TSstudio) # TS interactive viz
library(forecast) # forecasting algorithm
library(tseries) # adf.test
theme_set(theme_minimal())
superstore <- readxl::read_excel("Sample - Superstore.xls")
head(superstore, 5)
Let's take a glimpse of the sales per category in the superstore dataset.
sales_year <-
superstore %>%
select(Order_Date = `Order Date`,
Category, Sales) %>%
mutate(Order_Date = as.Date(Order_Date),
Year = year(Order_Date)) %>%
group_by(Category, Year) %>%
summarise(Sales = sum(Sales))
ggplot(sales_year,
aes(x=Year, y = Sales)) +
geom_point(aes(color = Category), size =3.5) +
geom_line(aes(color = Category)) +
scale_color_manual(values = c("#1d91c0", "#7fcdbb","#253494")) +
labs(subtitle = "Superstore Sales Category Over The Years",
x = "", y = "Total Sales") +
theme(legend.position = "bottom",
legend.title = element_blank())
There are several product categories in the Superstore sales data; we will start with time series analysis and forecasting for the Furniture category. The time series models we build here apply specifically to univariate data, so we only choose one product category.
furniture <-
superstore %>%
filter(Category == "Furniture")
Inspecting the range of the Order Date of Furniture:
range(furniture$`Order Date`)
## [1] "2014-01-06 UTC" "2017-12-30 UTC"
So we indeed have almost 4 years of Furniture sales data.
Now, we select only the crucial columns for the time series.
furni_clean <-
furniture %>%
select(order_date = `Order Date`,
sales = Sales) %>%
mutate(order_date = ymd(order_date))
Checking for NA values:
colSums(is.na(furni_clean))
## order_date sales
## 0 0
One of the requirements for time series analysis is that the data needs to be ordered by date, so we will arrange the Furniture sales data by date later on. But first, let's inspect our data:
furni_clean %>%
arrange(order_date) %>%
head(10)
From the first 10 rows of the dataset, we can see that there are gaps of several days; for example, the data starts with 2014-01-06, 2014-01-07, then jumps to 2014-01-10. We can also see that several dates have multiple observations, e.g. 2014-01-13.
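We can verify both issues with a quick check (a sketch using dplyr verbs, already loaded via tidyverse):
# dates with more than one transaction
furni_clean %>%
  count(order_date) %>%
  filter(n > 1) %>%
  head(5)
# gaps of more than one day between consecutive order dates
furni_clean %>%
  distinct(order_date) %>%
  arrange(order_date) %>%
  mutate(gap_days = as.numeric(order_date - lag(order_date))) %>%
  filter(gap_days > 1) %>%
  head(5)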
This can be tricky to work with, so we will instead use the average daily sales value for each month, with the start of the month as the time stamp. In other words, we reshape the data into monthly intervals by averaging the daily Furniture sales.
Arranging the data by date and handling the multiple observations:
furniture_clean <-
  furni_clean %>%
  # arranging the data by date
  arrange(order_date) %>%
  # adding up all observations that share a date
  group_by(order_date) %>%
  summarise(sales = sum(sales))
head(furniture_clean,20)
Summarizing the average daily sales for each month, then making the time stamp the start date of each month.
furni_clean <-
  furniture_clean %>%
  # making the monthly time stamp (first day of each month)
  mutate(month_year = floor_date(order_date, unit = "month")) %>%
  # computing the average daily sales per month
  group_by(month_year) %>%
  summarise(sales = mean(sales))
head(furni_clean,12)
ggplot(data = furni_clean, aes(x = month_year, y = sales)) +
geom_point() +
geom_line()
We can see a repeated pattern when we plot the data: the time series has a seasonal pattern every 12 months (an annual pattern). Sales are usually low at the beginning of the year, especially every February, while at the end of the year sales tend to be high.
furniture_ts <- ts(data = furni_clean$sales,
                   frequency = 12, # monthly data, yearly pattern
                   start = c(2014, 1))
furniture_dc <-
furniture_ts %>%
decompose()
furniture_dc %>%
autoplot()
The result of the decomposition above validates that the frequency we set in the time series object is correct. Moreover, the plot shows that furniture sales fluctuate, along with an obvious seasonality.
Seasonality analysis helps us find out at which points the data value is high or low within the seasonal period we observe. We can use the $seasonal component of the decomposition to create a seasonal bar plot.
furni_clean %>%
mutate(month = month(month_year, label = T),
seasonal = furniture_dc$seasonal) %>%
# making the visualization
ggplot(aes(x = month, y = seasonal)) +
geom_col()
Before modeling, we split the series: the last 12 months are held out as the test set, and everything before that is used for training.
# the last 12 months for the test set
furni_test <- tail(furniture_ts, 1*12)
# the rest of the data for the training set
furni_train <- head(furniture_ts, -1*12)
For ARIMA models, the data needs to be stationary. So we will use the ADF test to see whether our data is already stationary. The ADF test belongs to a category of tests called 'unit root tests', which are the proper method for testing the stationarity of a time series. A unit root is a characteristic of a time series that makes it non-stationary.
adf.test(furni_train)
##
## Augmented Dickey-Fuller Test
##
## data: furni_train
## Dickey-Fuller = -2.5061, Lag order = 3, p-value = 0.3762
## alternative hypothesis: stationary
The p-value (0.3762) is greater than alpha (0.05), so we fail to reject the null hypothesis: the data is not stationary.
The data are clearly non-stationary, with some seasonality, so we will first take a seasonal difference.
furni_train %>%
diff(lag=12) %>%
ggtsdisplay()
These also appear to be non-stationary, so we take an additional first difference.
furni_train %>%
diff(lag=12) %>%
diff() %>%
ggtsdisplay()
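As a cross-check, the forecast package's nsdiffs() and ndiffs() functions estimate how many seasonal and regular differences are needed; a minimal sketch (the suggested values can vary with the test used):
# estimated number of seasonal differences (D)
nsdiffs(furni_train)
# estimated number of regular differences (d) after seasonal differencing
ndiffs(diff(furni_train, lag = 12))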
Our aim now is to find an appropriate ARIMA model based on the ACF and PACF shown above.
Let's take a closer look at the ACF and PACF.
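For example, we can draw the ACF and PACF of the doubly differenced series individually with forecast's ggAcf() and ggPacf():
# ACF of the seasonally and first differenced series
furni_train %>%
  diff(lag = 12) %>%
  diff() %>%
  ggAcf()
# PACF of the same differenced series
furni_train %>%
  diff(lag = 12) %>%
  diff() %>%
  ggPacf()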
The significant spike at lag 1 in the ACF suggests a non-seasonal MA(1) component and a seasonal MA(0) component. Analogous logic applied to the PACF suggests a non-seasonal AR(1) component and a seasonal AR(0) component. Consequently, we can begin with an ARIMA(1,1,1)(0,1,0)[12] model.
furni_sarima <- Arima(furni_train,
order = c(1,1,1), #p,d,q
seasonal = c(0,1,0)) #P,D,Q
summary(furni_sarima)
## Series: furni_train
## ARIMA(1,1,1)(0,1,0)[12]
##
## Coefficients:
## ar1 ma1
## 0.2106 -1.0000
## s.e. 0.2355 0.2499
##
## sigma^2 estimated as 74861: log likelihood=-162.04
## AIC=330.09 AICc=331.35 BIC=333.5
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -47.38189 208.9705 130.6237 -9.751958 18.06846 0.6044197
## ACF1
## Training set -0.04944231
We see that the ARIMA(1,1,1)(0,1,0)[12] model has a training MAPE of 18.07.
Now we can compare this manual SARIMA model to the automated one, auto.arima.
Auto SARIMA
furni_auto <- auto.arima(y = furni_train, seasonal = T)
summary(furni_auto)
## Series: furni_train
## ARIMA(0,0,0)(1,1,0)[12]
##
## Coefficients:
## sar1
## -0.7524
## s.e. 0.1254
##
## sigma^2 estimated as 30799: log likelihood=-162.58
## AIC=329.15 AICc=329.72 BIC=331.51
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 37.26026 140.2746 99.44913 3.059762 13.0486 0.4601694 0.007705514
Auto SARIMA produced the ARIMA(0,0,0)(1,1,0)[12] model with a training MAPE of 13.05.
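To compare the two fits side by side, we can stack their training error measures with forecast's accuracy() (a small sketch):
# training error measures of the manual and automated models
rbind("ARIMA(1,1,1)(0,1,0)[12]" = accuracy(furni_sarima)["Training set", ],
      "ARIMA(0,0,0)(1,1,0)[12]" = accuracy(furni_auto)["Training set", ])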
library(gridExtra)
a <- furni_train %>%
autoplot(series = "Actual") +
autolayer(furni_auto$fitted, series = "ARIMA(0,0,0)(1,1,0)[12]")
b <- furni_train %>%
autoplot(series = "Actual") +
autolayer(furni_sarima$fitted, series = "ARIMA(1,1,1)(0,1,0)[12]")
grid.arrange(b,a)
The plots above show that we can try to fit better models with other AR() and MA() terms and see which one gives a smaller AICc value (see the sketch after the table).
AICc values for seasonal ARIMA model comparison:

| SARIMA Model | AICc |
|---|---|
| ARIMA(0,0,0)(1,1,0)[12] | 328.88 |
| ARIMA(1,1,1)(0,1,0)[12] | 331.1 |
| ARIMA(1,1,1)(0,1,1)[12] | 326.3 |
| ARIMA(1,1,1)(1,1,0)[12] | 324.13 |
| ARIMA(1,1,1)(1,1,1)[12] | 327.43 |
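One way to generate such a comparison is to loop over the candidate orders and collect each model's AICc; a minimal sketch (the exact values may differ slightly from the table above depending on estimation settings):
# refit the candidate models and collect their AICc values
orders <- list(c(0,0,0, 1,1,0),
               c(1,1,1, 0,1,0),
               c(1,1,1, 0,1,1),
               c(1,1,1, 1,1,0),
               c(1,1,1, 1,1,1))
sapply(orders, function(x) {
  Arima(furni_train, order = x[1:3], seasonal = x[4:6])$aicc
})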
Based on these explorations, ARIMA(1,1,1)(1,1,0)[12] gives the smallest AICc value. So we will evaluate the residuals from the fitted ARIMA(1,1,1)(1,1,0)[12] model for the Furniture sales data.
furni_sarima2 <- Arima(furni_train,
order = c(1,1,1), #p,d,q
seasonal = c(1,1,0)) #P,D,Q
checkresiduals(furni_sarima2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1)(1,1,0)[12]
## Q* = 5.4221, df = 4, p-value = 0.2467
##
## Model df: 3. Total lags used: 7
Though it is not perfect, our model diagnostics suggest that the model residuals are approximately normally distributed. Thus, we now have a seasonal ARIMA model that passes the required checks and is ready for forecasting.
We will forecast the sales of Furniture for the next year (12 months), using our selected model.
furni_sarima2_f <- forecast(furni_sarima2, h = 1*12)
autoplot(furni_sarima2_f, series = "ARIMA", fcol = "blue") +
autolayer(furniture_ts, series = "Actual", color = "black")
The plot compares the observed values with the forecast. Overall, the results of this model align with the true values quite well: the trend is upward, and the seasonality is also captured.
print(paste("RMSE model Seasonal ARIMA:",
            round(accuracy(object = furni_sarima2_f, x = furni_test)["Test set", "RMSE"], 2)))
## [1] "RMSE model Seasonal ARIMA: 133.84"
The Root Mean Squared Error (RMSE) of the forecast is 133.84.
RMSE tells us that the model was able to forecast the average daily furniture sales in the test set to within 133.84 of the real sales. Since the average daily furniture sales range from about 400 to over 1200, this is a pretty good model so far.
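To put the RMSE in context, we can check the range of the held-out test values directly (a quick sketch):
# range of the monthly average daily sales in the test set
range(furni_test)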
Finally, we check the residual assumptions, starting with the Ljung-Box test for autocorrelation:
Box.test(furni_sarima2$residuals, type = "Ljung-Box")
##
## Box-Ljung test
##
## data: furni_sarima2$residuals
## X-squared = 0.1323, df = 1, p-value = 0.7161
The p-value > alpha (0.05), so there is no autocorrelation in the residuals.
hist(furni_sarima2$residuals, breaks = 20)
Based on the histogram above, it is still ambiguous whether the distribution of the residuals is normal or not. To be more sure, we will run the Shapiro-Wilk test.
shapiro.test(furni_sarima2$residuals)
##
## Shapiro-Wilk normality test
##
## data: furni_sarima2$residuals
## W = 0.91868, p-value = 0.01152
The p-value < alpha (0.05), so the residuals are not normally distributed.
Based on the assumption checks, there is no autocorrelation in our forecast residuals (p-value > 0.05). However, the residuals are not normally distributed, so they may not be centered around their mean as seen in the histogram. Still, if we inspect the residuals through a line plot, they actually resemble the error plot from our time series decomposition.
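The line plot mentioned above can be drawn as follows (a sketch using forecast's autoplot() method for ts objects):
# residuals over time, for comparison with the decomposition's error component
autoplot(furni_sarima2$residuals) +
  labs(title = "Residuals of ARIMA(1,1,1)(1,1,0)[12]",
       x = "", y = "Residual")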