Introduction

Time series forecasting is the process of making predictions about the future based on past data.

To perform time-series forecasting, our data must satisfy the following prerequisites:

  1. There should be no missing periods
  2. There should be no missing values
  3. The data must be in chronological order

A time series can be decomposed into several components:

  1. Seasonal : the repeating cyclical pattern in the series
  2. Trend : the movement of the mean throughout the series
  3. Error : the random variation in the series not captured by the trend and seasonal components

There are two types of relationships between the three components:

  • Additive:

\[ Y(t) = Trend + Seasonal + Error \]

  • Multiplicative:

\[ Y(t) = Trend \times Seasonal \times Error \]

In an additive series, the magnitude of the seasonal fluctuations stays roughly constant over time, whereas in a multiplicative series the magnitude of the fluctuations grows with the level of the series.
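To see the difference concretely, here is a minimal sketch using R’s built-in AirPassengers series (chosen purely for illustration; it is not our project data), whose seasonal swings visibly grow with the level:

library(forecast) # provides autoplot() for decomposed series
# Compare the two decomposition types on the same series
autoplot(decompose(AirPassengers, type = "additive"))
autoplot(decompose(AirPassengers, type = "multiplicative"))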

In this project, we are going to implement two different time-series forecasting methods on our data:

  • STLM : forecasting based on Seasonal and Trend decomposition using Loess (STL)

Forecasts of STL objects are obtained by applying a non-seasonal forecasting method to the seasonally adjusted data and re-seasonalizing using the last year of the seasonal component.

  • TBATS : Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components

The TBATS model combines Fourier terms with an exponential smoothing state space model and a Box-Cox transformation. It allows the seasonality to change slowly over time, an advantage over harmonic regression, which only allows the seasonal pattern to repeat without changing.
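As a simplified illustration (with fixed coefficients; in TBATS the corresponding terms themselves evolve over time), a seasonal pattern of period m can be represented by k pairs of Fourier terms:

\[ s_t = \sum_{j=1}^{k} \left( a_j \cos\frac{2\pi j t}{m} + b_j \sin\frac{2\pi j t}{m} \right) \]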

Business Case

In this project, we are going to utilize our time-series forecasting abilities to assist the US Government in identifying whether the country’s demand for sugar can be met by its own production.

Pre-Modeling Requirements

Loading Libraries

First of all, we load all the required packages for our project.

library(forecast)   # modeling and forecasting tools
library(dplyr)      # data wrangling
library(lubridate)  # date parsing and manipulation
library(TSstudio)   # time-series utilities (e.g. test_forecast)
library(padr)       # padding missing time periods
library(ggplot2)    # static plots
library(plotly)     # interactive plots
library(tseries)    # statistical tests (e.g. adf.test)

Loading dataset

sugar<-read.csv("IPG3113N.csv")
rmarkdown::paged_table(sugar)

This data is a production index of Non-Durable Goods: Sugar and Confectionery Product in the United States. The series is an index with 2012 as its base period (2012 = 100). The data is recorded monthly and spans January 1972 through February 2021.

Data Preprocessing

From the previous step, we can see that the value column is still named after the FRED series ID. Let’s rename it to something more descriptive.

sugar<-sugar %>% rename(Total_produced=IPG3113N) 

Since we will create a time-series object in a later step, we convert the DATE column into a date object using the ymd function from the lubridate package, and extract a labelled month column.

sugar <- sugar %>% 
  arrange(DATE) %>% 
  mutate(DATE = ymd(DATE),
         month = month(DATE, label = TRUE))

Now, let’s see if the data has any missing values.

colSums(is.na(sugar))
##           DATE Total_produced          month 
##              0              0              0

None. Perfect, let’s move to the next step.

Using the pad function from the padr package, we can check whether our data has any missing periods: pad inserts an NA row for every missing month, so any new NAs after padding would reveal a gap.

sugar %>% arrange(DATE) %>% pad(interval = "month") %>% is.na() %>% colSums()
##           DATE Total_produced          month 
##              0              0              0

There are none. Perfect.

Creating Time-Series Object

# start is given as c(year, period); frequency = 12 marks monthly data
sugar_ts <- ts(data = sugar$Total_produced, start = c(1972, 1), frequency = 12)
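As an optional sanity check, TSstudio’s ts_info function summarizes the frequency and span of the new object:

# Should report a monthly series with 590 observations
ts_info(sugar_ts)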

Exploratory Data Analysis

First of all, let’s plot our data.

a <- ggplot(sugar, aes(x = DATE, y = Total_produced)) +
  geom_line() +
  theme_classic() +
  xlab("Time") + ylab("Index") +
  ggtitle(label = "Sugar and Confectionery Product Produce Index")
ggplotly(a)

Upon initial inspection, it seems like the highs and lows of the data occur in May and November. Let’s highlight the observations for those two months.

b<-ggplot(sugar,aes(x=DATE,y=Total_produced)) +
  geom_point(data = sugar %>% filter(month %in% c("May","Nov")),
             mapping = aes(x = DATE, y = Total_produced, color = month)) +
  scale_color_manual(values = c("red", "magenta")) +xlab("Time")+ylab("Index")+
  geom_line()+theme_classic()+ggtitle(label="Sugar and Confectionery Product Produce Index")
ggplotly(b)

Though not every peak and trough falls exactly in May or November, our initial inspection seems fairly accurate.
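A seasonal plot offers another (optional) way to inspect the monthly pattern, overlaying each year of the series:

# Each line is one year; consistent dips and peaks across years confirm seasonality
ggseasonplot(sugar_ts) + ggtitle("Seasonal plot: Sugar and Confectionery Product Produce Index")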

Next, let’s decompose the components of our time-series object.

sugar_ts %>% decompose() %>% autoplot()

Upon initial inspection, the trend component is not quite smooth yet. This may be an indication of multiple seasonality in the data. We can also see that the seasonality is in fact additive: the magnitude of the fluctuations in the seasonal component stays roughly the same.

Now, we assume that our data has an annual and a three-year seasonal pattern (periods of 12 and 36 months).

sugar_ts %>% 
  msts(seasonal.periods = c(12,12*3)) %>% 
  mstl() %>% 
  autoplot()

We can see that the trend is still not quite smooth, so next we assume that our data has a six-year and a three-year seasonal pattern (periods of 72 and 36 months).

sugar_ts %>% 
  msts(seasonal.periods = c(12*6,12*3)) %>% 
  mstl() %>% 
  autoplot()

Our trend component is now quite smooth, so we create a multiple-seasonality time series object with the msts function, which can handle more than one seasonal period.

sugar_msts <- sugar$Total_produced %>% 
 msts(seasonal.periods = c(12*6,12*3)) 

All in all, we can conclude that:

  • Our data has an additive seasonal pattern
  • Our data contains two seasonal patterns: a three-year and a six-year cycle

Stationarity Checking

Prior to the modeling process, stationarity in our data is essential. Let’s check it using the augmented Dickey-Fuller test, which holds the following hypotheses:

  • H0 : The data has a unit root and is therefore not stationary
  • H1 : The data does not have a unit root and is therefore stationary

Setting alpha = 0.05, we obtain :

adf.test(sugar_msts)
## Warning in adf.test(sugar_msts): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  sugar_msts
## Dickey-Fuller = -4.022, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary

From the test we obtain p-value < 0.05; thus we reject the null hypothesis and conclude that our data is stationary.
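Had the test failed to reject, a common remedy (shown here purely as a hypothetical fallback, since our series is already stationary) would be first-order differencing followed by a re-test:

# Not needed for our data; differencing often removes a unit root
adf.test(diff(sugar_msts))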

Train-test Split

Now, we are going to split the data into training and testing sets. We will use the first 39 years of our data (468 months) as the training set and the remaining 122 months as the testing set.

df_train <- sugar_msts %>% head(468)
df_test  <- sugar_msts %>% tail(122)
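A quick sanity check that the split covers the entire series (468 + 122 = 590 months):

length(df_train) # 468 months = 39 years
length(df_test)  # 122 months, roughly the last 10 years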

Modeling

Modeling using stlm

The stlm model offers two underlying methods: ets and arima. Both ETS (Holt-Winters) and seasonal ARIMA can be used for data that has trend and seasonal components.

Since our time-series data is already stationary, we set lambda to NULL, meaning no Box-Cox transformation will be applied to our data.
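For reference, the Box-Cox transformation that lambda would otherwise control is defined as:

\[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda}-1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases} \]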

# ets holt-winters
sugar_ets <- stlm(df_train, 
                  method = "ets",
                  lambda = NULL)  #no box-cox transformation will be performed
# seasonal ARIMA
sugar_arima <- stlm(df_train, 
                    method = "arima",
                    lambda =NULL) #no box-cox transformation will be performed
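Optionally, we can inspect what was fitted to the seasonally adjusted series; stlm stores the underlying fit in its model component:

# Inspect the automatically selected specifications
sugar_ets$model   # chosen ETS form
sugar_arima$model # chosen ARIMA order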

Modeling using tbats

Now, we build the model using the tbats model.

tbats_mod <- df_train %>% 
            tbats(use.box.cox = FALSE, # No Box Cox transformation
                  use.trend = TRUE, # Include trend component
                  use.damped.trend = TRUE) # Include a damping parameter in the trend

Forecasting

Forecast

pred_ets   <- forecast(sugar_ets, h = 122)
pred_arima <- forecast(sugar_arima, h = 122)
pred_tbats <- forecast(tbats_mod, h = 122)
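Before the side-by-side comparisons below, an optional quick look at a single forecast:

# Plots the TBATS point forecast with its prediction intervals
autoplot(pred_tbats)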

Forecast Plotting

After producing forecasts from all three models, we will create some visualizations to get a visual understanding of how each model performed.

stlm ETS Holt-Winters

test_forecast(actual = sugar_msts, 
              forecast.obj =pred_ets , 
              train = df_train, 
              test = df_test)

stlm SARIMA

test_forecast(actual = sugar_msts, 
              forecast.obj =pred_arima , 
              train = df_train, 
              test = df_test)

tbats model

test_forecast(actual = sugar_msts, 
              forecast.obj =pred_tbats, 
              train = df_train, 
              test = df_test)

Evaluation

Now, we evaluate and compare the models’ performance using error metrics.

metrics <- rbind(accuracy(pred_ets$mean, df_test),
                 accuracy(pred_arima$mean, df_test),
                 accuracy(pred_tbats$mean, df_test))
rownames(metrics) <- c("ETS Holt-Winters", "Seasonal ARIMA", "TBATS Model")
rmarkdown::paged_table(data.frame(metrics))

Using RMSE as the comparison metric, the tbats model performed the best of our three models, with the lowest RMSE of 12.54675. The tbats model also has the lowest values on every other error metric. Thus, by performance, the tbats model is the best model, and we select it as our final model.
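For reference, RMSE is the root mean squared error of the point forecasts against the n = 122 test observations:

\[ RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2} \]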

Assumption Checking

In time-series modeling, there are two assumptions on the residuals:

  • Normally Distributed Residuals
  • Residuals are not autocorrelated

Setting alpha to 0.05, we will test whether these two assumptions hold for our final model, the tbats model.

Normality

The Shapiro-Wilk test for normality of the residuals holds the following hypotheses:

  • H0: The residuals of the model are normally distributed
  • H1: The residuals of the model are not normally distributed

shapiro.test(residuals(tbats_mod))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(tbats_mod)
## W = 0.99261, p-value = 0.02068

We obtain a p-value of 0.02068, which is less than 0.05. This means the null hypothesis is rejected; thus our model’s residuals are not normally distributed.

Auto-Correlation

The Ljung-Box test for autocorrelation of the residuals holds the following hypotheses:

  • H0: The residuals are not autocorrelated
  • H1: The residuals are autocorrelated

Box.test(residuals(tbats_mod), type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  residuals(tbats_mod)
## X-squared = 0.012747, df = 1, p-value = 0.9101

We obtain a p-value of 0.9101, which is greater than 0.05. This means the null hypothesis is not rejected, and thus our model’s residuals show no autocorrelation.

Conclusion

All in all, our best performing model is the tbats model. However, the assumption of normally distributed residuals is not fulfilled by this model.

Other methods, such as neural networks for time-series forecasting or bagging, might be able to fulfill this assumption and provide more robust predictions, giving better insight to the US Government; a brief example follows.
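Below is a minimal sketch of one such alternative using forecast’s nnetar, an autoregressive neural network. Hyperparameters are left at their defaults, the same train/test split is assumed, and the results are untested here:

# NNAR: feed-forward neural network fitted on lagged values of the series
nn_mod  <- nnetar(df_train)
pred_nn <- forecast(nn_mod, h = 122)
accuracy(pred_nn$mean, df_test) # compare against the TBATS metrics above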

Knowledge of future production quantities will help the government allocate resources between imported and locally produced sugar, ensuring that the country’s needs are met.

References

De Livera, A. M., Hyndman, R. J., & Snyder, R. D. (2011). Forecasting time series with complex seasonal patterns using exponential smoothing. Journal of the American Statistical Association, 106(496), 1513-1527. Retrieved from https://robjhyndman.com/publications/complex-seasonality/

Federal Reserve Bank of St. Louis. (2021). Industrial Production: Manufacturing: Non-Durable Goods: Sugar and Confectionery Product (NAICS = 3113). Retrieved from https://fred.stlouisfed.org/series/IPG3113N