Time series forecasting is the process of making predictions about the future based on past data.
To perform time-series forecasting, our data must satisfy certain prerequisites.
A time series can be decomposed into three components: a trend, a seasonal component, and an error term.
There are two types of relationships between the three components, additive and multiplicative:
\[ Y(t) = \text{Trend} + \text{Seasonal} + \text{Error} \]
\[ Y(t) = \text{Trend} \times \text{Seasonal} \times \text{Error} \]
In the charts above, we can observe that the magnitude of the fluctuations stays roughly the same under additive seasonality, whereas under multiplicative seasonality the magnitude of the fluctuations increases with the level of the series.
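To make the distinction concrete, here is a minimal sketch on simulated data (not the project data; all numbers are arbitrary) that constructs one series of each type:

# Simulated illustration only: one additive and one multiplicative series
set.seed(42)
t        <- 1:120                    # ten years of monthly observations
trend    <- 100 + 0.5 * t            # linear upward trend
seasonal <- sin(2 * pi * t / 12)     # annual seasonal cycle
error    <- rnorm(120, sd = 1)       # random noise

# Additive: constant seasonal amplitude around the trend
additive <- ts(trend + 10 * seasonal + 2 * error, frequency = 12)

# Multiplicative: seasonal amplitude grows with the level of the series
multiplicative <- ts(trend * (1 + 0.1 * seasonal) * (1 + 0.02 * error),
                     frequency = 12)

plot(cbind(additive, multiplicative),
     main = "Additive vs. multiplicative seasonality")

Plotting the two side by side shows the constant versus growing fluctuation magnitude described above.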
In this project, we are going to implement two different methods of time-series forecasting on our data:
Forecasts of STL objects are obtained by applying a non-seasonal forecasting method to the seasonally adjusted data and re-seasonalizing using the last year of the seasonal component.
The TBATS model combines Fourier terms with an exponential smoothing state space model and a Box-Cox transformation. TBATS allows the seasonal pattern to change slowly over time, an advantage over harmonic regression, which only allows the seasonal pattern to repeat without changing. A minimal sketch of both workflows on a built-in dataset follows.
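As a sketch only (shown on the built-in AirPassengers series rather than our data), the forecast package exposes each approach directly:

library(forecast)
library(ggplot2)

# STL + ETS: forecast the seasonally adjusted series, then re-seasonalize
stl_fc <- stlf(AirPassengers, method = "ets", h = 24)

# TBATS: trigonometric seasonality, Box-Cox, ARMA errors, trend, seasonality
tbats_fc <- forecast(tbats(AirPassengers), h = 24)

autoplot(stl_fc)
autoplot(tbats_fc)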
In this project, we are going to use these time-series forecasting methods to assist the US Government in identifying whether the country's demand for sugar can be fulfilled by its own production.
First of all, we load all the required packages for our project.
library(forecast)
library(dplyr)
library(lubridate)
library(TSstudio)
library(padr)
library(ggplot2)
library(plotly)
library(tseries)

sugar <- read.csv("IPG3113N.csv")
rmarkdown::paged_table(sugar)

This data is a production index of Non-Durable Goods: Sugar and Confectionery Product in the United States. The values form an index with the year 2012 as its base (2012 = 100). The data is recorded monthly and spans January 1972 until February 2021.
From the previous step, we know that the column is still named after its FRED series ID. Thus, we rename the column.
sugar <- sugar %>% rename(Total_produced = IPG3113N)

Since we will create a time-series object in a later step, we convert the DATE column into a date object using the ymd function from the lubridate package.
sugar <- sugar %>%
  arrange(DATE) %>%
  mutate(DATE = ymd(DATE)) %>%
  mutate(month = month(DATE, label = TRUE))

Now, let's see if the data has any missing values.
colSums(is.na(sugar))

##           DATE Total_produced          month 
##              0              0              0
None. Perfect, let's move on to the next step.
Using the pad function from the padr package, we can check whether our data has any missing periods of time.
sugar %>% arrange(DATE) %>% pad(interval = "month") %>% is.na() %>% colSums()

##           DATE Total_produced          month 
##              0              0              0
It seems there are none, perfect.
sugar_ts <- ts(data = sugar$Total_produced,
               start = c(1972, 1), # ts() expects a numeric (year, period) start, not a Date
               frequency = 12)

First of all, let's plot our data.
a <- ggplot(sugar, aes(x = DATE, y = Total_produced)) +
  geom_line() +
  theme_classic() +
  xlab("Time") +
  ylab("Index") +
  ggtitle(label = "Sugar and Confectionery Product Produce Index")
ggplotly(a)

Upon initial inspection, it seems like the highs and lows of the data occur in the months of May and November. Let's color the points in the dataset corresponding to those months.
b <- ggplot(sugar, aes(x = DATE, y = Total_produced)) +
  geom_point(data = sugar %>% filter(month %in% c("May", "Nov")),
             mapping = aes(x = DATE, y = Total_produced, color = month)) +
  scale_color_manual(values = c("red", "magenta")) +
  xlab("Time") +
  ylab("Index") +
  geom_line() +
  theme_classic() +
  ggtitle(label = "Sugar and Confectionery Product Produce Index")
ggplotly(b)

Though not all of the seasonal highs and lows occur in May and November, it seems like our initial inspection was pretty spot on.
Next, let’s decompose the components of our time-series object.
sugar_ts %>% decompose() %>% autoplot()

Upon initial inspection, the trend pattern in our data is not quite smooth yet. This may be an indication of multiple seasonality in the data. We can also see that the seasonality in our data is in fact additive: the magnitude of the fluctuations in the seasonal component stays roughly the same.
Now, we assume that our data has an annual and a three-year seasonal pattern (seasonal periods of 12 and 36 months).
sugar_ts %>%
  msts(seasonal.periods = c(12, 12 * 3)) %>%
  mstl() %>%
  autoplot()

We can see that the trend is still not quite smooth. Next, we assume that our data has a six-year and a three-year seasonal pattern (seasonal periods of 72 and 36 months).
sugar_ts %>%
  msts(seasonal.periods = c(12 * 6, 12 * 3)) %>%
  mstl() %>%
  autoplot()

Our trend pattern is now quite smooth, so we create a multiple-seasonality time series object using the msts function, which can handle time series with multiple seasonal periods.
sugar_msts <- sugar$Total_produced %>%
  msts(seasonal.periods = c(12 * 6, 12 * 3))

All in all, we can conclude that:

- the seasonality in our data is additive, and
- our data has multiple seasonal patterns, captured well by six-year and three-year periods.
Prior to the modeling process, stationarity in our data is essential. Let's check it using the augmented Dickey-Fuller test, which holds the following hypotheses:

- H0: the data is not stationary
- H1: the data is stationary
Setting alpha = 0.05, we obtain:
adf.test(sugar_msts)

## Warning in adf.test(sugar_msts): p-value smaller than printed p-value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  sugar_msts
## Dickey-Fuller = -4.022, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
From the test we obtain a p-value < 0.05; thus, we reject the null hypothesis and conclude that our data is stationary.
Now, we are going to split the data into a training and a testing set. We will use the first 39 years of our data (468 months) as the training set and the rest (122 months) as the testing set.
df_train <- sugar_msts %>% head(468)
df_test  <- sugar_msts %>% tail(122)

stlm

The stlm model offers two methods, ets and arima. ETS Holt-Winters and seasonal ARIMA can both be used for data that has both a seasonal and a trend component.
Since our time series data is already stationary, we set lambda to NULL, which means that no Box-Cox transformation will be applied to our data.
# ETS Holt-Winters
sugar_ets <- stlm(df_train,
                  method = "ets",
                  lambda = NULL) # no Box-Cox transformation will be performed

# seasonal ARIMA
sugar_arima <- stlm(df_train,
                    method = "arima",
                    lambda = NULL) # no Box-Cox transformation will be performed

tbats

Now, we build a model using the tbats function.
tbats_mod <- df_train %>%
  tbats(use.box.cox = FALSE,     # no Box-Cox transformation
        use.trend = TRUE,        # include a trend component
        use.damped.trend = TRUE) # include a damping parameter in the trend

pred_ets   <- forecast(sugar_ets, h = 122)
pred_arima <- forecast(sugar_arima, h = 122)
pred_tbats <- forecast(tbats_mod, h = 122)

After producing forecasts from all three models, we visualize the results to get a visual understanding of how each model performed.
stlm ETS Holt-Winters

test_forecast(actual = sugar_msts,
              forecast.obj = pred_ets,
              train = df_train,
              test = df_test)

stlm SARIMA

test_forecast(actual = sugar_msts,
              forecast.obj = pred_arima,
              train = df_train,
              test = df_test)

tbats model

test_forecast(actual = sugar_msts,
              forecast.obj = pred_tbats,
              train = df_train,
              test = df_test)

Now, we evaluate and compare the models' performance using error metrics.
metrics <- rbind(accuracy(pred_ets$mean, df_test),
                 accuracy(pred_arima$mean, df_test),
                 accuracy(pred_tbats$mean, df_test))
rownames(metrics) <- c("ETS Holt-Winters", "Seasonal ARIMA", "TBATS Model")
rmarkdown::paged_table(data.frame(metrics))

Using RMSE as the comparison metric, we can see that the tbats model performed the best of all our models, with the lowest RMSE of 12.54675. The tbats model also has the lowest value on every other error metric. Thus, we conclude that, by performance, the tbats model is the best model and is therefore our final model.
In a time-series model, there are two residual assumptions:

- the residuals are normally distributed, and
- the residuals have no autocorrelation.
Setting alpha to 0.05, we will test whether these two assumptions are fulfilled in our final model, the tbats model.
The Shapiro-Wilk test, which tests the normality of the residuals, holds the hypotheses:

- H0: the residuals are normally distributed
- H1: the residuals are not normally distributed
shapiro.test(residuals(tbats_mod))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(tbats_mod)
## W = 0.99261, p-value = 0.02068
We obtain a p-value of 0.02068, which is less than 0.05. This means the null hypothesis is rejected; thus, our model's residuals are not normally distributed.
The Ljung-Box test, which tests for autocorrelation in the residuals, holds the hypotheses:

- H0: the residuals have no autocorrelation
- H1: the residuals have autocorrelation
Box.test(residuals(tbats_mod), type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  residuals(tbats_mod)
## X-squared = 0.012747, df = 1, p-value = 0.9101
We obtain a p-value of 0.9101, which is greater than 0.05. This means that the null hypothesis is not rejected, and thus our model's residuals have no autocorrelation.
All in all, our best performing model is the tbats model. However, the assumption of normally distributed residuals is not yet fulfilled.
Other methods, such as neural networks for time-series forecasting and bagging methods, might be able to fulfill this assumption, providing a more robust prediction and better insight for the US Government. A sketch of one such alternative follows.
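As a sketch only, assuming the same train/test split defined above, the forecast package's nnetar function fits an autoregressive feed-forward neural network that could be evaluated the same way as our other models:

# Sketch only: an autoregressive neural network as an alternative model
set.seed(100)                    # nnetar uses random initial weights
nn_mod  <- nnetar(df_train)      # feed-forward net with lagged values as inputs
pred_nn <- forecast(nn_mod, h = 122)

accuracy(pred_nn$mean, df_test)  # compare against the TBATS error metrics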
Knowledge of future production quantities will be useful for the government in allocating resources between imported and local produce, ensuring that the country's demand is fulfilled.
De Livera, A. M., Hyndman, R. J., & Snyder, R. D. (2011). Forecasting time series with complex seasonal patterns using exponential smoothing. Journal of the American Statistical Association, 106(496), 1513-1527. Retrieved from https://robjhyndman.com/publications/complex-seasonality/
Federal Reserve Bank of St. Louis. (2021). Industrial Production: Manufacturing: Non-Durable Goods: Sugar and Confectionery Product (NAICS = 3113). Retrieved from https://fred.stlouisfed.org/series/IPG3113N