In this project, we will be working with time series data from the Time Series Data Library (FinYang/tsdl), which encompasses a diverse collection of 648 time series across various domains. Specifically, we will focus on the Microeconomic data subset, which includes 36 time series records.
The dataset spans different frequencies—ranging from very high (0.1) to annual (365)—and covers a variety of subjects relevant to time series analysis. For instance, the Microeconomic data features:
27 series with a frequency of 1 (monthly), 1 series with a frequency of 4 (quarterly), 7 series with a frequency of 12 (yearly), 1 series with a frequency of 365 (daily). This rich dataset will be used to develop and assess forecasting models tailored to microeconomic variables, providing insights and predictions relevant for economic analysis and decision-making.
We will explore how to develop a forecasting model for profit using Time Series Analysis, employing four distinct models: ARIMA (AutoRegressive Integrated Moving Average), Holt’s Winter (Exponential Smoothing with Trend and Seasonality), STLF (Seasonal and Trend decomposition using Loess), and TBATS (Trigonometric, Box-Cox, ARMA, Trend, and Seasonal components). By applying these models, we aim to identify the most effective method for predicting future profits based on historical data. This analysis will help us evaluate each model’s performance and accuracy, providing valuable insights for strategic financial planning and decision-making.
library(tidyverse)
library(lubridate)
library(forecast)
library(TTR)
library(fpp)
library(tseries)
library(TSstudio)
library(padr)
library(recipes)
library(tidyquant)
library(ggplot2)
library(tsdl)
options(scipen = 100, max.print = 1e+06)
tsdl
## Time Series Data Library: 648 time series
##
## Frequency
## Subject 0.1 0.25 1 4 5 6 12 13 52 365 Total
## Agriculture 0 0 37 0 0 0 3 0 0 0 40
## Chemistry 0 0 8 0 0 0 0 0 0 0 8
## Computing 0 0 6 0 0 0 0 0 0 0 6
## Crime 0 0 1 0 0 0 2 1 0 0 4
## Demography 1 0 9 2 0 0 3 0 0 2 17
## Ecology 0 0 23 0 0 0 0 0 0 0 23
## Finance 0 0 23 5 0 0 20 0 2 1 51
## Health 0 0 8 0 0 0 6 0 1 0 15
## Hydrology 0 0 42 0 0 0 78 1 0 6 127
## Industry 0 0 9 0 0 0 2 0 1 0 12
## Labour market 0 0 3 4 0 0 17 0 0 0 24
## Macroeconomic 0 0 18 33 0 0 5 0 0 0 56
## Meteorology 0 0 18 0 0 0 17 0 0 12 47
## Microeconomic 0 0 27 1 0 0 7 0 1 0 36
## Miscellaneous 0 0 4 0 1 1 3 0 1 0 10
## Physics 0 0 12 0 0 0 4 0 0 0 16
## Production 0 0 4 14 0 0 28 1 1 0 48
## Sales 0 0 10 3 0 0 24 0 9 0 46
## Sport 0 1 1 0 0 0 0 0 0 0 2
## Transport and tourism 0 0 1 1 0 0 12 0 0 0 14
## Tree-rings 0 0 34 0 0 0 1 0 0 0 35
## Utilities 0 0 2 1 0 0 8 0 0 0 11
## Total 1 1 300 64 1 1 240 3 16 21 648
tdsl_microeconomic <- subset(tsdl,12,"microeconomic")
tdsl_microeconomic
## Time Series Data Library: 7 Microeconomic time series with frequency 12
##
## Frequency
## Subject 12
## Microeconomic 7
micro_ts <- tdsl_microeconomic[[1]]
str(micro_ts)
## Time-Series [1:132] from 1965 to 1976: 52149 47205 82150 100931 98408 ...
## - attr(*, "source")= chr "Abraham & Ledolter (1983)"
## - attr(*, "description")= chr "Monthly U.S. housing starts (privately owned 1-family) 1965 – 1975"
## - attr(*, "subject")= chr "Microeconomic"
summary(micro_ts)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33363 64239 79805 79406 92813 135167
anyNA(micro_ts)
## [1] FALSE
micro_ts
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1965 52149 47205 82150 100931 98408 97351 96489 88830 80876 85750
## 1966 46561 50361 83236 94343 84748 79828 69068 69362 59404 53530
## 1967 40157 40274 66592 79839 87341 87594 82344 83712 78194 81704
## 1968 45234 55431 79325 97983 86806 81424 86398 82522 80078 85560
## 1969 51300 47909 71941 84982 91301 82741 73523 69465 71504 68039
## 1970 33363 41367 61879 73835 74848 83007 75461 77291 75961 79393
## 1971 54856 58287 91584 116013 115627 116946 107747 111663 102149 102882
## 1972 76185 76306 111358 119840 135167 131870 119078 131324 120491 116990
## 1973 77105 73560 105136 120453 131643 114822 114746 106806 84504 86004
## 1974 43292 57593 76946 102237 96340 99318 90715 79782 73443 69460
## 1975 39791 39959 62498 77777 92782 90284 92782 90655 84517 93826
## Nov Dec
## 1965 72351 61198
## 1966 50212 37972
## 1967 69088 47026
## 1968 64819 53847
## 1969 55069 42827
## 1970 67443 69041
## 1971 92904 80362
## 1972 97428 73195
## 1973 70488 46767
## 1974 57898 41041
## 1975 71646 55650
To analyze the seasonality of a time series, two primary methods are used: additive and multiplicative. When categorizing a time series object, three key components are essential for forecasting:
Trend (T): Represents the overall movement of the mean across a given time period. Seasonal (S): Reflects the recurring pattern observed within each seasonal cycle. Error (E): Accounts for the variations that cannot be explained by the trend or seasonal components.
micro_ts %>%
autoplot()
Based on the provided data, the pattern appears to be more additive rather than multiplicative. This is evident from the absence of a clear exponential trend, with the data fluctuating around the mean without significant overall increases or decreases, and the relatively constant seasonal amplitude throughout the observed period.
micro_ts %>% decompose() %>%
autoplot()
The data reveals a consistent upward trend from 1966 until approximately 1973, followed by a steady decline until 1976. A distinct seasonal pattern is evident, with peaks at the beginning of each year and troughs in the middle of the year. The residual component exhibits random fluctuations around zero, with occasional significant deviations.
micro_decom <- decompose(x = micro_ts)
micro_decom$trend %>% autoplot()
The time series data displays a clear cyclical pattern with an annual peak and trough. There is a general upward trend from 1966 to 1972, followed by a downward trend until 1976. The data also exhibits some random fluctuations around these trends.
micro_decom$seasonal %>% autoplot()
The provided image clearly shows a pronounced seasonal pattern with a repeating annual cycle. Peaks typically occur at the beginning of the year, followed by a sharp decline to a low point in the middle of the year, after which the data values rise again to reach the next peak at the start of the following year.
test_micro <- tail(micro_ts, 24)
train_micro <- head(micro_ts, -length(test_micro))
Since our time series data contains both seasonal and trend components, the model will be developed using:
model_arima_ts <- stlm(train_micro, s.window = 12, method = "arima")
forecast_arima_ts <- forecast(model_arima_ts, h= 24)
forecast_arima_ts
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1974 45840.55 38569.56 53111.55 34720.525 56960.58
## Feb 1974 47536.99 38419.64 56654.33 33593.215 61480.76
## Mar 1974 76012.01 65363.80 86660.23 59726.972 92297.05
## Apr 1974 90736.77 78751.65 102721.88 72407.109 109066.42
## May 1974 94720.01 81532.84 107907.19 74551.969 114888.06
## Jun 1974 90939.44 76650.98 105227.90 69087.123 112791.76
## Jul 1974 84710.21 69399.47 100020.95 61294.457 108125.97
## Aug 1974 85296.23 69027.32 101565.13 60415.082 110177.37
## Sep 1974 77313.16 60139.46 94486.85 51048.256 103578.06
## Oct 1974 77835.92 59802.77 95869.07 50256.599 105415.24
## Nov 1974 63682.77 44829.31 82536.23 34848.895 92516.65
## Dec 1974 49120.34 29480.80 68759.88 19084.255 79156.42
## Jan 1975 45840.55 25445.21 66235.90 14648.566 77032.54
## Feb 1975 47536.99 26412.86 68661.11 15230.427 79843.55
## Mar 1975 76012.01 54183.43 97840.60 42628.070 109395.95
## Apr 1975 90736.77 68225.75 113247.78 56309.140 125164.39
## May 1975 94720.01 71546.66 117893.37 59279.429 130160.60
## Jun 1975 90939.44 67122.16 114756.72 54514.053 127364.83
## Jul 1975 84710.21 60265.96 109154.46 47325.955 122094.47
## Aug 1975 85296.23 60240.69 110351.76 46977.087 123615.36
## Sep 1975 77313.16 51660.90 102965.42 38081.409 116544.91
## Oct 1975 77835.92 51600.50 104071.34 37712.310 117959.53
## Nov 1975 63682.77 36876.89 90488.66 22686.706 104678.84
## Dec 1975 49120.34 21755.87 76484.81 7269.994 90970.68
accuracy(forecast_arima_ts, test_micro)
## ME RMSE MAE MPE MAPE MASE
## Training set -62.90601 5620.809 4489.849 -0.4268869 6.051957 0.3159900
## Test set 530.96663 7958.745 6936.612 -1.0578176 10.198215 0.4881901
## ACF1 Theil's U
## Training set -0.01649926 NA
## Test set 0.63277508 0.6409478
test_forecast(actual = micro_ts, forecast.obj = forecast_arima_ts, train = train_micro, test = test_micro)
model_hw_ts <- HoltWinters(train_micro, seasonal = "additive")
forecast_hw_ts <- forecast(model_hw_ts, h= 24)
accuracy(forecast_hw_ts,test_micro)
## ME RMSE MAE MPE MAPE MASE
## Training set -224.9028 6929.402 5499.193 -0.3653901 7.61040 0.3870264
## Test set 40414.5789 47748.584 40530.823 56.5673658 56.83588 2.8525088
## ACF1 Theil's U
## Training set 0.02445684 NA
## Test set 0.82813482 3.078167
test_forecast(actual = micro_ts, forecast.obj = forecast_hw_ts, train = train_micro, test = test_micro)
model_stlf_ts <- stlf(train_micro,h = 24, s.window = 12)
forecast_stlf_ts <- forecast(model_stlf_ts, h= 24)
accuracy(forecast_stlf_ts,test_micro)
## ME RMSE MAE MPE MAPE MASE
## Training set -51.34777 5620.791 4499.587 -0.4047588 6.070721 0.3166753
## Test set 511.81873 7957.490 6936.612 -1.0860601 10.202868 0.4881901
## ACF1 Theil's U
## Training set -0.01668245 NA
## Test set 0.63277508 0.641267
test_forecast(actual = micro_ts, forecast.obj = forecast_stlf_ts, train = train_micro, test = test_micro)
model_tbats_ts <- train_micro %>% tbats(use.box.cox = T,
use.trend = T,
use.damped.trend = T)
forecast_tbats_ts <- forecast(model_tbats_ts, h= 24)
accuracy(forecast_tbats_ts,test_micro)
## ME RMSE MAE MPE MAPE MASE
## Training set -286.4424 5535.963 4350.47 -0.7101978 5.882747 0.3061807
## Test set 13991.7508 16863.352 14083.22 17.0288704 17.240156 0.9911595
## ACF1 Theil's U
## Training set 0.06948397 NA
## Test set 0.67523579 0.981795
test_forecast(actual = micro_ts, forecast.obj = forecast_tbats_ts, train = train_micro, test = test_micro)
Box.test(model_arima_ts$residuals, type = "Ljung-Box", lag = 24)
##
## Box-Ljung test
##
## data: model_arima_ts$residuals
## X-squared = 34.294, df = 24, p-value = 0.07958
shapiro.test(x = model_arima_ts$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_arima_ts$residuals
## W = 0.99551, p-value = 0.9805
hist(model_arima_ts$residuals, breaks = 10)
Box.test(model_stlf_ts$residuals, type = "Ljung-Box", lag = 24)
##
## Box-Ljung test
##
## data: model_stlf_ts$residuals
## X-squared = 34.798, df = 24, p-value = 0.07145
shapiro.test(x = model_stlf_ts$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_stlf_ts$residuals
## W = 0.99528, p-value = 0.9747
hist(model_stlf_ts$residuals, breaks = 10)
Based on the above metrics, ARIMA and STLF are the best models for the test set, with very similar performance. Among them, ARIMA is slightly superior due to lower RMSE and Theil’s U values, though the difference is minimal. Holt-Winters performs the worst among the four models, followed by TBATS.