This project analyzes a time series representing the Gross Domestic Product (GDP) of the United States from January 1947 to April 1990. The data are quarterly.
First, the data are explored to identify trend, seasonality, or cycles in the GDP. Then, the analysis focuses on linear regressions that use trend and seasonality to explain the behavior of the GDP. Finally, the GDP is compared with the exports of the United States to determine whether the two are related.
Our time series represents the GDP of the United States from January 1947 to April 1990. The observations are quarterly. Note that GDP is measured in billions of dollars.
To analyze the graph, we focus on three characteristics of a time series: TREND, SEASONALITY and CYCLES.
-TREND is the general direction of the time series over time. In this case, the series increases steadily over the whole period.
-SEASONALITY is a pattern of fluctuations that repeats regularly within each year.
-BUSINESS CYCLES are economic fluctuations over longer periods, caused by factors such as the phases of the economic cycle, politics, …
The graph shows a positive trend component. For example, from 1970 to 1980 the GDP increased by close to 2 million (in the unit of the series, billions of dollars). There is no clear seasonality within the year. There are also no shocks, anomalous values, or missing data.
To see the components more clearly, we can use other graphical methods such as the ACF plot and the seasonal plot. The ACF (autocorrelation function) measures the correlation between the observations of a time series and the same series lagged by a given number of periods. The ACF plot therefore summarizes the autocorrelation structure of the series.
On the one hand, the ACF decreases slowly as the lag increases. This is caused by the trend, which confirms that the time series has a trend component. On the other hand, the ACF does not show a wave-like pattern, so the series has no seasonal component. To confirm this, we can also build a seasonal plot and a subseries plot.
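As an illustration of why a trend produces a slowly decaying ACF, here is a minimal Python sketch of the sample autocorrelation. It is not the R code behind the plots in this report; the function name `acf` and the linear series are hypothetical, and the GDP data are not used.

```python
def acf(series, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: total sum of squared deviations from the mean.
    denom = sum((y - mean) ** 2 for y in series)
    result = []
    for k in range(max_lag + 1):
        # Numerator: cross-products between the series and its k-step lag.
        num = sum((series[t] - mean) * (series[t - k] - mean)
                  for t in range(k, n))
        result.append(num / denom)
    return result


# Hypothetical series with a pure linear trend (not the GDP data).
trend_series = [float(t) for t in range(100)]
trend_acf = acf(trend_series, 8)

for lag, value in enumerate(trend_acf):
    print(lag, round(value, 3))
```

For a trending series the printed values start at 1 and fall only gradually with the lag, which is the pattern described above. A seasonal quarterly series would instead show peaks at lags 4, 8, and so on.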
The aim of the seasonal plot is to show the seasonal component of the time series. Here, each curve shows the GDP of the United States during one year. There is no clear pattern across quarters within each year.
In the subseries plot, the blue lines represent the mean GDP over all years for each quarter. The average in the second quarter is slightly higher than in the others, but the difference is not large enough to confirm seasonality. The plot can also be used to check for a trend component: all four quarters show an increasing trend.
To sum up, the analyzed time series has a positive trend component but no seasonal component.
One of the main objectives of time series analysis is to obtain a model that forecasts the future as accurately as possible. Four forecasting methods are used:
Now we evaluate which method is best suited to our data.
To evaluate goodness of fit, it is necessary to examine the structure of the residuals. The residuals are the differences between the observed values and the values estimated by the method. If the method is adequate, its residuals are uncorrelated with each other, have mean 0, and have constant variance. Residuals with these characteristics are called WHITE NOISE. Correlation between residuals indicates information not captured by the method, while a mean other than 0 results in biased predictions.
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 247.67, df = 8, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 8
##
## Ljung-Box test
##
## data: Residuals from Random walk with drift
## Q* = 38.923, df = 8, p-value = 5.079e-06
##
## Model df: 0. Total lags used: 8
Looking at the ACF plots of the residuals for both methods, we can say that neither method produces white noise: several ACF lags exceed the bounds (the blue lines), which indicates excessive autocorrelation. The Ljung-Box test also examines autocorrelation in the residuals. Since the p-value of the test is very low for both methods, we can confirm that the residuals of both methods are strongly dependent.
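The Q* value reported by the Ljung-Box test can be sketched as follows. This is a minimal Python illustration assuming the standard form of the statistic, \(Q^* = n(n+2)\sum_{k=1}^{\ell} r_k^2/(n-k)\), and not the R routine that produced the output above. The function name and the residuals are hypothetical.

```python
def ljung_box_q(residuals, lags):
    """Ljung-Box statistic Q* = n (n + 2) * sum_k r_k^2 / (n - k)."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((e - mean) ** 2 for e in residuals)
    total = 0.0
    for k in range(1, lags + 1):
        # Sample autocorrelation of the residuals at lag k.
        r_k = sum((residuals[t] - mean) * (residuals[t - k] - mean)
                  for t in range(k, n)) / denom
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


# Hypothetical, strongly autocorrelated residuals (alternating sign).
alternating = [1.0, -1.0] * 20
q_star = ljung_box_q(alternating, 8)

# 5% critical value of a chi-square distribution with 8 degrees of freedom.
CHI2_95_DF8 = 15.507
print(q_star, q_star > CHI2_95_DF8)
```

With 8 lags and no fitted parameters (Model df: 0), Q* is compared with a chi-square distribution with 8 degrees of freedom. Values far above the critical value, such as the 247.67 and 38.923 reported above, lead to rejecting the white-noise hypothesis.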
Goodness of fit is not necessarily related to the forecasting accuracy of a method. Residuals are not reliable indicators of how large the true forecast errors may be. Predictive accuracy can only be determined by considering the performance of a model on new data that were not used during model fitting. Typically, the data are divided into two parts: a training set, used to fit the model, and a test set, used to evaluate its forecasts.
From this graph it is not clear which method is more accurate. The methods can be compared using two indices, the RMSE and the MAE,
where T is the length of the TRAINING SET, h the length of the TEST SET, and \(e\) the forecast error:
\(e_{T + h} = y_{T+h} - \tilde{y}_{T + h}\)
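The two indices are the RMSE and MAE reported in the table below. The sketch assumes their standard definitions over the test-set errors: the RMSE is the square root of the mean of \(e^2\), and the MAE is the mean of \(|e|\). The Python code is only an illustration with hypothetical numbers, not the computation used in this report.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error of the forecasts over the test set."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(actual, forecast):
    """Mean absolute error of the forecasts over the test set."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical test-set values and forecasts (not the GDP data).
test_values = [10.0, 12.0, 14.0, 16.0]
forecasts = [11.0, 11.0, 15.0, 13.0]

print(rmse(test_values, forecasts), mae(test_values, forecasts))
```

Because the errors are squared before averaging, the RMSE penalizes a few large errors more heavily than the MAE does, so the two indices can rank methods differently.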
| Method | RMSE | MAE |
|---|---|---|
| Seasonal naive | 160698.50 | 154448.5 |
| Drift | 81360.89 | 67610.0 |
Both RMSE and MAE indicate that the most suitable method for forecasting is the drift method. This makes sense because the dominant component of our time series is the trend.
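To make the difference between the two methods concrete, the following Python sketch implements their usual textbook definitions. The drift method extends the line through the first and last training observations, while the seasonal naive method repeats the last observed year. The function names and the toy series are hypothetical.

```python
def drift_forecast(train, horizon):
    """Extend the straight line joining the first and last observations."""
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + h * slope for h in range(1, horizon + 1)]


def seasonal_naive_forecast(train, horizon, period=4):
    """Repeat the last observed season (4 quarters by default)."""
    last_season = train[-period:]
    return [last_season[(h - 1) % period] for h in range(1, horizon + 1)]


# Hypothetical quarterly series with a steady upward trend (not the GDP data).
train = [100.0, 104.0, 108.0, 112.0, 116.0, 120.0, 124.0, 128.0]

drift = drift_forecast(train, 4)
snaive = seasonal_naive_forecast(train, 4)
print(drift)
print(snaive)
```

On a trending series without seasonality, the seasonal naive forecasts stay at last year's level, while the drift forecasts keep rising. This is consistent with the lower errors of the drift method in the table above.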
Functional decomposition identifies the components of the series and analyzes them separately. It helps to reveal patterns that are not otherwise evident and to produce more accurate forecasts.
Each panel shows the weight of a component over the time series. The more a panel resembles the plot of the data, the more important that component is. In our case, the trend component carries substantial weight in the time series, whereas the seasonal and residual components do not.
To build a regression model, trend and seasonality can be used as predictors:
\(y_t = \beta_0 + \beta_1 t + \beta_2 d_{2,t} + \beta_3 d_{3,t} + \beta_4 d_{4,t} + \epsilon_t\)
where \(t\) is the trend; \(d_{2,t}\) is a dummy variable equal to 1 when the observation comes from the second quarter; \(d_{3,t}\) equals 1 when it comes from the third quarter; and \(d_{4,t}\) equals 1 when it comes from the fourth quarter.
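The regressors of this model can be built explicitly. The Python sketch below constructs one design row \([1, t, d_{2,t}, d_{3,t}, d_{4,t}]\) per observation and estimates the coefficients by ordinary least squares through the normal equations. It assumes that the first observation falls in the first quarter. All names and data are hypothetical, and this is not the `tslm` fit used in the report.

```python
def design_row(t, period=4):
    """Row [1, t, d2, d3, d4] for a 1-based time index t starting in quarter 1."""
    quarter = (t - 1) % period + 1
    dummies = [1.0 if quarter == q else 0.0 for q in range(2, period + 1)]
    return [1.0, float(t)] + dummies


def least_squares(X, y):
    """Solve the normal equations X'X b = X'y by Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical data: y = 5 + 2 t, plus 3 in every second quarter.
times = list(range(1, 13))
X = [design_row(t) for t in times]
y = [5.0 + 2.0 * t + (3.0 if (t - 1) % 4 + 1 == 2 else 0.0) for t in times]

beta = least_squares(X, y)
print([round(b, 6) for b in beta])
```

The first quarter has no dummy variable, so it acts as the baseline: each seasonal coefficient measures the difference between its quarter and the first quarter, after accounting for the linear trend.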
This model captures the seasonal component well, as indicated by the plot of residuals against time and by the ACF plot, which does not spike every 4 lags. However, it does not capture the trend well: the trend was modeled as linear, while the series exhibits changes in trend. In fact, the residual plot shows structure, and the autocorrelations at lags 1 and 2 are excessive. The Breusch-Godfrey test also indicates autocorrelation in the residuals, with a p-value below 0.05.
##
## Breusch-Godfrey test for serial correlation of order up to 8
##
## data: Residuals from Linear regression model
## LM test = 83.258, df = 8, p-value = 1.077e-14
Goodness of fit can be considered poor, because the very low p-value indicates that significant correlation remains in the residuals. We therefore need to assess the model's forecasting performance by comparing it with the drift method.
Now we compute the forecast errors of this model:
| Model | RMSE | MAE |
|---|---|---|
| MODEL: TREND AND SEASONALITY | 163670.21 | 137107.8 |
| DRIFT MODEL | 81360.89 | 67610.0 |
As can be seen, the errors of the drift model are lower than those of the model that uses trend and seasonality as predictors.
A non-linear trend can be modeled using piecewise linear functions. This type of function changes its slope at knots. The number of knots and their positions are chosen from the trend component of the series.
Knots must be placed at the points where the trend changes direction. The most significant ones are in 1949 and 1961.
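A piecewise linear trend is usually built by adding one regressor of the form \(\max(0, t - \tau)\) for each knot (node) \(\tau\). The Python sketch below shows that construction. It assumes, for illustration only, that the knots fall in the first quarter of 1949 and of 1961; the text gives only the years, and the function names are hypothetical.

```python
def quarter_index(year, quarter, start_year=1947):
    """1-based index of a quarter in a series starting in Q1 of start_year."""
    return (year - start_year) * 4 + quarter


def piecewise_row(t, knots):
    """Regressors [1, t, (t - k1)+, (t - k2)+, ...] for time index t."""
    return [1.0, float(t)] + [float(max(0, t - k)) for k in knots]


# Assumed knot positions: first quarter of 1949 and of 1961.
knots = [quarter_index(1949, 1), quarter_index(1961, 1)]

# Rows before the first knot, between the knots, and after the second knot.
rows = [piecewise_row(t, knots) for t in (5, 10, 60)]
for row in rows:
    print(row)
```

Each extra regressor is zero before its knot and grows linearly after it, so its coefficient is the change in slope at that knot. The fitted trend stays continuous.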
As the graph shows, this model appears to fit the data well. However, this has to be checked with the residuals:
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals from Linear regression model
## LM test = 71.083, df = 10, p-value = 2.738e-11
The Breusch-Godfrey test gives a low p-value, so we reject the hypothesis of no autocorrelation in the residuals.
Another regression model is the cubic spline. We can fit it and evaluate how well it fits the data:
##
## Breusch-Godfrey test for serial correlation of order up to 12
##
## data: Residuals from Linear regression model
## LM test = 66.629, df = 12, p-value = 1.361e-09
Goodness of fit for the cubic spline model is better than for the piecewise model. However, the accuracy of the forecasts must still be evaluated in each case.
From the graphs it is not clear which model is more accurate. To decide, we compute the error indices:
| Model | RMSE | MAE |
|---|---|---|
| Piecewise | 168810.2 | 134283.4 |
| Cubic Spline | 268691.4 | 234483.0 |
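The errors confirm that the piecewise model is more accurate than the cubic spline model. However, compared with the errors reported earlier for the drift method (RMSE 81360.89, MAE 67610.0), it is not more accurate than the drift method.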
So far, we have used components of the time series itself as regressors. However, it is also possible to use other time series and model the relationship between the variables to estimate a regression model.
In this case, the independent variable is the exports of the United States over the same period.
The graph shows a positive trend. To study seasonality, in this case we use a polar seasonal plot.
As can be seen, there is no seasonal component in this time series, just as in the GDP series.
### SPURIOUS REGRESSION
For non-stationary time series, a regression can turn out to be spurious. To avoid this, it is advisable to make the series stationary first.
Above are the time series with the trend and seasonal components removed. Below is the correlation between the two series. Since it is positive (about 0.23), higher GDP in the United States tends to go with higher exports, although the association is weak. This makes sense because exports contribute to the GDP of a country.
| | GDP | EXP |
|---|---|---|
| GDP | 1.0000000 | 0.2288616 |
| EXP | 0.2288616 | 1.0000000 |
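The off-diagonal entries of this matrix are Pearson correlation coefficients. As a reminder of how such a coefficient is obtained, here is a minimal Python sketch with hypothetical data. It is not the computation performed on the detrended GDP and export series.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical series (not the GDP or export data).
x = [1.0, 2.0, 3.0, 4.0]
r_up = pearson(x, [2.0, 4.0, 6.0, 8.0])    # moves exactly with x
r_down = pearson(x, [8.0, 6.0, 4.0, 2.0])  # moves exactly against x
print(r_up, r_down)
```

The coefficient lies between -1 and 1, so a value of about 0.23 indicates a positive but weak linear association.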
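The regression model is estimated on a training set, using the exports of the United States as the only predictor. According to the estimated coefficient, an increase of 1 billion dollars in exports is associated with an increase of around 0.39 billion dollars in GDP, although the coefficient is not statistically significant (p-value 0.509).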
##
## Call:
## tslm(formula = gdp_st_tr ~ exp_st_tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47622 -10977 616 11605 41137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -466.9764 1836.4234 -0.254 0.800
## exp_st_tr 0.3941 0.5946 0.663 0.509
##
## Residual standard error: 17030 on 84 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.005203, Adjusted R-squared: -0.00664
## F-statistic: 0.4393 on 1 and 84 DF, p-value: 0.5093
As seen above, the adjusted R-squared is very low (-0.00664).
Now we analyze the residuals:
##
## Breusch-Godfrey test for serial correlation of order up to 8
##
## data: Residuals from Linear regression model
## LM test = 23.078, df = 8, p-value = 0.003266
The Breusch-Godfrey test gives a p-value of 0.003266. This is much higher than for the previous models, but it is still below 0.05, so some autocorrelation remains in the residuals. In addition, the residuals do not follow a normal distribution.
Once goodness of fit has been evaluated, the model can be used for forecasting. The forecasts are produced for the stationary series, but our aim is to forecast the original one, so trend and seasonality must be added back.
Below are the forecasts compared with the test set. Computing the trend component requires moving averages, and at the beginning and end of the series there are not enough observations to compute them. In this case, a moving average of moving averages (2×4-MA) was calculated to obtain a centered one.
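A 2×4-MA is equivalent to a weighted average over five consecutive quarters with weights 1/8, 1/4, 1/4, 1/4, 1/8. The Python sketch below illustrates this and shows why the first two and last two trend values are missing. The function name and the data are hypothetical.

```python
def centered_ma_2x4(series):
    """2x4-MA: a 4-term moving average followed by a 2-term moving average."""
    weights = [0.125, 0.25, 0.25, 0.25, 0.125]
    n = len(series)
    trend = [None] * n  # no estimate for the first two and last two quarters
    for t in range(2, n - 2):
        trend[t] = sum(w * series[t + j - 2] for j, w in enumerate(weights))
    return trend


# Hypothetical quarterly series: linear trend plus a zero-sum seasonal pattern.
seasonal = [1.0, -1.0, 2.0, -2.0]
series = [float(t) + seasonal[t % 4] for t in range(12)]

trend = centered_ma_2x4(series)
print(trend)
```

Every quarter of the year receives the same total weight of 1/4, so a stable quarterly pattern averages out and only the trend remains. The missing values at both ends explain the observations that are dropped, such as the 2 observations deleted due to missingness in the regression output above.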
| Model | RMSE | MAE |
|---|---|---|
| new_mod | 30854.17 | 23456.54 |
| Piecewise | 168810.16 | 134283.36 |
| Cubic Spline | 268691.39 | 234482.96 |
It can be concluded that the new model is the most accurate of the three.
There are other methods that differ from classical decomposition and overcome some of its limitations.
#### X11 DECOMPOSITION
This type of decomposition is used only with monthly or quarterly data, and separates the series into additive components.
X11 DECOMPOSITION is more accurate in decomposing the trend (the residual component carries less weight).
It is used for monthly or quarterly time series, and separates the series into multiplicative components.
X13 DECOMPOSE and MULTIPLICATIVE DECOMPOSE separate the seasonal component in the same way, but the trend component differently.