Exercises from Hyndman & Athanasopoulos, Forecasting: Principles and Practice, Chapter 9 (ARIMA models). All R code is collected at the end for clarity.
Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.
Left: ACF for a white noise series of 36 numbers. Middle: ACF for a white noise series of 360 numbers. Right: ACF for a white noise series of 1,000 numbers.
The autocorrelation plots differ because of (1) sampling variation in the estimated ACFs and (2) differences in sample size.
Because white noise consists of i.i.d. observations, its sample autocorrelations converge to zero by the law of large numbers. According to Section 2.9 on White Noise, roughly 95% of the spikes should lie inside the confidence bands.
All three time series look like white noise because their autocorrelations stay inside the confidence bands.
Section 2.9 on White Noise states that 95% of the spikes in the ACF should lie within \(\pm 2/ \sqrt{T}\), where \(T\) is the length of the time series. Since our three series have lengths \(T = 36, 360, 1000\), the widths of the bands differ.
## [1] "T=36 band upper bound: 0.333"
## [1] "T=360 band upper bound: 0.105"
## [1] "T=1000 band upper bound: 0.063"
As the time series length increases, the confidence bands shrink.
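To see this directly, here is a minimal sketch (assuming fpp3 and cowplot are loaded, as in the appendix; the seed is an arbitrary choice) that regenerates three white noise series and plots their ACFs:
# White noise ACFs at three sample sizes; the +/- 2/sqrt(T) bands drawn by
# autoplot() visibly narrow as T grows.
library(fpp3)
library(cowplot)
set.seed(1)  # arbitrary seed, for reproducibility
wn <- function(T) tsibble(t = 1:T, y = rnorm(T), index = t)
plots <- lapply(c(36, 360, 1000), function(T) {
  wn(T) %>% ACF(y) %>% autoplot() + labs(title = sprintf("T = %d", T))
})
plot_grid(plotlist = plots, ncol = 3)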
A classic example of a non-stationary series is stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.
The AMZN stock price series looks like a random walk with a positive trend. We observe a trend reversal in the latter half of 2018, but the longer-term upward trend is obvious. The ACF plot shows high, slowly decaying positive autocorrelation, consistent with a random walk with drift.
The trend in the plots above suggests the AMZN price series needs to be differenced, and the increasing variance suggests a variance-stabilizing transformation such as a log transform.
When we apply these transformations,
\[r_t = \log(P_t) - \log(P_{t-1})\]
and run the resulting diagnostic plots on \(r_t\), we get much more stationary behavior: the series looks almost like white noise, with the ACF and PACF spikes nearly all inside the bands. We note that AMZN's log returns still show fat tails, consistent with large surprises.
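A minimal sketch of this transformation (mirroring the appendix code; assumes fpp3 is loaded):
# Compute daily log returns r_t = log(P_t) - log(P_{t-1}) for AMZN and
# display the series with its ACF and PACF.
amzn_returns = gafa_stock %>% filter(Symbol == "AMZN") %>%
  mutate(r = difference(log(Adj_Close)))
amzn_returns %>% gg_tsdisplay(r, plot_type = "partial") + labs(title = "AMZN daily log returns")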
The AMZN ACF and PACF charts tell us the series is not stationary because they resemble the ACF and PACF of a random walk with drift. To illustrate, we simulate a trending series \(X_t\) and display its plots.
The series is defined by:
\[X_t = t + \epsilon_t,\] where the \(\epsilon_t\) are independent standard normal random variables. (Strictly speaking, this is a deterministic trend plus noise rather than a random walk with drift, but its sample ACF and PACF show the same slowly decaying signature.)
g = tsibble( t = 1:500, u = rnorm(500, 0, 1), index = t )
g = g %>% mutate( v = t + u )
g %>% gg_tsdisplay(v, plot_type = "partial") + labs(title="Random Walk w Drift")
We observe in the chart above that the ACF decays slowly and the PACF is insignificant at all lags after lag 1. Since a random walk with drift is not stationary, we conclude that AMZN is likewise not stationary without differencing.
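For comparison, a true random walk with drift accumulates its shocks, \(X_t = c + X_{t-1} + \epsilon_t\). A minimal sketch (the seed and the drift \(c=1\) are arbitrary choices) shows the same slowly decaying ACF:
# Simulate X_t = 1 + X_{t-1} + e_t via a cumulative sum of (1 + e_t).
set.seed(42)
rw = tsibble(t = 1:500, x = cumsum(1 + rnorm(500)), index = t)
rw %>% gg_tsdisplay(x, plot_type = "partial") + labs(title = "True random walk with drift, c = 1")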
For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.
- Turkish GDP from global_economy.
- Accommodation takings in the state of Tasmania from aus_accommodation.
- Monthly sales from souvenirs.

We observe that Turkey's nominal GDP displays an exponential growth trend punctuated by business-cycle recessions and booms.
We observe that the 1960-1961 change in log GDP is -55%, an implausibly severe one-year collapse. We regard this event as an outlier: from Erik Zurcher's Turkey: A Modern History, 4th edition (Chapters 13, 14), this period coincided with the May 1960 military coup and its economic aftermath, so the drop reflects disruption in the measured series rather than a genuine halving of output.
For these reasons, I choose to drop 1960 GDP from consideration and begin the series at 1961. The first differences are then observed starting in 1962.
Next, we test the first-differenced time series for stationarity. We run two unit root tests: KPSS and ADF.
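The calls are sketched below (mirroring the appendix code; turkey_gdp2 holds the post-1960 series with its log first difference dlogGDP). Note the opposite null hypotheses: KPSS takes stationarity as the null, while ADF takes a unit root as the null.
turkey_gdp2 %>% features(dlogGDP, unitroot_kpss)  # KPSS: H0 = stationary
tseries::adf.test(na.omit(turkey_gdp2$dlogGDP))   # ADF:  H0 = unit root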
## # A tibble: 1 x 3
## Country kpss_stat kpss_pvalue
## <fct> <dbl> <dbl>
## 1 Turkey 0.100 0.1
## Warning in tseries::adf.test(turkey_gdp2$dlogGDP[2:57]): p-value smaller than
## printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: turkey_gdp2$dlogGDP[2:57]
## Dickey-Fuller = -4.5696, Lag order = 3, p-value = 0.01
## alternative hypothesis: stationary
We conclude from both tests that \(\Delta \log(GDP)\) is likely stationary, at least as an approximation.
We clearly observe seasonality, trend and non-constant variance in the graph above; the series grows more volatile over time. Therefore, we will use a Box-Cox transformation to stabilize the variance before differencing to remove the seasonality and trend. Using the Guerrero method we get a Box-Cox \(\lambda = -0.0488\).
## [1] -0.04884781
Taking a further look with a gg_season plot, we see a lag-4 (quarterly) seasonality in the transformed series \(Y_t\) below:
To remove the seasonality, we apply \(1-L^4\); to remove the trend, we apply \(1-L\). The resulting series \(X_t\) is defined below in terms of \(Y_t\):
\[X_t = (1-L^4)(1-L)Y_t = ( Y_{t} - Y_{t-4} ) - ( Y_{t-1} - Y_{t-5})\]
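A quick sketch verifying that the nested difference() calls used in the appendix implement exactly this expansion (assumes tas_taking with its Box-Cox column bcTakings from the appendix code):
chk = tas_taking %>%
  mutate( X1 = difference(difference(bcTakings, lag = 4), lag = 1),
          X2 = (bcTakings - lag(bcTakings, 4)) - (lag(bcTakings, 1) - lag(bcTakings, 5)) )
all.equal(chk$X1, chk$X2)  # TRUE (the leading NAs match as well)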
Plotting it, we see that \(X_t\) looks stationary. Does it pass the statistical tests?
The KPSS test shows \(X_t\) is stationary:
## # A tibble: 1 x 3
## State kpss_stat kpss_pvalue
## <chr> <dbl> <dbl>
## 1 Tasmania 0.0467 0.1
The ADF test shows \(X_t\) to be stationary with k=6 lags. (By default \(k\) is lower, so we are running a stricter test.)
## Warning in tseries::adf.test(tas_deseasoned$X[!is.na(tas_deseasoned$X)], : p-
## value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: tas_deseasoned$X[!is.na(tas_deseasoned$X)]
## Dickey-Fuller = -4.1507, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
We observe in the raw data plot of souvenirs sales an upward trend, strong annual seasonality, and variance that grows with the level of the series.
The detrended, deseasoned log Sales series appears stationary according to the KPSS test. Specifically, if \(Y_t = \log(Sales_t)\) and \(X_t\) is the series we test:
\[X_t = (1-L)(1-L^{12})Y_t\]
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0381 0.1
We confirm this by running the ADF test with \(k=6\) lags:
##
## Augmented Dickey-Fuller Test
##
## data: souv$dS[!is.na(souv$dS)]
## Dickey-Fuller = -4.7674, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
For your retail data (from Exercise 8 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.
Using our own seed, we analyze the pharmaceutical, cosmetic and toiletry goods retail sector for Western Australia. A few rows of the data are shown below; Turnover is measured in millions of AUD.
The line plot below overlays the Turnover time series with a red trend line constructed using the STL decomposition from the fpp3 package. We observe an upward trend from 1982-2005, followed by a steeper trend from 2005-2019.
In my previous assignment, I proposed using the Guerrero method to obtain a Box-Cox \(\lambda = 0.141\).
## [1] 0.1414312
The resulting transformed series \(Y_t\), shown below, has relatively stable variance.
To remove seasonality, we apply a lag-12 (12-month) difference; to remove the trend, we apply a lag-1 (1-month) difference. This gives a new series \(X_t\) that looks stationary and should satisfy the stationarity tests.
\[X_t = (1-L)(1-L^{12}) Y_t\]
To the naked eye, the series above looks like white noise, which is stationary.
Applying the KPSS test, we see the time series \(X_t\) appears stationary.
## # A tibble: 1 x 4
## State Industry kpss_stat kpss_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 Western Austr… Pharmaceutical, cosmetic and toiletry go… 0.0133 0.1
Applying the ADF test to \(X_t\), we reach the same conclusion: \(X_t\) is stationary. (The test must be run on the NA-stripped values, myseries$X[!is.na(myseries$X)], rather than on the logical vector !is.na(myseries$X) shown in the output below; the appendix code uses the corrected call.)
## Warning in tseries::adf.test(!is.na(myseries$X)): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: !is.na(myseries$X)
## Dickey-Fuller = -8.9329, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
By contrast, if we apply the KPSS and ADF tests to the original Turnover time series, they show it is non-stationary.
## # A tibble: 1 x 4
## State Industry kpss_stat kpss_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 Western Austr… Pharmaceutical, cosmetic and toiletry go… 6.60 0.01
##
## Augmented Dickey-Fuller Test
##
## data: myseries$Turnover
## Dickey-Fuller = -1.388, Lag order = 7, p-value = 0.8364
## alternative hypothesis: stationary
Simulate and plot some data from simple ARIMA models.
Use the following R code to generate data from an AR(1) model with \(\phi_1=0.6\) and \(\sigma^2=1\). The process starts with \(y_1=0\).
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
Produce a time plot for the series. How does the plot change as you change \(\phi_1\)?
Write your own code to generate data from an MA(1) model with \(\theta_1=0.6\) and \(\sigma^2=1\).
Produce a time plot for the series. How does the plot change as you change \(\theta_1\)?
Generate data from an ARMA(1,1) model with \(\phi_1=0.6\) , \(\theta_1=0.6\) and \(\sigma^2=1\).
Generate data from an AR(2) model with \(\phi_1=−0.8\), \(\phi_2=0.3\) and \(\sigma^2=1\). (Note that these parameters will give a non-stationary series.)
Graph the latter two series and compare them.
We run the code as provided to simulate the AR(1) model.
set.seed(100)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
The simulated output is stored in a tsibble called sim.
Plotting the series we get:
We experiment with plotting AR(1) time series using four values of \(\phi_1=0.05, 0.6, 0.99, 1.05\).
The same white noise process \(\epsilon_t\) is used for all four plots below.
set.seed(100)
y <- numeric(100)
w <- numeric(100)
u <- numeric(100)
z <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
{
y[i] <- 0.6 * y[i-1] + e[i]
w[i] <- 0.05 * w[i-1] + e[i]
u[i] <- 0.99 * u[i-1] + e[i]
z[i] <- 1.05 * z[i-1] + e[i]
}
sim <- tsibble(idx = seq_len(100), y = y, w = w, u = u, z = z, index = idx)
We observe that:
- With \(\phi_1=0.05\), the series is nearly indistinguishable from white noise.
- With \(\phi_1=0.6\), the series shows mild persistence around its mean.
- With \(\phi_1=0.99\), the series wanders like a random walk, with long excursions from zero.
- With \(\phi_1=1.05\), the process is explosive and the values grow without bound.
We are asked to generate an MA(1) model with \(\theta_1=0.6\) and \(\sigma^2=1\).
set.seed(100)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
Let’s try \(\theta_1 = -1, -\frac{1}{2}, 0.05, \frac{1}{2}, 0.99, 1.3\) for the \(MA(1)\) model:
set.seed(100)
y <- numeric(100)
w <- numeric(100)
u <- numeric(100)
z <- numeric(100)
r <- numeric(100)
s <- numeric(100)
e <- rnorm(100)
thetas = c( -1.0, -0.5, 0.05 , 0.5, 0.99 , 1.3 )
for(i in 2:100)
{
u[i] <- thetas[1] * e[i-1] + e[i]
w[i] <- thetas[2] * e[i-1] + e[i]
y[i] <- thetas[3] * e[i-1] + e[i]
z[i] <- thetas[4] * e[i-1] + e[i]
r[i] <- thetas[5] * e[i-1] + e[i]
s[i] <- thetas[6] * e[i-1] + e[i]
}
sim <- tsibble(idx = seq_len(100), u = u, w = w, y = y, z = z, r = r, s = s, index = idx)
The sample standard deviations of the six simulated series are:
## u w y z r s
## 1.618886 1.254229 1.007038 1.009586 1.224021 1.433917
We observe that:
- All six realizations look stationary; an MA(1) process is stationary for any \(\theta_1\).
- The variability of the series broadly grows with \(|\theta_1|\), in line with the theoretical standard deviation \(\sqrt{1+\theta_1^2}\,\sigma\) (see the sketch below).
- Negative \(\theta_1\) produces a choppier, more rapidly alternating path than positive \(\theta_1\).
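For reference, the theoretical standard deviation of an MA(1) with \(\sigma=1\) is \(\sqrt{1+\theta_1^2}\); a one-line sketch compares it with the sample values above:
thetas = c(-1.0, -0.5, 0.05, 0.5, 0.99, 1.3)
round(sqrt(1 + thetas^2), 3)  # 1.414 1.118 1.001 1.118 1.407 1.640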
The plot below shows an \(ARMA(1,1)\) model realization with \(\theta_1 = 0.6\) and \(\phi_1 = 0.6\).
\[Y_t = 0.6 \cdot Y_{t-1} + 0.6 \cdot \epsilon_{t-1} + \epsilon_{t}\]
set.seed(100)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6 * y[i-1] + 0.6*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
ARMA(1,1) realization
| idx | y |
|---|---|
| 1 | 0.000 |
| 2 | -0.170 |
| 3 | -0.102 |
| 4 | 0.778 |
| 5 | 1.116 |
| 6 | 1.058 |
We generate an AR(2) model with non-stationary coefficients.
\[X_{t} = -0.8 \cdot X_{t-1} + 0.3 \cdot X_{t-2} + \epsilon_{t}\]
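A quick check of the AR characteristic polynomial \(1 - \phi_1 z - \phi_2 z^2 = 1 + 0.8z - 0.3z^2\) confirms the non-stationarity (a sketch using base R's polyroot; coefficients are given in increasing order of power):
abs(polyroot(c(1, 0.8, -0.3)))  # moduli ~0.93 and ~3.59: one root inside the unit circle => non-stationary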
set.seed(100)
zz <- numeric(100)
ez <- rnorm(100)
for(i in 3:100)
zz[i] <- -0.8 * zz[i-1] + 0.3*zz[i-2] + ez[i]
simz <- tsibble(idx = seq_len(100), zz = zz, index = idx)
AR(2) realization
| idx | zz |
|---|---|
| 1 | 0.000 |
| 2 | 0.000 |
| 3 | -0.079 |
| 4 | 0.950 |
| 5 | -0.667 |
| 6 | 1.137 |
The AR(2) realization oscillates around zero with rapidly growing amplitude: it has no trend (zero mean) but non-constant variance, and the growth in amplitude looks exponential, as expected for a non-stationary AR(2).
In contrast, the ARMA(1,1) realization has roughly constant variance and no trend; any apparent local trends are consistent with its autocorrelation structure rather than genuine drift.
Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.
Before using the ARIMA() function on the raw data series, we do exploratory data analysis by visualizing the air passenger time series in 4 plots labeled P1-P4.
The light color plots P1, P3 use the raw data series: Aus Air Passengers in Millions from 1970-2016. P1 shows raw data overlaid by a red trend line using the STL decomposition method. P3 shows the STL decomposition components. The dark themed plots P2, P4 use the log transformed data series: log(Aus Air Passengers in Millions). Their purpose is similar to P1 and P3 respectively.
Observations:
- The raw series (P1, P3) grows at an increasing rate, and its variance grows with the level of the series.
- The log-transformed series (P2, P4) shows an approximately linear trend, consistent with roughly constant percentage growth.
These observations suggest a good ARIMA model should forecast exponential future growth.
Returning to the task, the ARIMA() function selects an ARIMA(0,2,1) model as its optimal choice below.
fit_default = ausp %>% model(ARIMA(Passengers))
report(fit_default)
## Series: Passengers
## Model: ARIMA(0,2,1)
##
## Coefficients:
## ma1
## -0.8963
## s.e. 0.0594
##
## sigma^2 estimated as 4.308: log likelihood=-97.02
## AIC=198.04 AICc=198.32 BIC=201.65
The KPSS test suggests that the twice-differenced Passengers time series is stationary.
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0361 0.1
We plot the diagnostics first.
The innovation residuals pass the Ljung-Box test, so we conclude they look like white noise.
## # A tibble: 1 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(Passengers) 6.70 0.754
Lastly, the 10-year model forecasts are shown below. We observe that the mean forecast is an increasing straight line: the ARIMA(0,2,1) forecast function essentially extrapolates the local linear trend at the end of the series. Thus, the default ARIMA model produces linear, not exponential, growth forecasts.
The above “default” fitted ARIMA model ARIMA(0,2,1) can be written in terms of the lag (a.k.a. backshift) operator. Let \(Y_t\) denote the raw passengers time series.
The left-hand side shows the degree of differencing, \(d=2\). There is no AR component; if there were, the AR(\(p\)) terms would also appear on the LHS.
The right-hand side shows the order of the MA(\(q\)) component, \(q=1\). So we get:
\[ (1-B)^2 Y_t = (1 + \theta_1B)\epsilon_t \] This can be rewritten to isolate \(Y_t\) on the LHS:
\[ \begin{eqnarray} (1-2B +B^2)Y_t &=& \epsilon_t + \theta_1 B\epsilon_t \\ Y_t -2Y_{t-1}+Y_{t-2} &=& \epsilon_t + \theta_1 \epsilon_{t-1}\\ Y_t &=& 2Y_{t-1} - Y_{t-2} + \theta_{1}\epsilon_{t-1} + \epsilon_{t} \\ Y_t &=& 2Y_{t-1} - Y_{t-2} - 0.8963 \epsilon_{t-1} + \epsilon_{t} \\ \end{eqnarray} \]
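To see why this recursion forecasts a straight line, here is a sketch iterating the mean forecast with all future shocks set to zero (the two starting values are hypothetical; the MA term only matters at the first step, and we drop it for simplicity):
y_prev2 = 68.0; y_prev1 = 72.6  # hypothetical last two observations
f = numeric(10)
for (h in 1:10) {
  f[h] = 2 * y_prev1 - y_prev2  # E[eps] = 0, so the MA term drops out
  y_prev2 = y_prev1; y_prev1 = f[h]
}
f  # each step adds the same increment (72.6 - 68.0 = 4.6): a linear trend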
We will fit an ARIMA(0,1,0) model with drift below, using log(Passengers) as the input variable. This model is a random walk with drift on the log scale; equivalently, an exponential growth model for Passengers at discrete time steps.
fit_log = ausp %>% model(ARIMA(log(Passengers) ~ 1 + pdq(0,1,0) ) )
report(fit_log)
## Series: Passengers
## Model: ARIMA(0,1,0) w/ drift
## Transformation: log(Passengers)
##
## Coefficients:
## constant
## 0.0499
## s.e. 0.0104
##
## sigma^2 estimated as 0.005089: log likelihood=56.69
## AIC=-109.38 AICc=-109.1 BIC=-105.72
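Since the drift is estimated on the log scale, it implies a constant multiplicative growth rate:
exp(0.0499)  # ~1.0512: roughly 5.1% passenger growth per year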
We see that the transformed series passes the unit root test for stationarity below.
ausp %>% mutate(diff_logpass = difference(log(Passengers) )) %>%
features( diff_logpass, unitroot_kpss )
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0516 0.1
We also see that the residual diagnostics look good.
And the model residuals pass the white noise test.
## # A tibble: 1 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(log(Passengers) ~ 1 + pdq(0, 1, 0)) 8.19 0.610
Finally, plotting the forecast of this model produces the following plot:
We draw comparisons with the previous model from part a.
Now we consider an ARIMA(2,1,2) model with drift on the log(Passengers).
fit_arima212c = ausp %>% model(ARIMA(log(Passengers) ~ 1 + pdq(2,1,2) ) )
report(fit_arima212c)
## Series: Passengers
## Model: ARIMA(2,1,2) w/ drift
## Transformation: log(Passengers)
##
## Coefficients:
## ar1 ar2 ma1 ma2 constant
## 0.8362 -0.2286 -1.0767 0.0767 0.0197
## s.e. 0.8981 0.5910 0.9354 0.9333 0.0006
##
## sigma^2 estimated as 0.004391: log likelihood=60.83
## AIC=-109.67 AICc=-107.52 BIC=-98.7
Since we already know from part c that \(\Delta \log(Passengers)\) is stationary, we only need to verify that the residuals look like white noise via the Ljung-Box test. They pass.
## # A tibble: 1 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ARIMA(log(Passengers) ~ 1 + pdq(2, 1, 2)) 6.70 0.754
The visual inspection of diagnostic plots below concurs with the Ljung-Box test result.
Now, we plot the resulting forecasts below.
Let’s compare the model outputs in d. with the models from a. and c.
If we remove the constant drift term and fit the ARIMA(2,1,2) model, we get an error: the resulting model has a non-stationary AR component, so it cannot be used for forecasting.
fit_arima212 = ausp %>% model(ARIMA(log(Passengers) ~ 0 + pdq(2,1,2) ) )
## Warning: 1 error encountered for ARIMA(log(Passengers) ~ 0 + pdq(2, 1, 2))
## [1] non-stationary AR part from CSS
report(fit_arima212)
## Series: Passengers
## Model: NULL model
## Transformation: log(Passengers)
## NULL model
We observe that adding a constant to the ARIMA(0,2,1) model induces a quadratic or higher-order polynomial trend. The package authors discourage this.
fit_arima021c = ausp %>% model(ARIMA(Passengers ~ 1 + pdq(0, 2, 1) ) )
## Warning: Model specification induces a quadratic or higher order polynomial trend.
## This is generally discouraged, consider removing the constant or reducing the number of differences.
report(fit_arima021c)
## Series: Passengers
## Model: ARIMA(0,2,1) w/ poly
##
## Coefficients:
## ma1 constant
## -1.0000 0.0571
## s.e. 0.0585 0.0213
##
## sigma^2 estimated as 3.855: log likelihood=-95.1
## AIC=196.21 AICc=196.79 BIC=201.63
Looking at the 10-year-ahead forecast, the mean forecast follows an upward, non-linear path. However, the growth rate is slower than for our models in parts c and d: this model produces quadratic growth, whereas the models in parts c and d produce exponential growth.
For the United States GDP series (from global_economy):
- if necessary, find a suitable Box-Cox transformation for the data;
- fit a suitable ARIMA model to the transformed data using ARIMA();
- try some other plausible models by experimenting with the orders chosen;
- choose what you think is the best model and check the residual diagnostics;
- produce forecasts of your fitted model and judge whether they look reasonable;
- compare the results with what you would obtain using ETS() (with no transformation).

A suitable Box-Cox transformation is to take the log of GDP, i.e. use \(\lambda=0\). However, we also investigate the Guerrero algorithm for finding an optimal Box-Cox parameter. It produces \(\lambda=0.28\).
## [1] 0.2819443
Therefore, we will simply use the log transformation in the rest of this exercise.
To fit a suitable ARIMA model, we first perform some EDA and test for stationarity; a search process then selects an ARIMA(1,1,0) model on log(GDP).
Plots P1, P3, P5 show US GDP in raw form, in log form and annual changes of log. Plots P2, P4, P6 show the equivalent output for the US GDP per capita. GDP per capita is obtained by dividing nominal GDP by Population.
Based on the EDA above, we conclude:
Either \(\Delta \log(GDP)\) (plot P5) or \(\Delta \log(\text{GDP per capita})\) (plot P6) could serve as a stationary series; both show little or no significant trend.
The undifferenced series in plots P1-P4 (GDP, GDP per capita and their logs) are clearly trending and are ruled out.
Next, we will see that the unit root tests fail for US GDP over the long sample, 1960-2017. This problem is well known in the econometrics literature. The failing KPSS and ADF tests on the full series are shown below.
In fact, modern short-horizon GDP forecasting at the Federal Reserve Banks of New York and Atlanta takes a completely different approach: a state-space dynamic factor model with around 30 predictor variables, which arrive sporadically in real time as news releases and update the GDP nowcast daily. The model builders' objective is 1-quarter to 2-year-ahead forecasts; in contrast, our dataset and forecast horizon span 10 years or longer.
Links to their model and approach are here: (https://www.newyorkfed.org/research/policy/nowcast/faqs.html)
## # A tibble: 1 x 3
## Country kpss_stat kpss_pvalue
## <fct> <dbl> <dbl>
## 1 United States 0.963 0.01
##
## Augmented Dickey-Fuller Test
##
## data: usgdp$dlogGDP[2:length(usgdp$dlogGDP)]
## Dickey-Fuller = -2.6551, Lag order = 12, p-value = 0.3103
## alternative hypothesis: stationary
A solution is to truncate the data series to observations after 1995. The KPSS test then confirms stationarity of the truncated series, although the ADF test still fails to reject a unit root. This discrepancy between the unit root tests would require deeper consideration, out of scope for this assignment; for now, we model the series as if \(\Delta \log(GDP_t)\) were stationary.
usgdp2 = usgdp %>% filter(Year > 1995)
usgdp2 %>% features( dlogGDP, unitroot_kpss)
## # A tibble: 1 x 3
## Country kpss_stat kpss_pvalue
## <fct> <dbl> <dbl>
## 1 United States 0.334 0.1
Now we fit an ARIMA model with drift on \(X_t = \log(GDP)\), constraining the order of differencing to \(d=1\). We use a search over \(p = 0,\dots,12\) and \(q = 0,\dots,5\).
fit_usgdp = usgdp2 %>% model( m1 = ARIMA( log(GDP) ~ 1 + pdq(p = 0:12, d=1, q = 0:5 ) ) )
report(fit_usgdp)
## Series: GDP
## Model: ARIMA(1,1,0) w/ drift
## Transformation: log(GDP)
##
## Coefficients:
## ar1 constant
## 0.4751 0.0222
## s.e. 0.1885 0.0035
##
## sigma^2 estimated as 0.0003544: log likelihood=56.02
## AIC=-106.04 AICc=-104.63 BIC=-102.91
The search produces an ARIMA(1,1,0) model on \(X_t\), whose equation is:
\[\begin{eqnarray} (1 - \phi_1 B)(1 - B)X_t&=& c + \epsilon_t\\ (1 - \phi_1 B - B + \phi_1B^2) X_t & = & c + \epsilon_t \\ X_t& = & c + \epsilon_t + (1 + \phi_1)X_{t-1} - \phi_1 X_{t-2}\\ X_t& = & 0.0222 + \epsilon_t + (1 + .4751)X_{t-1} - 0.4751X_{t-2} \\ X_t& = & 0.0222 + 1.4751X_{t-1} - 0.4751 X_{t-2} + \epsilon_{t} \end{eqnarray} \]
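As a sanity check, the implied long-run mean of \(\Delta X_t\) is \(c/(1-\phi_1)\), which on the log scale is the model's steady-state growth rate:
0.0222 / (1 - 0.4751)           # ~0.0423 on the log scale
exp(0.0222 / (1 - 0.4751)) - 1  # ~4.3% annual growth, matching the forecast table below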
To find other candidate models, we look at the ACF and PACF of \(Y_t = \Delta X_t = \log(GDP_t) - \log(GDP_{t-1})\). The diagnostic plot suggests both the ACF and the PACF are significant at lag 1 only.
Consequently, we consider 3 candidate models based on ACF and PACF information.
We specify them as ARIMA models on the same variable \(X_t\). We include a drift term because the US economy has a positive growth rate with deviations during economic downturns or recessions.
fit_usgdp = usgdp2 %>% model( m1 = ARIMA( log(GDP) ~ 1 + pdq(p = 0:12, d=1, q = 0:5 ) ) ,
m2 = ARIMA( log(GDP) ~ 1 + pdq(p = 1, d = 1, q = 1 ) ) ,
m3 = ARIMA( log(GDP) ~ 1 + pdq(p = 2, d = 1, q = 2 ) )
)
Their parameters are reported below for completeness.
Model m1 has the following parameters (repeating the output from part b.)
## Series: GDP
## Model: ARIMA(1,1,0) w/ drift
## Transformation: log(GDP)
##
## Coefficients:
## ar1 constant
## 0.4751 0.0222
## s.e. 0.1885 0.0035
##
## sigma^2 estimated as 0.0003544: log likelihood=56.02
## AIC=-106.04 AICc=-104.63 BIC=-102.91
Model m2 has the following parameters. The coefficient values are plausible, but the AICc is inferior.
## Series: GDP
## Model: ARIMA(1,1,1) w/ drift
## Transformation: log(GDP)
##
## Coefficients:
## ar1 ma1 constant
## 0.2884 0.2600 0.0300
## s.e. 0.3499 0.3349 0.0044
##
## sigma^2 estimated as 0.0003652: log likelihood=56.29
## AIC=-104.58 AICc=-102.08 BIC=-100.4
Model m3 has the following parameters. The AR and MA coefficients are unstable: they nearly cancel in pairs, being of similar magnitude and opposite sign (e.g. \(\phi_1 = 1.11\) against \(\theta_1 = -0.94\), and \(\phi_2 = -0.95\) against \(\theta_2 = 1.00\)).
## Series: GDP
## Model: ARIMA(2,1,2) w/ drift
## Transformation: log(GDP)
##
## Coefficients:
## ar1 ar2 ma1 ma2 constant
## 1.1101 -0.9487 -0.9414 0.9999 0.0346
## s.e. 0.2065 0.1009 0.3803 0.5288 0.0034
##
## sigma^2 estimated as 0.0003386: log likelihood=56.92
## AIC=-101.85 AICc=-95.85 BIC=-95.58
We choose model m1 (the ARIMA(1,1,0)) as our best candidate because it:
- achieves the lowest AICc of the three candidates (-104.63, versus -102.08 for m2 and -95.85 for m3);
- uses the fewest estimated parameters; and
- has stable, plausible coefficient estimates.
| Country | .model | sigma2 | AICc |
|---|---|---|---|
| United States | m1 | 0.000354 | -104.63067 |
| United States | m2 | 0.000365 | -102.07837 |
| United States | m3 | 0.000339 | -95.84813 |
We also check their residual diagnostics in the following charts. With only 23 data points the residual plots are noisy and do not look perfectly like white noise; however, the ACF plots show no significant autocorrelations.
We see that all three ARIMA models produce similar mean forecasts but different prediction intervals; all three look reasonable.
In general, model m3 produces the narrowest prediction intervals and model m2 the widest, with our preferred model m1 in between.
The forecasts imply around 4.3% annualized growth, which is somewhat plausible: the current GDP growth forecast from the New York Fed Nowcast model is 3.8% for Q3 2021 (https://www.newyorkfed.org/research/policy/nowcast.html). Over the longer term, however, GDP growth forecasts should be in the range of 1-4 percent, with a 2-3 percent midrange.
US GDP Forecast ($ Trillions, growth in %)
| Year | GDP | growth |
|---|---|---|
| 2018 | 20.21 | NA |
| 2019 | 21.08 | 4.31 |
| 2020 | 22.00 | 4.35 |
| 2021 | 22.96 | 4.37 |
| 2022 | 23.97 | 4.38 |
| 2023 | 25.02 | 4.38 |
| 2024 | 26.11 | 4.38 |
| 2025 | 27.26 | 4.38 |
| 2026 | 28.46 | 4.39 |
| 2027 | 29.70 | 4.39 |
## Series: GDP
## Model: ETS(M,A,N)
## Smoothing parameters:
## alpha = 0.9998997
## beta = 0.0001002624
##
## Initial states:
## l[0] b[0]
## 7.545823e+12 527283884848
##
## sigma^2: 3e-04
##
## AIC AICc BIC
## 1225.410 1229.160 1230.865
The chart above shows 10-year-ahead forecasts from the ARIMA(1,1,0) model (called m1) versus the exponential smoothing model ETS(M,A,N). We see that the ETS model, fitted on the raw GDP scale with an additive trend, projects essentially linear growth, while the ARIMA model on log(GDP) projects compounding growth.
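Concretely, the two mean-forecast functions take different forms (\(\ell_T\) and \(b_T\) are the final ETS level and slope; the ARIMA expression ignores the back-transformation bias adjustment):
\[\hat{y}^{ETS}_{T+h} = \ell_T + h b_T, \qquad \hat{y}^{ARIMA}_{T+h} \approx \exp(\hat{X}_{T+h}),\]
where \(\hat{X}_{T+h}\) grows roughly linearly in \(h\), so the ARIMA forecast compounds while the ETS forecast follows a straight line.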
We do not find the ETS model useful in forecasting GDP.
All of the R code used in this project is collected in this appendix for ease of reading.
knitr::opts_chunk$set(echo = FALSE)
library(fpp3)
library(knitr)
library(tidyverse)
library(kableExtra)
library(cowplot)
library(skimr)
library(ggthemes)
sprintf("T=36 band upper bound: %4.3f", 2/sqrt(36))
sprintf("T=360 band upper bound: %4.3f", 2/sqrt(360))
sprintf("T=1000 band upper bound: %4.3f", 2/sqrt(1000))
gafa_stock %>% filter( Symbol == "AMZN") %>% gg_tsdisplay(Adj_Close, plot_type="partial") + labs(title="AMZN stock price 2014-2018")
gafa_stock %>% filter( Symbol == "AMZN") %>% mutate( dS = difference(log(Adj_Close))) %>%
gg_tsdisplay(dS, plot_type="partial") + labs(title="AMZN stock price 2014-2018")
g = tsibble( t = 1:500, u = rnorm(500, 0, 1), index = t )
g = g %>% mutate( v = t + u )
g %>% gg_tsdisplay(v, plot_type = "partial") + labs(title="Random Walk w Drift")
turkey_gdp = global_economy %>% filter( str_detect(Country, "Turkey") ) %>% select(Country, Year, GDP, Population)
turkey_gdp %>% autoplot(GDP/1e9) + labs(title="Turkey GDP 1960-2017", y = "Billions USD")
turkey_gdp %>% mutate(dlogGDP = difference(log(GDP) ) ) %>% autoplot(dlogGDP)+ labs(title="Change Log(GDP) Turkey")
turkey_gdp2 = turkey_gdp %>% slice(2:n() )
turkey_gdp2 = turkey_gdp2 %>% mutate( dlogGDP = difference(log(GDP)))
gg_tsdisplay(turkey_gdp2, y=dlogGDP, plot_type = "partial")
turkey_gdp2 %>% features( dlogGDP , unitroot_kpss)
tseries::adf.test(turkey_gdp2$dlogGDP[2:57] )
tas_taking = aus_accommodation %>% filter( State == 'Tasmania' )
tas_taking %>% autoplot(Takings) + labs(title="Tasmania Accommodation Takings")
(lambda <- tas_taking %>% features( Takings, features = guerrero) %>% pull(lambda_guerrero) )
tas_taking = tas_taking %>% mutate( bcTakings = box_cox(Takings, lambda) )
tas_taking %>% autoplot(bcTakings) + labs(title= "Box Cox Transformed with -0.0488")
tas_taking %>% gg_season( bcTakings) + labs(title="Y(t) the Box-Cox Transformed Series Seasonality Plot")
# difference the Box-Cox transformed series (bcTakings), as defined in the text
tas_deseasoned = tas_taking %>% mutate( dS = difference(bcTakings, lag = 4) , X = difference( dS, lag = 1) )
tas_deseasoned %>% autoplot(X) + labs(title = "Deseasoned, DeTrended, Box-Cox Adjusted Takings", x = "Quarter", y = "Transformed(Takings)")
tas_deseasoned %>% features(X, unitroot_kpss)
tseries::adf.test(tas_deseasoned$X[!is.na(tas_deseasoned$X)] , k = 6)
souvenirs %>% autoplot(Sales) + labs(title = "Souvenirs- Monthly Sales")
souvenirs %>% mutate( logSales = log(Sales), dt = difference(logSales), dS = difference(dt, lag = 12) ) -> souv
souv %>% autoplot(logSales) + labs(title="Log Sales")
souv %>% autoplot(dt) + labs(title = "Detrended log Sales")
souv %>% autoplot(dS) + labs(title = "Detrended, Deseasoned log Sales")
souv %>% features(dS, unitroot_kpss )
tseries::adf.test(souv$dS[!is.na(souv$dS)], k=6)
set.seed(392381)
myseries <- aus_retail %>%
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
dcmp <- myseries %>% model(stl= STL(Turnover))
stl_info = components(dcmp) %>% as_tsibble()
autoplot(stl_info, Turnover) +
labs(title="Retail Turnover of Pharmaceutical, cosmetic, toiletry goods",
subtitle="Western Australia",
y = "AUD$ Millions") + geom_line(aes(y=trend), color="red")
lambda <- myseries %>%
features(Turnover, features = guerrero) %>%
pull(lambda_guerrero)
lambda
myseries$BCTurnover = box_cox(myseries$Turnover, lambda )
dcmpBC <- myseries %>% model( stl=STL(BCTurnover))
stl_infoBC = components(dcmpBC) %>% as_tsibble()
autoplot(stl_infoBC, BCTurnover) +
geom_line(aes(y=trend), color="red") +
ggtitle("Box Cox Adjusted - Sales Turnover - Western Australia")
myseries = myseries %>% mutate( dS = difference(BCTurnover, lag = 12 ), X = difference( dS, lag = 1 ) )
myseries %>% autoplot(X) + labs(title="Detrended, deseasonalized, Box-Cox adjusted Turnover data")
myseries %>% features( X, unitroot_kpss)
tseries::adf.test( myseries$X[!is.na(myseries$X)] )
myseries %>% features( Turnover, unitroot_kpss)
tseries::adf.test( myseries$Turnover )
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
set.seed(100)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
sim %>% autoplot( y ) + labs(title="Simulated AR(1) model: 100 points from the series")
set.seed(100)
y <- numeric(100)
w <- numeric(100)
u <- numeric(100)
z <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
{
y[i] <- 0.6 * y[i-1] + e[i]
w[i] <- 0.05 * w[i-1] + e[i]
u[i] <- 0.99 * u[i-1] + e[i]
z[i] <- 1.05 * z[i-1] + e[i]
}
sim <- tsibble(idx = seq_len(100), y = y, w = w, u = u, z = z, index = idx)
p1 <- sim %>% autoplot(w) + labs( title = "AR(1) phi = 0.05")
p2 <- sim %>% autoplot(y) + labs( title = "AR(1) phi = 0.6")
p3 <- sim %>% autoplot(u) + labs( title = "AR(1) phi = 0.99")
p4 <- sim %>% autoplot(z) + labs( title = "AR(1) phi = 1.05")
plot_grid(p1, p2, p3, p4, ncol = 2)
set.seed(100)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
sim %>% autoplot(y) + labs(title="MA(1) model with 0.6 coefficient")
set.seed(100)
y <- numeric(100)
w <- numeric(100)
u <- numeric(100)
z <- numeric(100)
r <- numeric(100)
s <- numeric(100)
e <- rnorm(100)
thetas = c( -1.0, -0.5, 0.05 , 0.5, 0.99 , 1.3 )
for(i in 2:100)
{
u[i] <- thetas[1] * e[i-1] + e[i]
w[i] <- thetas[2] * e[i-1] + e[i]
y[i] <- thetas[3] * e[i-1] + e[i]
z[i] <- thetas[4] * e[i-1] + e[i]
r[i] <- thetas[5] * e[i-1] + e[i]
s[i] <- thetas[6] * e[i-1] + e[i]
}
sim <- tsibble(idx = seq_len(100), u = u, w = w, y = y, z = z, r = r, s = s, index = idx)
p1 <- sim %>% autoplot(u) + labs( title = sprintf("MA(1) theta = %0.2f", thetas[1] ) , x="") + ylim(-4, 4)
p2 <- sim %>% autoplot(w) + labs( title = sprintf("MA(1) theta = %0.2f", thetas[2] ) , x="") + ylim(-4, 4)
p3 <- sim %>% autoplot(y) + labs( title = sprintf("MA(1) theta = %0.2f", thetas[3] ) , x="") + ylim(-4,4)
p4 <- sim %>% autoplot(z) + labs( title = sprintf("MA(1) theta = %0.2f", thetas[4] ) , x="") + ylim(-4,4)
p5 <- sim %>% autoplot(r) + labs( title = sprintf("MA(1) theta = %0.2f", thetas[5] ) , x="") + ylim(-4,4)
p6 <- sim %>% autoplot(s) + labs( title = sprintf("MA(1) theta = %0.2f", thetas[6] ) , x="") + ylim(-4,4)
plot_grid(p1, p2, p3, p4, p5, p6, ncol = 3)
sqrt(diag(var(sim[2:7])))  # sample standard deviations of the six MA(1) series
set.seed(100)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6 * y[i-1] + 0.6*e[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
head(sim) %>% kable(caption="ARMA(1,1) realization", digits = 3) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position="left")
set.seed(100)
zz <- numeric(100)
ez <- rnorm(100)
for(i in 3:100)
zz[i] <- -0.8 * zz[i-1] + 0.3*zz[i-2] + ez[i]
simz <- tsibble(idx = seq_len(100), zz = zz, index = idx)
head(simz) %>% kable(caption="AR(2) realization", digits = 3) %>%
kable_styling(bootstrap_options = c("hover", "striped"), position="left")
sim %>% autoplot(y) + labs(title="ARMA(1,1) model")
simz %>% autoplot(zz) + labs(title="AR(2) model non-stationary")
ausp = aus_airpassengers
stl <- ausp %>% model( stl=STL(Passengers))
stl_log <- ausp %>% model( stl=STL(log(Passengers)))
cmp_stl = components(stl)
cmp_stl_log = components(stl_log)
p1 = cmp_stl %>% as_tsibble() %>% autoplot(Passengers) +
geom_line( aes(y = trend), color="red") +
labs(title = "P1 -Aus Air Passengers 1970-2016", y = "Millions Passengers" )
p2 = cmp_stl_log %>% as_tsibble() %>% autoplot(`log(Passengers)`) +
geom_line( aes(y = trend), color="red") +
labs(title = "P2 -Log(Aus Air Passengers) 1970-2016", y = "log(Millions Passengers)" ) +
theme_dark()
p3 = cmp_stl %>% autoplot() + labs(title = "P3 -STL Aus Air Passengers")
p4 = cmp_stl_log %>% autoplot() + labs(title = "P4 -STL Log(Aus Air Passengers)") + theme_dark()
plot_grid(p1, p2, p3, p4, ncol = 2)
fit_default = ausp %>% model(ARIMA(Passengers))
report(fit_default)
ausp %>% mutate(diff_pass = difference(Passengers), diff2 = difference(diff_pass) ) %>%
features( diff2, unitroot_kpss )
fit_default %>% gg_tsresiduals()
augment(fit_default) %>% features(.innov, ljung_box, lag = 10)
fc_default = fit_default %>% forecast(h=10)
fc_default %>% autoplot(ausp)
fit_log = ausp %>% model(ARIMA(log(Passengers) ~ 1 + pdq(0,1,0) ) )
report(fit_log)
ausp %>% mutate(diff_logpass = difference(log(Passengers) )) %>%
features( diff_logpass, unitroot_kpss )
fit_log %>% gg_tsresiduals()
augment(fit_log) %>% features(.innov, ljung_box, lag = 10)
fc_log = fit_log %>% forecast(h=10)
fc_log %>% autoplot(ausp) + labs(title = "10 year forecast of Aus passengers",
subtitle = "ARIMA(0,1,0) on log(Passengers)" ,
y = "log(Millions Passengers)" )
fit_arima212c = ausp %>% model(ARIMA(log(Passengers) ~ 1 + pdq(2,1,2) ) )
report(fit_arima212c)
augment(fit_arima212c) %>% features(.innov, ljung_box, lag = 10)
fit_arima212c %>% gg_tsresiduals()
fc_arima212c = fit_arima212c %>% forecast(h=10)
fc_arima212c %>% autoplot(ausp) + labs(title="ARIMA(2,1,2) w/drift on Log(Passengers)")
compare_fit = ausp %>% model(arima212c = ARIMA(log(Passengers) ~ 1 + pdq(2,1,2) ) ,
log = ARIMA(log(Passengers) ~ 1 + pdq(0,1,0) )
)
compare_fit %>% forecast(h=10) %>% autoplot(ausp %>% filter(Year > 2000 ), level = NULL) +
labs(title="Comparing ARIMA(2,1,2) and ARIMA(0,1,0) forecasts with drift")
fit_arima212 = ausp %>% model(ARIMA(log(Passengers) ~ 0 + pdq(2,1,2) ) )
report(fit_arima212)
fit_arima021c = ausp %>% model(ARIMA(Passengers ~ 1 + pdq(0, 2, 1) ) )
report(fit_arima021c)
fc_arima021c = fit_arima021c %>% forecast(h=10)
fc_arima021c %>% autoplot(ausp)
usgdp = global_economy %>% filter( Country == 'United States') %>%
mutate( GDPc = GDP/Population,
logGDP = log(GDP),
logGDPc = log(GDPc),
dlogGDP = difference(log(GDP)),
dlogGDPc = difference(log(GDPc)))
(lambda <- usgdp %>% features( GDP, features = guerrero) %>% pull(lambda_guerrero) )
bc = usgdp %>% mutate( bcGDP = box_cox(GDP, lambda))
p1 = bc %>% autoplot(bcGDP) + labs(title="Box-Cox 0.28")
p2 = bc %>% autoplot(logGDP) + labs(title = "Log(GDP)")
plot_grid(p1, p2, ncol= 2)
p1 <- usgdp %>% autoplot(GDP) + labs(title = "P1 - US GDP 1960-2017")
p2 <- usgdp %>% autoplot(GDPc) + labs(title = "P2 - US GDP per capita 1960-2017")
p3 <- usgdp %>% autoplot(logGDP) + labs(title = "P3 - US log(GDP)")
p4 <- usgdp %>% autoplot(logGDPc) + labs(title = "P4 - US log(GDP per capita)")
p5 <- usgdp %>% autoplot(dlogGDP) + labs(title = "P5 - US Change Log(GDP)")
p6 <- usgdp %>% autoplot(dlogGDPc) + labs(title= "P6 - US Change Log(GDP per capita)")
plot_grid(p1, p2, p3, p4 , p5 , p6, ncol = 2)
usgdp %>% features( dlogGDP, unitroot_kpss )
tseries::adf.test(usgdp$dlogGDP[2:length(usgdp$dlogGDP)], k = 12)
usgdp2 = usgdp %>% filter(Year > 1995)
usgdp2 %>% features( dlogGDP, unitroot_kpss)
fit_usgdp = usgdp2 %>% model( m1 = ARIMA( log(GDP) ~ 1 + pdq(p = 0:12, d=1, q = 0:5 ) ) )
report(fit_usgdp)
fc_usgdp = fit_usgdp %>% forecast(h = 10)
fc_usgdp %>% autoplot(usgdp2)
usgdp2 %>% gg_tsdisplay(dlogGDP, plot_type = "partial")
fit_usgdp = usgdp2 %>% model( m1 = ARIMA( log(GDP) ~ 1 + pdq(p = 0:12, d=1, q = 0:5 ) ) ,
m2 = ARIMA( log(GDP) ~ 1 + pdq(p = 1, d = 1, q = 1 ) ) ,
m3 = ARIMA( log(GDP) ~ 1 + pdq(p = 2, d = 1, q = 2 ) )
)
report(fit_usgdp %>% select(m1) )
report(fit_usgdp %>% select(m2) )
report(fit_usgdp %>% select(m3) )
glance(fit_usgdp) %>% select(Country, .model, sigma2, AICc) %>% kable(digits = 6) %>%
kable_styling(bootstrap_options = c("hover", "striped"))
fit_usgdp %>% select(m1) %>% gg_tsresiduals() + labs(title = "M1 = ARIMA(1,1,0)")
fit_usgdp %>% select(m2) %>% gg_tsresiduals() + labs(title = "M2 = ARIMA(1,1,1)")
fit_usgdp %>% select(m3) %>% gg_tsresiduals() + labs(title = "M3 - ARIMA(2,1,2)")
p1 = fit_usgdp %>% forecast(h = 10) %>% autoplot(usgdp2) + labs(title = "With Confidence Intervals")
p2 = fit_usgdp %>% forecast(h = 10) %>% autoplot(usgdp2, level = NULL) + labs(title = "Mean Forecasts Only")
plot_grid(p1, p2)
fc = fit_usgdp %>% select(m1) %>% forecast(h = 10 )
unpack_hilo( hilo(fc) ) %>% select( Year, .mean) %>% rename( gdp = `.mean`) %>%
mutate( growth = 100 * (gdp/lag(gdp)-1) ) %>% as_tibble() %>% select( Year, gdp, growth) -> aa
aa %>% mutate( GDP = gdp/1e12 ) %>% select( Year, GDP, growth ) %>%
kable(digits = 2, caption = "US GDP Forecast Trillions" ) %>%
kable_styling(bootstrap_options = c("hover", "striped"))
fit_usgdp = usgdp2 %>% model( ets1 = ETS( GDP ) ,
m1 = ARIMA( log(GDP) ~ 1 + pdq(p = 0:12, d=1, q = 0:5 ) ) )
report(fit_usgdp %>% select(ets1))
fit_usgdp %>% forecast(h = 10) %>% autoplot(usgdp2 , level = NULL ) +
labs(title = "GDP Prediction ETS(M,A,N) vs. ARIMA(1,1,0)")