library(fpp3)
In this homework assignment I work through exercises 9.1, 9.2, 9.3, 9.5, 9.6, 9.7, and 9.8 from Hyndman and Athanasopoulos's online textbook Forecasting: Principles and Practice.
Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.
knitr::include_graphics("C:\\Users\\chung\\School\\624\\wnacfplus-1.png")
Each of these figures indicates that the data are white noise: in all three, no autocorrelation coefficient falls outside the critical-value bounds. The difference among the figures is that the bounds (dashed lines) are wider when fewer random numbers are generated.
The critical-value distance from zero differs across the figures because of the different sample sizes (36, 360, and 1,000). In essence, as the sample size increases, a smaller correlation is sufficient to conclude that a series is not white noise. Given that the 95% critical values are ±1.96/sqrt(N), the bounds shrink as N (the sample size) increases.
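As a quick sanity check of my own, the bounds for each sample size can be computed directly:
1.96 / sqrt(c(36, 360, 1000))
# roughly 0.327, 0.103 and 0.062, matching the narrowing dashed lines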
A classic example of a non-stationary series is stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.
amzn_stock <- gafa_stock %>%
filter(Symbol == "AMZN")
amzn_stock %>%
gg_tsdisplay(Close, plot_type = 'partial')
A stationary series has no trend or seasonality. The time plot shows a clear upward trend over the majority of the years covered. The ACF plot confirms the trend: the autocorrelations decrease very slowly from lag to lag, whereas for stationary data the ACF should drop to zero quickly and fluctuate around it. In the PACF plot, at lag k = 1 there is no earlier correlation to remove, so the lag-1 PACF equals the lag-1 ACF; here it is very close to 1, which again indicates trend and the need for differencing.
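As an extra check of my own (assuming gg_tsdisplay handles the differenced series the same way it handled the raw prices), the first differences of the closing price look far more like white noise:
amzn_stock %>%
  gg_tsdisplay(difference(Close), plot_type = 'partial')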
For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.
turkish_gdp <- global_economy %>%
filter(Country == "Turkey")
turkish_gdp %>%
gg_tsdisplay(GDP, plot_type = 'partial')
First I will find the appropriate Box-Cox transformation with the Guerrero feature.
lambda <- turkish_gdp %>%
features(GDP, features = guerrero) %>%
pull(lambda_guerrero)
lambda
## [1] 0.1572187
The optimal lambda for the Box-Cox transformation is 0.1572187.
turkish_gdp %>%
features(box_cox(GDP, lambda), unitroot_ndiffs)
## # A tibble: 1 × 2
## Country ndiffs
## <fct> <int>
## 1 Turkey 1
After transformation we find that the appropriate order of differencing is 1. The Turkish GDP series is annual, so no seasonal differencing is required.
## Showing the time plot, ACF and PACF after the Box-Cox transformation and differencing
turkish_gdp %>%
gg_tsdisplay(difference(box_cox(GDP, lambda), 1), plot_type = 'partial')
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
The plots above now show stationary data.
tasmania_acc <- aus_accommodation %>%
filter(State == "Tasmania")
tasmania_acc %>%
gg_tsdisplay(Takings, plot_type = 'partial')
The plots above show trend and seasonality. First I will apply a Box-Cox transformation, then seasonal differencing.
lambda <- tasmania_acc %>%
features(Takings, features = guerrero) %>%
pull(lambda_guerrero)
lambda
## [1] 0.001819643
The optimal lambda for this Box-Cox transformation is 0.001819643. Since lambda is so close to zero, the transformation is essentially a log transformation.
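A quick numerical check of my own that the Box-Cox and log transformations nearly agree at this lambda:
tasmania_acc %>%
  mutate(bc = box_cox(Takings, lambda), lg = log(Takings)) %>%
  as_tibble() %>%
  summarise(max_abs_diff = max(abs(bc - lg)))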
tasmania_acc %>%
gg_tsdisplay(box_cox(Takings, lambda), plot_type = 'partial')
tasmania_acc %>%
features(box_cox(Takings, lambda), unitroot_nsdiffs)
## # A tibble: 1 × 2
## State nsdiffs
## <chr> <int>
## 1 Tasmania 1
After the Box-Cox transformation we still see trend, and unitroot_nsdiffs suggests one order of seasonal differencing. I will apply a seasonal difference (lag 4 for quarterly data).
tasmania_acc %>%
gg_tsdisplay(difference(box_cox(Takings, lambda), 4), plot_type = 'partial')
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
tasmania_acc %>%
features(difference(box_cox(Takings, lambda), 4), unitroot_ndiffs)
## # A tibble: 1 × 2
## State ndiffs
## <chr> <int>
## 1 Tasmania 0
After the Box-Cox transformation and one seasonal difference (lag 4 for quarterly data), the data appear stationary, with unitroot_ndiffs indicating that no additional differencing is needed.
monthly_sales <- souvenirs
monthly_sales %>%
gg_tsdisplay(Sales)
The plots above show strong seasonality; the ACF has a pronounced spike at lag 12 (the yearly seasonal lag for monthly data).
lambda <- monthly_sales %>%
features(Sales, features = guerrero) %>%
pull(lambda_guerrero)
lambda
## [1] 0.002118221
The appropriate lambda for the Box-Cox transformation of the monthly sales data is 0.002118221, again essentially a log transformation.
monthly_sales %>%
gg_tsdisplay(box_cox(Sales, lambda))
monthly_sales %>%
features(box_cox(Sales, lambda), unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
The trend and seasonality (sales increase near the end of each year) indicate that I should apply a seasonal difference first and then check whether additional differencing is needed.
monthly_sales %>%
gg_tsdisplay(difference(box_cox(Sales, lambda), 12))
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
monthly_sales %>%
features(difference(box_cox(Sales, lambda), 12), unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
The monthly sales data are now stationary after a Box-Cox transformation and a first-order seasonal difference.
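As an additional check of my own, a KPSS test on the transformed and seasonally differenced series; a large kpss_pvalue (it is capped at 0.1) would support the stationarity conclusion.
monthly_sales %>%
  features(difference(box_cox(Sales, lambda), 12), unitroot_kpss)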
For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.
set.seed(123)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
## Displaying time series data. Here we see it is monthly, has a positive trend, and shows seasonality (increasing turnover near the end of the year).
myseries %>%
gg_tsdisplay()
## Plot variable not specified, automatically selected `y = Turnover`
I will find the appropriate Box-Cox transformation with the Guerrero feature, then perform a first-order seasonal difference. I will then reassess whether additional differencing is needed.
lambda <- myseries %>%
features(Turnover, features = guerrero) %>%
pull(lambda_guerrero)
lambda
## [1] 0.2151641
The appropriate lambda is 0.2151641. I will apply the Box-Cox transformation and seasonal differencing, then show gg_tsdisplay again.
myseries %>%
gg_tsdisplay(difference(box_cox(Turnover, lambda), 12))
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
myseries %>%
features(difference(box_cox(Turnover, lambda), 12), unitroot_nsdiffs)
## # A tibble: 1 × 3
## State Industry nsdiffs
## <chr> <chr> <int>
## 1 Victoria Household goods retailing 0
After applying the Box-Cox transformation and seasonal differencing, the plots above show that the trend has been removed. The ACF plot is interesting, however: many autocorrelations still fall outside the critical values, though without an obvious seasonal pattern; this may reflect cyclicity in the data.
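As a follow-up check of my own, unitroot_ndiffs on the transformed and seasonally differenced series confirms whether an additional first difference is needed:
myseries %>%
  features(difference(box_cox(Turnover, lambda), 12), unitroot_ndiffs)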
Simulate and plot some data from simple ARIMA models.
set.seed(123)
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
set.seed(123)
y2 <- numeric(100)
y3 <- numeric(100)
e2 <- rnorm(100)
e3 <- rnorm(100)
for(i in 2:100){
y2[i] <- -0.6*y2[i-1] + e2[i]
y3[i] <- 0.9*y3[i-1] + e3[i]
}
sim <- tsibble(idx = seq_len(100), y = y, y2 = y2, y3 = y3, index = idx)
plot1 <- sim %>% autoplot(y2) + labs( title = "Phi = -0.6")
plot2 <- sim %>% autoplot(y) + labs( title = "Phi = 0.6")
plot3 <- sim %>% autoplot(y3) + labs( title = "Phi = 0.9")
library(gridExtra)
grid.arrange(plot1, plot2, plot3, nrow = 2)
In the above data, where there is no constant: when phi = 0 the AR(1) model reduces to white noise, since only the error influences y. As phi approaches 1, the weight on the previous observation grows, smoothing the series; at phi = 1 with no constant the model becomes a random walk. As phi becomes more and more negative, the plot oscillates, because y_t is correlated with the negative of y_t-1.
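To illustrate the phi = 0 case described above (my own addition, mirroring the simulation code from the exercise):
set.seed(123)
y0 <- numeric(100)
e0 <- rnorm(100)
for(i in 2:100)
  y0[i] <- 0*y0[i-1] + e0[i]  # reduces to y0[i] <- e0[i]: pure white noise
tsibble(idx = seq_len(100), y0 = y0, index = idx) %>%
  autoplot(y0) + labs(title = "Phi = 0 (white noise)")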
set.seed(123)
y4 <- numeric(100)
e4 <- rnorm(100)
for(i in 2:100)
y4[i] <- 0.6*e4[i-1] + e4[i]
sim <- tsibble(idx = seq_len(100), y4 = y4, index = idx)
set.seed(123)
y5 <- numeric(100)
y6 <- numeric(100)
e5 <- rnorm(100)
e6 <- rnorm(100)
for(i in 2:100){
y5[i] <- -0.6*e5[i-1] + e5[i]
y6[i] <- 0.9*e6[i-1] + e6[i]
}
sim2 <- tsibble(idx = seq_len(100), y4 = y4, y5 = y5, y6 = y6, index = idx)
plot4 <- sim2 %>% autoplot(y5) + labs( title = "Theta = -0.6")
plot5 <- sim2 %>% autoplot(y4) + labs( title = "Theta = 0.6")
plot6 <- sim2 %>% autoplot(y6) + labs( title = "Theta = 0.9")
grid.arrange(plot4, plot5, plot6, nrow = 2)
With no constant, as theta increases from 0 to 1, the influence of the previous observation's forecast error increases and smooths the series. If theta = 0 we get a white-noise plot, as y equals the current error alone. When theta is negative the plot oscillates, as y incorporates the negative of the previous observation's error.
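As a check of my own on the MA(1) signature: the ACF of an MA(1) series should show a single significant spike at lag 1 (about theta/(1 + theta^2), roughly 0.44 for theta = 0.6) and nothing beyond.
sim2 %>% ACF(y4) %>% autoplot()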
set.seed(123)
y7 <- numeric(100)
e7 <- rnorm(100)
for(i in 2:100)
y7[i] <- 0.6*y7[i-1] + 0.6*e7[i-1] + e7[i]
sim3 <- tsibble(idx = seq_len(100), y7 = y7, index = idx)
set.seed(123)
y8 <- numeric(100)
e8 <- rnorm(100)
for(i in 3:100)
y8[i] <- -0.8*y8[i-1] + 0.3*y8[i-2] + e8[i]
sim4 <- tsibble(idx = seq_len(100), y8 = y8, index = idx)
plot7 <- sim3 %>% autoplot(y7) + labs( title = "ARMA(1,1)")
plot8 <- sim4 %>% autoplot(y8) + labs( title = "AR(2)")
grid.arrange(plot7, plot8, nrow = 2)
In our ARMA(1,1) we get a relatively smooth series. We gave phi and theta equal weights, so neither the previous observation nor the previous error dominates, and because both are positive we observe no oscillating behavior. In our AR(2) model the variance increases over time and the series oscillates around 0. The oscillation comes from the negative phi1 value, and the growing variance arises because the parameters violate the AR(2) stationarity conditions (here phi2 - phi1 = 1.1 > 1), so the process is non-stationary.
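A quick root check of my own confirms the non-stationarity claim: the AR polynomial is 1 - phi1*B - phi2*B^2 with phi1 = -0.8 and phi2 = 0.3, and stationarity requires every root to lie outside the unit circle.
Mod(polyroot(c(1, 0.8, -0.3)))
# one root has modulus below 1 (about 0.93), so the process is non-stationary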
Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.
fit <- aus_airpassengers %>%
model(ARIMA(Passengers))
report(fit)
## Series: Passengers
## Model: ARIMA(0,2,1)
##
## Coefficients:
## ma1
## -0.8963
## s.e. 0.0594
##
## sigma^2 estimated as 4.308: log likelihood=-97.02
## AIC=198.04 AICc=198.32 BIC=201.65
The automatically selected model is ARIMA(0,2,1): a second-order differenced MA(1) model.
fit %>% gg_tsresiduals()
The residuals look like white noise: no apparent pattern shows in the time plot, all autocorrelations are within the critical values, and the histogram of the residuals is approximately normal.
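A Ljung-Box test (my own addition) provides a formal check; with dof set to 1 for the single estimated MA coefficient, a large p-value supports the white-noise conclusion.
fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 10, dof = 1)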
Plotting the forecast for 10 years using ARIMA(0,2,1):
fit %>% forecast(h=10) %>%
autoplot(aus_airpassengers)
The model expressed with the backshift operator is (1 − B)^2 y_t = (1 + θ1 B) ε_t, i.e. (1 − B)^2 y_t = ε_t + θ1 ε_{t−1}, with θ1 = −0.8963 from the report above.
fit2 <- aus_airpassengers %>%
model(ARIMA(Passengers ~ 1 + pdq(0,1,0)))
report(fit2)
## Series: Passengers
## Model: ARIMA(0,1,0) w/ drift
##
## Coefficients:
## constant
## 1.4191
## s.e. 0.3014
##
## sigma^2 estimated as 4.271: log likelihood=-98.16
## AIC=200.31 AICc=200.59 BIC=203.97
fit2 %>% forecast(h=10) %>%
autoplot(aus_airpassengers)
The two forecasts both point upward and continue the recent trend. Our ARIMA(0,1,0) is a random walk with drift due to the constant. Our ARIMA(0,2,1) produces higher point forecasts, and the two models show generally similar prediction intervals.
fit3 <- aus_airpassengers %>%
model(ARIMA(Passengers ~ 1 + pdq(2,1,2)))
report(fit3)
## Series: Passengers
## Model: ARIMA(2,1,2) w/ drift
##
## Coefficients:
## ar1 ar2 ma1 ma2 constant
## -0.5518 -0.7327 0.5895 1.0000 3.2216
## s.e. 0.1684 0.1224 0.0916 0.1008 0.7224
##
## sigma^2 estimated as 4.031: log likelihood=-96.23
## AIC=204.46 AICc=206.61 BIC=215.43
fit3 %>% forecast(h=10) %>%
autoplot(aus_airpassengers)
Comparing this model with the models in parts (a) and (c) shows very similar forecasts. The ARIMA(2,1,2) point forecasts show somewhat more variation, likely due to the second-order AR and MA components.
fit4 <- aus_airpassengers %>%
model(ARIMA(Passengers ~ 0 + pdq(2,1,2)))
## Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2))
## [1] non-stationary AR part from CSS
Removing the constant results in an error. This makes sense: the data show a clear upward trend, and without a constant the model cannot account for the drift.
fit5 <- aus_airpassengers %>%
model(ARIMA(Passengers ~ 1 + pdq(0,2,1)))
## Warning: Model specification induces a quadratic or higher order polynomial trend.
## This is generally discouraged, consider removing the constant or reducing the number of differences.
report(fit5)
## Series: Passengers
## Model: ARIMA(0,2,1) w/ poly
##
## Coefficients:
## ma1 constant
## -1.0000 0.0571
## s.e. 0.0585 0.0213
##
## sigma^2 estimated as 3.855: log likelihood=-95.1
## AIC=196.21 AICc=196.79 BIC=201.63
fit5 %>% forecast(h=10) %>%
autoplot(aus_airpassengers)
Fitting an ARIMA(0,2,1) with a constant triggers a warning from R that a quadratic or higher-order polynomial trend is induced. This happens because a constant combined with second-order differencing accumulates twice when integrating back from the differenced series to the original scale, producing a quadratic trend in the levels.
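A toy demonstration of my own of why this happens: integrating a constant once gives a linear sequence, while integrating it twice gives a quadratic one.
const <- 0.5
cumsum(rep(const, 10))          # linear in t
cumsum(cumsum(rep(const, 10)))  # grows like const * t^2 / 2: quadratic in t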
For the United States GDP series (from global_economy):
US_gdp <- global_economy %>%
filter(Code == "USA") %>%
select(GDP)
US_gdp %>% autoplot()
## Plot variable not specified, automatically selected `.vars = GDP`
lambda <- US_gdp %>%
features(GDP,features=guerrero) %>%
pull(lambda_guerrero)
lambda
## [1] 0.2819443
The appropriate lambda for the Box-Cox transformation is 0.2819443.
US_gdp %>%
autoplot(box_cox(GDP, lambda))
Although the variance was not changing much, the Box-Cox transformation straightened the curvature of the previous plot and reduced the steepness of the dip around 2008.
fit6 <- US_gdp %>%
model(ARIMA(box_cox(GDP, lambda)))
report(fit6)
## Series: GDP
## Model: ARIMA(1,1,0) w/ drift
## Transformation: box_cox(GDP, lambda)
##
## Coefficients:
## ar1 constant
## 0.4586 118.1822
## s.e. 0.1198 9.5047
##
## sigma^2 estimated as 5479: log likelihood=-325.32
## AIC=656.65 AICc=657.1 BIC=662.78
The ARIMA() function indicates an ARIMA(1,1,0) model with drift is appropriate.
US_gdp %>%
features(box_cox(GDP, lambda), unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
fit_others <- US_gdp %>%
model(
"ARIMA(2,1,2)" = ARIMA(box_cox(GDP, lambda) ~ 1 + pdq(2,1,2)),
"ARIMA(0,1,0)" = ARIMA(box_cox(GDP, lambda) ~ 1 + pdq(0,1,0)),
"ARIMA(1,1,0)" = ARIMA(box_cox(GDP, lambda) ~ 1 + pdq(1,1,0)),
"ARIMA(1,1,1)" = ARIMA(box_cox(GDP, lambda) ~ 1 + pdq(1,1,1))
)
fit_others %>% glance()
## # A tibble: 4 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ARIMA(2,1,2) 5734. -325. 662. 664. 674. <cpl [2]> <cpl [2]>
## 2 ARIMA(0,1,0) 6774. -332. 668. 668. 672. <cpl [0]> <cpl [0]>
## 3 ARIMA(1,1,0) 5479. -325. 657. 657. 663. <cpl [1]> <cpl [0]>
## 4 ARIMA(1,1,1) 5580. -325. 659. 659. 667. <cpl [1]> <cpl [1]>
For the other models I chose to include a constant in each, given the strong trend in the time plot, and because of the constant I kept the differencing at first order. Of the models tested, the automatically selected ARIMA(1,1,0) with drift performed best, showing the lowest AICc.
fit6 %>% gg_tsresiduals()
The residuals are evenly distributed with no apparent patterns. The ACF plot shows all autocorrelations within the critical values. The residuals look like white noise.
fit6 %>%
forecast(h=10) %>%
autoplot(US_gdp)
The forecast follows the established trend; however, a GDP that keeps accelerating indefinitely does not seem reasonable to me, so the forecast becomes less believable the further it extends into the future.
models <- US_gdp %>%
model(
"ARIMA(1,1,0)" = ARIMA(box_cox(GDP, lambda) ~ 1 + pdq(1,1,0)),
ETS = ETS(GDP)
)
models %>% glance()
## # A tibble: 2 × 11
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots MSE AMSE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list> <dbl> <dbl>
## 1 ARIMA(1… 5.48e+3 -325. 657. 657. 663. <cpl> <cpl> NA NA
## 2 ETS 6.78e-4 -1590. 3191. 3192. 3201. <NULL> <NULL> 2.79e22 1.20e23
## # ℹ 1 more variable: MAE <dbl>
models %>%
forecast(h = 10) %>%
autoplot(US_gdp)
The prediction intervals for the ETS model are far wider than those of the ARIMA model, and the ETS model forecasts slower GDP growth. One caution on the glance() table above: the AICc values are not directly comparable between the two models, because the ARIMA model is fitted to the Box-Cox transformed data while the ETS model is fitted to the raw series, so their likelihoods are on different scales. A fair comparison requires a held-out test set or time series cross-validation.
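As a fairer comparison, here is a minimal sketch of my own that holds out the last 10 years of the series and compares test-set accuracy, which, unlike AICc, is comparable across model classes:
train <- US_gdp %>%
  filter_index(. ~ "2007")
train %>%
  model(
    "ARIMA(1,1,0)" = ARIMA(box_cox(GDP, lambda) ~ 1 + pdq(1,1,0)),
    ETS = ETS(GDP)
  ) %>%
  forecast(h = 10) %>%
  accuracy(US_gdp)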