library(fpp3)
## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.4
## ✔ dplyr 1.1.2 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.0 ✔ feasts 0.3.1
## ✔ lubridate 1.9.2 ✔ fable 0.3.3
## ✔ ggplot2 3.4.4 ✔ fabletools 0.3.4
## Warning: package 'tsibble' was built under R version 4.3.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
Do the exercises 9.1, 9.2, 9.3, 9.5, 9.6, 9.7, 9.8 in Hyndman. Please submit both the Rpubs link as well as your .rmd file.
Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.
Among these figures the boundaries and the length of the lags decrease as the number of random numbers increase. All the data is between the blue dashed lines therefore it indicates that the data are white noise.
The critical values are at different distances from the mean of zero because of the amount of data in each time series. The larger the series the smaller the critical value and the less correlated the values become. For the smaller set of random numbers, there is a stronger positive or negative correlation between the lag values. The autocorrelations differ in each figure because of the amount of random numbers in the given plot. It is all white noise however, because the numbers are random.
A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.
gafa_stock |>
filter(Symbol == "AMZN") |>
gg_tsdisplay(Close, plot_type='partial') +
labs(title = "Amazon Closing Stock Price")
## Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
## Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
The plots above show that the Amazon Closing Price is non stationary because the ACF plot has lag values that are large and decreasing slowly. A stationary time series would have an ACF plot that quickly converges to zero. The time series plot shows an increasing trend, therefore it is not stationary. The PACF for a stationary time series would be 0 for all of the lags. The PACF graph in this example shows a lag value at 1.
For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.
global_economy |>
filter (Country == "Turkey") |>
gg_tsdisplay(GDP, plot_type='partial')
lambda <- global_economy |>
filter(Country == "Turkey") |>
features(GDP, features = guerrero) |>
pull(lambda_guerrero)
global_economy |>
filter(Country == "Turkey") |>
features(box_cox(GDP, lambda), unitroot_ndiffs)
## # A tibble: 1 × 2
## Country ndiffs
## <fct> <int>
## 1 Turkey 1
The above code shows that the appropriate number of differencing to make the data stationary is one.
aus_accommodation |>
filter(State == "Tasmania") |>
autoplot(Takings)
lambda <- aus_accommodation |>
filter(State == "Tasmania") |>
features(Takings, features = guerrero) |>
pull(lambda_guerrero)
aus_accommodation |>
filter(State == "Tasmania") |>
features(box_cox(Takings, lambda), unitroot_ndiffs)
## # A tibble: 1 × 2
## State ndiffs
## <chr> <int>
## 1 Tasmania 1
The data needs to be differenced once to become stationary.
souvenirs |>
autoplot()
## Plot variable not specified, automatically selected `.vars = Sales`
lambda <- souvenirs |>
features(Sales, features = guerrero) |>
pull(lambda_guerrero)
souvenirs |>
features(box_cox(Sales, lambda), unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
The amount of times this time series needs to be differenced is once.
For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.
set.seed(12345678)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries |>
autoplot()
## Plot variable not specified, automatically selected `.vars = Turnover`
The series has trend and seasonality and in order to use an ARIMA model.
Because of the variance in the data I will apply the BoxCox
transformation.
myseries |>
features(Turnover, unitroot_ndiffs)
## # A tibble: 1 × 3
## State Industry ndiffs
## <chr> <chr> <int>
## 1 Northern Territory Clothing, footwear and personal accessory retailing 1
# transformed plot
myseries %>%
gg_tsdisplay(difference(Turnover, 12), plot_type='partial', lag = 36)
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
lambda <- myseries |>
features(Turnover, features = guerrero) |>
pull(lambda_guerrero)
myseries |>
features(box_cox(Turnover, lambda), unitroot_ndiffs)
## # A tibble: 1 × 3
## State Industry ndiffs
## <chr> <chr> <int>
## 1 Northern Territory Clothing, footwear and personal accessory retailing 1
# Box Cox
myseries %>%
gg_tsdisplay(difference(box_cox(Turnover,lambda), 12), plot_type='partial', lag = 36)
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
The box cox transformation shows a more stationary time series after a singe difference.
Simulate and plot some data from simple ARIMA models.
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
sim
## # A tsibble: 100 x 2 [1]
## idx y
## <int> <dbl>
## 1 1 0
## 2 2 -2.38
## 3 3 -1.68
## 4 4 -0.118
## 5 5 -0.823
## 6 6 -1.25
## 7 7 0.234
## 8 8 1.24
## 9 9 3.30
## 10 10 1.45
## # ℹ 90 more rows
sim |>
autoplot()
## Plot variable not specified, automatically selected `.vars = y`
for(i in 2:100)
y[i] <- 0.9*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
sim |>
autoplot()
## Plot variable not specified, automatically selected `.vars = y`
for(i in 2:100)
y[i] <- 0.2*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
sim |>
autoplot()
## Plot variable not specified, automatically selected `.vars = y`
The smaller the phi value the period of the time series (The difference
between the peaks of the waves) decreases, as the phi value increases,
the period also increases.
Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.
fit <- aus_airpassengers |>
filter(Year < 2012) |>
model(ARIMA(Passengers))
report(fit)
## Series: Passengers
## Model: ARIMA(0,2,1)
##
## Coefficients:
## ma1
## -0.8756
## s.e. 0.0722
##
## sigma^2 estimated as 4.671: log likelihood=-87.8
## AIC=179.61 AICc=179.93 BIC=182.99
The model chosen is ARIMA(0,2,1)
fit |>
forecast(h=10) |>
autoplot(aus_airpassengers)
fit |>
gg_tsresiduals()
The residuals show a normal distribution and the ACF plot shows lags
that are within the blue lines.
\(y_t = -0.8756e_{t-1} +e_t\)
\((1-B)^dy_t = (1- 0.8756B)e_t\)
fit1 <-aus_airpassengers %>%
filter(Year < 2012) %>%
model(ARIMA(Passengers ~ pdq(0,1,0)))
fit1 %>%
forecast(h=10) %>%
autoplot(aus_airpassengers)
fit1 %>%
gg_tsresiduals() +
labs(title = "Residuals for ARIMA(0,1,0)")
The residuals are more normally distributed and the ACF shows more
positively correlated values. Part A forecasts values higher than the
actual time series. The forecast in part b shows the forecast below the
actual values in the time series.
fit2 <-aus_airpassengers %>%
filter(Year < 2012) %>%
model(ARIMA(Passengers ~ 1+ pdq(2,1,2)))
fit2 %>%
forecast(h=10) %>%
autoplot(aus_airpassengers)
fit2 %>%
gg_tsresiduals() +
labs(title = "Residuals for ARIMA(2,1,2)")
When the constant is removed, a warning is produced because of the
non-stationary AR part of from the CSS.
# Remove the constant
fit3 <- aus_airpassengers %>%
filter(Year < 2012) %>%
model(ARIMA(Passengers ~ 0 + pdq(2,1,2)))
## Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2))
## [1] non-stationary AR part from CSS
report(fit3)
## Series: Passengers
## Model: NULL model
## NULL model
# Fix the warning
# https://stackoverflow.com/questions/7233288/non-stationary-seasonal-ar-part-from-css-error-in-r Referenced for the error - changed method to Maximum Likelihood.
# The coefficients should be between -1 and 1 for the process to be stationary.
fit4 <- aus_airpassengers %>%
filter(Year < 2012) %>%
model(ARIMA(Passengers ~ 0 + pdq(2,1,2), method = "ML"))
report(fit4)
## Series: Passengers
## Model: ARIMA(2,1,2)
##
## Coefficients:
## ar1 ar2 ma1 ma2
## 1.8128 -0.8154 -1.9642 0.9999
## s.e. 0.1321 0.1320 0.1295 0.1310
##
## sigma^2 estimated as 4.203: log likelihood=-88.19
## AIC=186.37 AICc=188.09 BIC=194.94
The inclusion of a constant in a non-stationary ARIMA model is equivalent to inducing a polynomial trend of order \(d\) in the forecasts. If the constant is to be ommitted the forecast include a polynomial trend \(d-1\).
fit5 <- aus_airpassengers %>%
filter(Year < 2012) %>%
model(ARIMA(Passengers ~ 0 + pdq(2,0,2)))
report(fit5)
## Series: Passengers
## Model: ARIMA(2,0,2)
##
## Coefficients:
## ar1 ar2 ma1 ma2
## 0.1590 0.8369 1.0569 0.128
## s.e. 0.2553 0.2546 0.2658 0.179
##
## sigma^2 estimated as 6.64: log likelihood=-100.21
## AIC=210.43 AICc=212.09 BIC=219.12
fit4 <- aus_airpassengers %>%
filter(Year < 2012) %>%
model(ARIMA(Passengers ~ 1 + pdq(0,2,1)))
## Warning: Model specification induces a quadratic or higher order polynomial trend.
## This is generally discouraged, consider removing the constant or reducing the number of differences.
report(fit4)
## Series: Passengers
## Model: ARIMA(0,2,1) w/ poly
##
## Coefficients:
## ma1 constant
## -1.0000 0.0721
## s.e. 0.0709 0.0260
##
## sigma^2 estimated as 4.086: log likelihood=-85.74
## AIC=177.48 AICc=178.15 BIC=182.55
A warning is produced that the model specification induces a quadratic or higher order trend. It then encourages me to remove the constant or reduce the number of differences.
For the United States GDP series (from global_economy):
global_economy |>
filter(Country == "United States") |>
autoplot()
## Plot variable not specified, automatically selected `.vars = GDP`
lambda <-global_economy |>
filter(Country == "United States") |>
features(GDP, features = guerrero) |>
pull(lambda_guerrero)
fit <- global_economy |>
filter(Country == "United States") |>
model(ARIMA(box_cox(GDP, lambda)))
report(fit)
## Series: GDP
## Model: ARIMA(1,1,0) w/ drift
## Transformation: box_cox(GDP, lambda)
##
## Coefficients:
## ar1 constant
## 0.4586 118.1822
## s.e. 0.1198 9.5047
##
## sigma^2 estimated as 5479: log likelihood=-325.32
## AIC=656.65 AICc=657.1 BIC=662.78
global_economy |>
filter(Country == "United States") |>
features(box_cox(GDP, lambda), unitroot_ndiffs)
## # A tibble: 1 × 2
## Country ndiffs
## <fct> <int>
## 1 United States 1
global_economy |>
filter(Country == "United States") |>
gg_tsdisplay(difference(GDP, 1),
plot_type='partial', lag=36) +
labs(title = "Differenced", y="")
## Warning: Removed 1 row containing missing values (`geom_line()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
Here we see in the pcf graph that only 1 of the lags is out of the significance boundary, so we can select q to be 1. In the ACF graph the first lag is out of the significant limit, so we can select p to be 1. Using the unitdiff function, we found that the optimal value for d is 1.
fit <- global_economy |>
filter(Country == "United States") |>
select(GDP, Year) |>
model(
arima011 = ARIMA(box_cox(GDP,lambda) ~ pdq(0,1,1)),
arima110 = ARIMA(box_cox(GDP,lambda) ~ pdq(1,1,0)),
arima111 = ARIMA(box_cox(GDP,lambda) ~ pdq(1,1,1)),
auto = ARIMA(box_cox(GDP,lambda), stepwise = FALSE, approx = FALSE)
)
fit |> pivot_longer(everything(), names_to = "Model name",
values_to = "Orders")
## # A mable: 4 x 2
## # Key: Model name [4]
## `Model name` Orders
## <chr> <model>
## 1 arima011 <ARIMA(0,1,1) w/ drift>
## 2 arima110 <ARIMA(1,1,0) w/ drift>
## 3 arima111 <ARIMA(1,1,1) w/ drift>
## 4 auto <ARIMA(1,1,0) w/ drift>
glance(fit) |> arrange(AICc) |> select(.model:BIC)
## # A tibble: 4 × 6
## .model sigma2 log_lik AIC AICc BIC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima110 5479. -325. 657. 657. 663.
## 2 auto 5479. -325. 657. 657. 663.
## 3 arima011 5689. -326. 659. 659. 665.
## 4 arima111 5580. -325. 659. 659. 667.
fit |> select(arima110) %>% gg_tsresiduals()
US_GDP <- global_economy |>
filter(Country == "United States") |>
select(GDP, Year)
fit |>
forecast(h=10) |>
filter(.model == 'arima110') |>
autoplot(US_GDP) +
labs(title = "Forecasted United States GDP using ARIMA(1,1,0)") +
theme(plot.title = element_text(hjust = 0.5))
US_ETS <- US_GDP |>
model(ETS(GDP))
report(US_ETS)
## Series: GDP
## Model: ETS(M,A,N)
## Smoothing parameters:
## alpha = 0.9990876
## beta = 0.5011949
##
## Initial states:
## l[0] b[0]
## 448093333334 64917355687
##
## sigma^2: 7e-04
##
## AIC AICc BIC
## 3190.787 3191.941 3201.089
The AICc for arima110 is 657.1005. The AICc for the ETS model is 3191.941, which is significantly higher.