library(tsibble)
## Warning: package 'tsibble' was built under R version 4.4.3
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.4.3
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──
## ✔ tibble 3.2.1 ✔ ggplot2 3.5.1
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.2
## ✔ lubridate 1.9.3 ✔ fable 0.5.0
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'ggplot2' was built under R version 4.4.2
## Warning: package 'tsibbledata' was built under R version 4.4.3
## Warning: package 'feasts' was built under R version 4.4.3
## Warning: package 'fabletools' was built under R version 4.4.3
## Warning: package 'fable' was built under R version 4.4.3
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ lubridate::interval() masks tsibble::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
Figure 9.32 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers.
The figures show different critical values and ACF distributions. They all indicate that there is white noise since the spikes are within the critical values which are shown with the blue dashed lines.
The critical values are at different distances since the length of the time series determines how close we are to the mean. The larger the length, the closer the critical values are to the mean. The ACF values will decrease along with critical values since the size of a random data et increases and probability of correlation decreases.
A classic example of a non-stationary series are stock prices. Plot the daily closing prices for Amazon stock (contained in gafa_stock), along with the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.
amazon_close <- gafa_stock |>
filter(Symbol == "AMZN") |>
select(Date, Close)
amazon_close |>
gg_tsdisplay(Close, plot_type = 'partial')
## Warning: `gg_tsdisplay()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsdisplay()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
## Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
The plots show that the series is non-stationary since the data shows a
trend while a stationary series would not have a trend. There is an
upward trend until 2018. The acf plot suggest that the series is
non-stationary because the values are large and have a very slow decline
while a stationary series acf would have a more rapid decline. The pacf
plot also indicates a non-stationary series since it is sinusoidal.
For the following series, find an appropriate Box-Cox transformation and order of differencing in order to obtain stationary data.
# Filter and view data
turkish_gdp <- global_economy |>
filter(Code == "TUR") |>
select(Year, GDP)
turkish_gdp |> autoplot()
## Plot variable not specified, automatically selected `.vars = GDP`
# Box plot transformation
# Lambda
lambda_turk <- turkish_gdp |>
features(GDP, features = guerrero) |>
pull(lambda_guerrero)
# Box-plot
turkish_gdp_bc <- turkish_gdp |>
mutate(bc = box_cox(GDP, lambda_turk)) |>
select(Year, bc)
print(lambda_turk)
## [1] 0.1572187
# Seasonal differencing test
turkish_gdp_bc |>
features(bc, unitroot_nsdiffs)
No seasonal differencing are noted so I will test of appropriate order of differencing next.
# test for order of differencing
turkish_gdp_bc |>
features(bc, unitroot_ndiffs)
The appropriate box-cox transformation has a lambda of 0.1572 and one order of differencing with no seasonal differencing.
# Plot transformed bc data
turkish_gdp_bc |>
mutate(diff_bc = difference(bc)) |>
autoplot(diff_bc)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
# Select and plot data
tasmania_takings <- aus_accommodation |>
filter(State == "Tasmania") |>
select(Date, Takings)
tasmania_takings |>
autoplot(Takings)
# Find lambda
lambda_tasmania <- tasmania_takings |>
features(Takings, features = guerrero) |>
pull(lambda_guerrero)
# Box-cox transformation
tasmania_takings_bc <- tasmania_takings |>
mutate(bc = box_cox(Takings, lambda_tasmania)) |>
select(Date, bc)
print(lambda_tasmania)
## [1] 0.001819643
# Seasonal differencing test
tasmania_takings_bc |>
features(bc, unitroot_nsdiffs)
Seasonal differencing is suggested
# Apply seasonal differencing and test for order of differencing
tasmania_takings_bc |>
mutate(seas_diff = difference(bc, 4)) |>
features(seas_diff, unitroot_ndiffs)
The appropriate box-cox transformation is a lambda of 0.0018 and one seasonal difference. Data is plotted below to make sure it looks stationary.
tasmania_takings_bc |>
mutate(seas_diff = difference(bc, 4)) |>
autoplot(seas_diff)
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
# Select and plot data
monthly_souvenirs <- souvenirs
monthly_souvenirs |>
autoplot(Sales)
# Find lambda
lambda_souvenirs <- monthly_souvenirs |>
features(Sales, features = guerrero) |>
pull(lambda_guerrero)
# Box-cox transformation
monthly_souvenirs_bc <- monthly_souvenirs |>
mutate(bc = box_cox(Sales, lambda_souvenirs)) |>
select(Month, bc)
print(lambda_souvenirs)
## [1] 0.002118221
# Seasonal differencing test
monthly_souvenirs_bc |>
features(bc, unitroot_nsdiffs)
Seasonal differencing is suggested
# Apply seasonal differencing and test for order of differencing
monthly_souvenirs_bc |>
mutate(seas_diff = difference(bc, 12)) |>
features(seas_diff, unitroot_ndiffs)
The appropriate box cox transformation has a lambda of 0.0021 and one seasonal difference. The data is plotted below to ensure data looks stationary.
monthly_souvenirs_bc |>
mutate(seas_diff = difference(bc, 12)) |>
autoplot(seas_diff)
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
For your retail data (from Exercise 7 in Section 2.10), find the appropriate order of differencing (after transformation if necessary) to obtain stationary data.
# code from textbook
set.seed(12345678)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
# select data and plot
myseries <- myseries |>
select(Month, Turnover)
myseries |>
autoplot(Turnover)
The data does not appear to show constant variation so a boxcox
transformation is appopriate.
# Find lambda
lambda_myseries <- myseries |>
features(Turnover, features = guerrero) |>
pull(lambda_guerrero)
# Box-cox transformation
myseries_bc <- myseries |>
mutate(bc = box_cox(Turnover, lambda_myseries)) |>
select(Month, bc)
# Plot
myseries_bc |>
autoplot(bc)
The data now appears to show constant variation.
# Test for seasonal differencing
myseries_bc |>
features(bc, unitroot_nsdiffs)
Seasonal differencing is suggested.
# Apply seasonal differencing and test for order of differencing
myseries_bc |>
mutate(seas_diff = difference(bc, 12)) |>
features(seas_diff, unitroot_ndiffs)
No additional differencing suggested. The data is plotted below to make sure data looks stationary.
myseries_bc |>
mutate(seas_diff = difference(bc, 12)) |>
autoplot(seas_diff)
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
Simulate and plot some data from simple ARIMA models.
# Code provided by textbook
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
sim |> autoplot(y)
# Time plot after changing value of phi1
y_1 <- numeric(100)
y_2 <- numeric(100)
y_3 <- numeric(100)
e <-rnorm(100)
for(i in 2:100) {
y_1[i] <- 0.6*y_1[i-1] + e[i]
y_2[i] <- 0.95*y_1[i-1] + e[i]
y_3[i] <- 0.3*y_1[i-1] + e[i]
}
sim_multi <- tsibble(idx = seq_len(100), "phi = 0.6" = y_1, "phi = 0.99" = y_2, "phi=0.3" = y_3, index = idx)
sim_long <- sim_multi |>
pivot_longer(!idx, names_to = "phi", values_to = "y")
autoplot(sim_long, y)
The plots don’t seem to be too different from each other but the higher
phi value (0.99) seems to result in more extreme peaks and lows.
y_ma <- numeric(100)
e_ma <- rnorm(100)
for(i in 2:100)
y_ma[i] <- e_ma[i] + 0.6*e_ma[i-1]
sim_ma <- tsibble(idx = seq_len(100), y_ma = y_ma, index=idx)
sim_ma |> autoplot(y_ma)
# Time plot with variety of theta_1
y_1 <- numeric(100)
y_2 <- numeric(100)
y_3 <- numeric(100)
e <- rnorm(100)
for(i in 2:100) {
y_1[i] <- e[i] + 0.6*e[i-1]
y_2[i] <- e[i] + 0.99*e[i-1]
y_3[i] <- e[i] + 0.1*e[i-1]
}
sim_ma_multi <- tsibble(
idx = seq_len(100), 'theta = 0.6' = y_1, 'theta = 0.99' = y_2, 'theta = 0.1' = y_3, index = idx)
sim_ma_long <- sim_ma_multi |>
pivot_longer(!idx, names_to = 'theta', values_to = 'y')
autoplot(sim_ma_long, y)
The theta value closer to 1 seems to have more extreme peaks and lows in
the data plot than the other values.
y_arma <- numeric(100)
e_arma <- rnorm(100)
for(i in 2:100)
y_arma[i] <- 0.6*y[i-1] + 0.6*e_arma[i-1] + e_arma[i]
sim_arma <- tsibble(idx = seq_len(100), y_arma =y_arma, index=idx)
y_ar2 <- numeric(100)
e_ar2 <- rnorm(100)
for(i in 3:100)
y_ar2[i] <- -0.8*y_ar2[i-1] + 0.3*y_ar2[i-2] + e_ar2[i]
sim_ar2 <- tsibble(idx = seq_len(100), y_ar2 =y_ar2, index=idx)
combined_data <- tsibble(
idx = seq_len(100), y_ar2 = y_ar2, y_arma = y_arma, index = idx)
combined_data <- combined_data |>
pivot_longer(!idx, names_to = 'model', values_to = 'y')
combined_data |> autoplot(y)
The AR2 model shows increasing variance over time. The ARMA1,1 model
appears very stable and actually just looks like a straight line at 0.
This is likely due to the extreme values of AR2. ARMA1,1 is plotted
below to verify.
sim_arma |> autoplot(y_arma)
Consider aus_airpassengers, the total number of passengers (in millions) from Australian air carriers for the period 1970-2011.
# plot data
autoplot(aus_airpassengers, Passengers)
Since there is no seasonal variation, a non-seasonal ARIMA would likely
be appropriate.
passengers_arima <- aus_airpassengers |>
model(ARIMA(Passengers))
report(passengers_arima)
## Series: Passengers
## Model: ARIMA(0,2,1)
##
## Coefficients:
## ma1
## -0.8963
## s.e. 0.0594
##
## sigma^2 estimated as 4.308: log likelihood=-97.02
## AIC=198.04 AICc=198.32 BIC=201.65
The (0,2,1) model was selected
aus_airpassengers |>
model(ARIMA(Passengers)) |>
gg_tsresiduals()
## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The residuals histogram is a relatively normal curve but there is an outlier on the right side. The acf plot shows all values within the critical values. The time plot of the residuals has a few spikes but the variance is mostly constant.
passengers_arima |>
forecast(h=10) |>
autoplot(aus_airpassengers)
The model in terms of backshift operator can be written as \[ (1-B)^2 y_t = (1 - 0.8963B)\epsilon_t \]
passengers_arima_c <- aus_airpassengers |>
model("pass_a"= ARIMA(Passengers ~ pdq(0,2,1)),
"pass_c" = ARIMA(Passengers ~ 1 + pdq(0,1,0)))
passengers_arima_c |> forecast(h=10) |>
autoplot(aus_airpassengers, level = NULL)
The ARIMA 0,1,0 with drift model has less steep slope than the 0,2,1
model. The 0,1,0 model seems to be a good model when looking at the
shape of the data as a whole but the 0,2,1 seems to be a better model
when considering the more recent steeper slope.
passengers_arima_d <- aus_airpassengers |>
model("pass_a" = ARIMA(Passengers ~ pdq(0,2,1)),
"pass_c" = ARIMA(Passengers~ 1 + pdq(0,1,0)),
"pass_d" = ARIMA(Passengers ~ 1 + pdq(2,1,2)))
passengers_arima_d |>
forecast(h=10) |>
autoplot(aus_airpassengers, level=NULL)
The 2,1,2 drift model seem to perform similarly to the 0,1,0 drift
model.
# Remove constant
passengers_arima_d2 <- aus_airpassengers |>
model('no_constant' = ARIMA(Passengers ~ 0 + pdq(2,1,2)))
## Warning: 1 error encountered for no_constant
## [1] non-stationary AR part from CSS
passengers_arima_d2 |>
forecast(h=10) |>
autoplot(aus_airpassengers, level= NULL)
## Warning: Removed 10 rows containing missing values or values outside the scale range
## (`geom_line()`).
The removal of the constant creates an error and unable to project a
model.
passengers_arima_e <- aus_airpassengers |>
model("pass_a" = ARIMA(Passengers ~ pdq(0,2,,1)),
"pass_e" = ARIMA(Passengers ~ 1 + pdq(0,2,1)))
## Warning: Model specification induces a quadratic or higher order polynomial trend. This
## is generally discouraged, consider removing the constant or reducing the number
## of differences.
passengers_arima_e |>
forecast(h=10) |>
autoplot(aus_airpassengers, level = NULL)
This model results in a warning that a quadratic or higher order
polynomial trend was induced and that removing the constant or reducing
the number of differences should be considered. The graph shows a very
steep projection compared to the other models.
For the United States GDP series (from global_economy):
# Select and plot data
us_gdp <- global_economy |>
filter(Code == "USA") |>
select(Year, GDP)
autoplot(us_gdp, GDP)
# Find lambda
lambda_gdp <- us_gdp |>
features(GDP, features = guerrero) |>
pull(lambda_guerrero)
print(lambda_gdp)
## [1] 0.2819443
# box cox transformation
us_gdp_bc <- us_gdp |>
mutate(bc = box_cox(GDP, lambda_gdp)) |>
select(Year, bc)
us_gdp_bc |> autoplot(bc)
A box-cox transformation with a lambda of 0.2819 is appropriate for this
data
The data does not appear to be seasonal so a non-appropriate ARIMA model is appropriate.
us_gdp_arima <- us_gdp_bc |>
model(ARIMA(bc))
report(us_gdp_arima)
## Series: bc
## Model: ARIMA(1,1,0) w/ drift
##
## Coefficients:
## ar1 constant
## 0.4586 118.1822
## s.e. 0.1198 9.5047
##
## sigma^2 estimated as 5479: log likelihood=-325.32
## AIC=656.65 AICc=657.1 BIC=662.78
I will try the Hyndman-Khandakar algorithm.
# Test order for differencing
us_gdp_bc |> features(bc, unitroot_ndiffs)
# model
us_gdp_models <- us_gdp_bc |>
model('ARIMA010' = ARIMA(bc ~ 1 + pdq(0,1,0)),
'ARIMA212' = ARIMA(bc ~ 1 + pdq(2,1,2)),
'ARIMA110' = ARIMA(bc ~ 1 + pdq(1,1,0)),
'ARIMA011' = ARIMA(bc ~ 1 + pdq(0,1,1)))
glance(us_gdp_models)
glance(us_gdp_models) |>
arrange(AICc)
The best model has the lowest AICc which happens to be the ARIMA 1,1,0 model in this case.
us_gdp_bc |>
model(ARIMA(bc ~ 1 + pdq(1,1,0))) |>
gg_tsresiduals()
The residuals show a relatively constant variance and close to normal
distribution. All acf points are also within the critical value shown by
the blue dashed line.
us_gdp_bc |>
model(ARIMA(bc ~ 1 + pdq(1,1,0))) |>
forecast(h=10) |>
autoplot(us_gdp_bc)
The forecast seems reasonable since it follows the overall slope of the
existing data.
us_gdp_bc |>
model("ARIMA" = ARIMA(bc ~ 1+ pdq(1,1,0)),
"ETS" = ETS(bc ~ error("A") + trend("A") + season("N"))) |>
forecast(h=10) |>
autoplot(us_gdp_bc, level = NULL)
The results of the forecasts seem similar but the ARIMA model projects slightly higher values than ETS.
us_gdp_bc |>
slice(-n()) |>
stretch_tsibble(.init = 10) |>
model("ARIMA" = ARIMA(bc ~ 1 + pdq(1,1,0)),
"ETS" =ETS(bc ~ error("A") + trend("A") + season("N"))) |>
forecast(h=1) |>
accuracy(us_gdp_bc) |>
select(.model, RMSE:MAPE)
The ARIMA model has lower RMSE, MAE, and MAPE but higher MPE. Since models with lower RMSE are seen as better, ARIMA is the better model in this case.