9.1, 9.2, 9.3, 9.5, 9.6, 9.7, 9.8
library(fpp3)
## Registered S3 methods overwritten by 'ggtime':
## method from
## autolayer.fbl_ts fabletools
## autolayer.tbl_ts fabletools
## autoplot.dcmp_ts fabletools
## autoplot.fbl_ts fabletools
## autoplot.tbl_ts fabletools
## fortify.fbl_ts fabletools
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.2 ──
## ✔ tibble 3.3.0 ✔ tsibble 1.2.0
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.5.0
## ✔ lubridate 1.9.4 ✔ fable 0.5.0
## ✔ ggplot2 4.0.1
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
The difference is in the length of the autocorrelation spikes - they get progressively shorter around 0 with each plot. The other difference is in the confidence intervals boundaries (the blue dashed lines), where with each plot it gets narrower. The autocorrelations seem to shorten as the interval narrows down. All of the plots indicate that the data are white noise, as none of the spikes go outside of the boundary, meaning that all the autocorrelations are insignificantly different from zero.
This is because of the size of the data. On the left, there are only n = 36 observations, meaning that the autocorrelations will be estimated more poorly and have higher variance than a dataset with size n = 1000 (the right one). The more samples we have, the more reliable and closer to zero the estimates will be. It’s the same thing for the CI, where They also narrow with larger T because they are calculated as ±1.96/sqrt(T), so as T increases the bounds go toward zero.
amzn <- gafa_stock |>
filter(Symbol == "AMZN")
amzn |> gg_tsdisplay((Close), plot_type='partial')
## Warning: `gg_tsdisplay()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsdisplay()` instead.
## ℹ Graphics functions have been moved to the {ggtime} package. Please use
## `library(ggtime)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
## Provided data has an irregular interval, results should be treated with caution. Computing ACF by observation.
The original series plot shows that the data is non-stationary because we can observe a clear upwards trend - stationary series will NOT have a trend. This is because the mean is not dependent on time and will not increase over time, whereas in our case with non-stationary data, it does increase. The same plot shows that the series is not homoscedastic either, which is another requirement for the data to be considered stationary. For the second plot, in order for the data to be stationary, the ACF should drop to 0 very fast, but we can see that it’s decreasing towards 0 very slowly, meaning that the data is not stationary. That’s because for non-stationary data, the r1 value is very large and positive, like in this case. And for the third PACF plot, we can use it to identify the optimal number of terms to use in the AR model. The last significant spike is at lag 1, meaning that perhaps we can use an AR1 model. It needs differencing to be stationary, and all the plots indicate that first differencing should be used.
#a
turkey <- global_economy |>
filter(Country == "Turkey")
#Original series plot
turkey |> gg_tsdisplay((GDP), plot_type='partial') + labs(title = "Original Turkey GDP Series")
#Calculate lambda
lambda1 <- turkey |>
features(GDP, features = guerrero) |>
pull(lambda_guerrero)
#Unit root test
turkey |> features(box_cox(GDP, lambda1), unitroot_nsdiffs)
## # A tibble: 1 × 2
## Country nsdiffs
## <fct> <int>
## 1 Turkey 0
turkey |> features(box_cox(GDP, lambda1), unitroot_ndiffs)
## # A tibble: 1 × 2
## Country ndiffs
## <fct> <int>
## 1 Turkey 1
#Differenced plot
turkey |> gg_tsdisplay(difference(box_cox(GDP, lambda1)), plot_type= 'partial') + labs(title = "Differenced Turkey GDP Series")
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
(b)
#b
tasmania <- aus_accommodation |>
filter(State == "Tasmania")
#Original series plot
tasmania |> gg_tsdisplay((Takings), plot_type='partial') + labs(title = "Original Tasmania Takings Series")
#Calculate lambda
lambda2 <- tasmania |>
features(Takings, features = guerrero) |>
pull(lambda_guerrero)
#Unit root tests
tasmania |> features(box_cox(Takings, lambda2), unitroot_nsdiffs)
## # A tibble: 1 × 2
## State nsdiffs
## <chr> <int>
## 1 Tasmania 1
tasmania |> features(box_cox(Takings, lambda2), unitroot_ndiffs)
## # A tibble: 1 × 2
## State ndiffs
## <chr> <int>
## 1 Tasmania 1
#Seasonally differenced
tasmania |> gg_tsdisplay(difference(box_cox(Takings, lambda2), 4),
plot_type= 'partial') + labs(title = "Seasonally Differenced Tasmania Takings Series")
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
#Seasonally and first Differenced plot
tasmania |> gg_tsdisplay(difference(box_cox(Takings, lambda2), 4) |> difference(),
plot_type= 'partial') + labs(title = "Double Differenced Tasmania Takings Series")
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
The unit root tests indicated I need to do both a seasonal and first differencing for the Tasmania dataset in order to make it stationary. I did seasonal differencing first, and the first differencing, which made the data stationary.
#c
#Original series plot
souvenirs |> gg_tsdisplay((Sales), plot_type='partial') + labs(title = "Original Souvenirs Monthly Sales Series")
#Calculate lambda
lambda3 <- souvenirs |>
features(Sales, features = guerrero) |>
pull(lambda_guerrero)
#Unit root tests
souvenirs |> features(box_cox(Sales, lambda3), unitroot_nsdiffs)
## # A tibble: 1 × 1
## nsdiffs
## <int>
## 1 1
souvenirs |> features(box_cox(Sales, lambda3), unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 1
#Once Differenced plot
souvenirs |> gg_tsdisplay(difference(box_cox(Sales, lambda3), 12),
plot_type= 'partial') + labs(title = "Seasonally Differenced Souvenirs Monthly Sales Series")
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
#Twice Differenced plot
souvenirs |> gg_tsdisplay(difference(box_cox(Sales, lambda3), 12) |> difference(),
plot_type= 'partial') + labs(title = "Double Differenced Souvenirs Monthly Sales Series")
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).
#Test to confirm that it is indeed stationary now
souvenirs |>
features(difference(difference(box_cox(Sales, lambda3), 12), 1), unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0381 0.1
The unit root tests indicated that the souvenirs data needs both seasonal and first differencing. I tried differencing seasonally once to see if that made the data stationary, however, the result was borderline. I think going along with the test results and differencing twice made it stationary, and that was confirmed with a KPSS test as well.
set.seed(33)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries |> gg_tsdisplay((Turnover), plot_type='partial') + labs(title = "Original Australian Retail Turnover Series")
#Calculate lambda
lambda4 <- myseries |>
features(Turnover, features = guerrero) |>
pull(lambda_guerrero)
#Unit root tests
myseries |> features(box_cox(Turnover, lambda4), unitroot_nsdiffs)
## # A tibble: 1 × 3
## State Industry nsdiffs
## <chr> <chr> <int>
## 1 Queensland Food retailing 1
myseries |> features(box_cox(Turnover, lambda4), unitroot_ndiffs)
## # A tibble: 1 × 3
## State Industry ndiffs
## <chr> <chr> <int>
## 1 Queensland Food retailing 1
#Once Differenced plot
myseries |> gg_tsdisplay(difference(box_cox(Turnover, lambda4), 12),
plot_type= 'partial') + labs(title = "Seasonally Differenced Australian Retail Turnover Series")
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
#Twice Differenced plot
myseries |> gg_tsdisplay(difference(box_cox(Turnover, lambda4), 12) |> difference(), plot_type= 'partial') +
labs(title = "Double Differenced Australian Retail Turnover Series")
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_point()`).
#Test to confirm that it is indeed stationary now
myseries |>
features(difference(difference(box_cox(Turnover, lambda4), 12), 1), unitroot_kpss)
## # A tibble: 1 × 4
## State Industry kpss_stat kpss_pvalue
## <chr> <chr> <dbl> <dbl>
## 1 Queensland Food retailing 0.0144 0.1
Seasonal and first differencing seemed to yield a better result over just the box-coxed transform seasonal differencing, however, the spikes are still slightly outside of the CI bounds for the most part, which indicates that there is more information left may be able to be handled by the ARIMA model. The KPSS test had a p-value of 0.1, meaning that we fail to reject the null hypothesis, and that the data is actually stationary.
#a
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- 0.6*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
#b
phi_values <- c(-1, -0.6, 0, 0.6, 1)
for(phi in phi_values) {
y <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y[i] <- phi*y[i-1] + e[i]
sim <- tsibble(idx = seq_len(100), y = y, index = idx)
print(sim |> autoplot(y) + labs(title = paste("AR(1) with phi =", phi)))
}
## Warning: `autoplot.tbl_ts()` was deprecated in fabletools 0.6.0.
## ℹ Please use `ggtime::autoplot.tbl_ts()` instead.
## ℹ Graphics functions have been moved to the {ggtime} package. Please use
## `library(ggtime)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
As ϕ1 increases, the graph of the data becomes smoother, and when ϕ1 approaches 1, the series starts to resemble a random walk. As ϕ1 approaches 0, the pattern becomes closer to white noise, and when ϕ1 is negative, the pattern oscillates rapidly around the mean. The shape changes with different values.
#c
y_ma <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y_ma[i] <- 0.6*e[i-1] + e[i]
sim_ma <- tsibble(idx = seq_len(100), y = y_ma, index = idx)
#d
theta_values <- c(-1, -0.6, 0, 0.6, 1)
for(theta in theta_values) {
y_ma <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y_ma[i] <- theta*e[i-1] + e[i]
sim_ma <- tsibble(idx = seq_len(100), y = y_ma, index = idx)
print(sim_ma |> autoplot(y) + labs(title = paste("MA(1) with theta =", theta)))
}
I don’t see that changing θ1 much affects the shape of the graph, it remains flat. The only difference I can see is that positive θ1 creates slightly smoother series with wider spaced peaks, and that negative θ1 causes more oscillation, but it’s very subtle.
#e
set.seed(33)
y_arma <- numeric(100)
e <- rnorm(100)
for(i in 2:100)
y_arma[i] <- 0.6*y_arma[i-1] + 0.6*e[i-1] + e[i]
sim_arma <- tsibble(idx = seq_len(100), y = y_arma, index = idx)
#f
set.seed(33)
y_ar2 <- numeric(100)
e <- rnorm(100)
for(i in 3:100)
y_ar2[i] <- -0.8*y_ar2[i-1] + 0.3*y_ar2[i-2] + e[i]
sim_ar2 <- tsibble(idx = seq_len(100), y = y_ar2, index = idx)
#g
sim_arma |> gg_tsdisplay(y, plot_type= "partial")
sim_ar2 |> gg_tsdisplay(y, plot_type= "partial")
The ARMA(1,1) model is stationary with stable variance. The ACF plot’s rapid decay to 0 shows that the series is stationary, and it also has a sinusoidal pattern. The last significant lag in the PACF is at lag 3. The AR2 is non-stationary and starts near 0, followed by a rapid growth of the oscillations. The ACF is also exhibiting this alternating between positive and negative values, while the PACF shows nothing after lag 1.
aus <- aus_airpassengers |>
filter(Year <= 2011)
#a
fit<- aus |> model(ARIMA(Passengers))
report(fit)
## Series: Passengers
## Model: ARIMA(0,2,1)
##
## Coefficients:
## ma1
## -0.8756
## s.e. 0.0722
##
## sigma^2 estimated as 4.671: log likelihood=-87.8
## AIC=179.61 AICc=179.93 BIC=182.99
#Residuals
fit |> gg_tsresiduals()
## Warning: `gg_tsresiduals()` was deprecated in feasts 0.4.2.
## ℹ Please use `ggtime::gg_tsresiduals()` instead.
## ℹ Graphics functions have been moved to the {ggtime} package. Please use
## `library(ggtime)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#Forecast
fit |> forecast(h=10) |> autoplot(aus_airpassengers)
## Warning: `autoplot.fbl_ts()` was deprecated in fabletools 0.6.0.
## ℹ Please use `ggtime::autoplot.fbl_ts()` instead.
## ℹ Graphics functions have been moved to the {ggtime} package. Please use
## `library(ggtime)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
ARIMA(0,2,1) was picked by the ARIMA() function, the residuals look like white noise.
Substituting p = 0, d = 2, q = 1, θ1 = -0.8756, c = 0: (1 − 0)(1 − B)^2yt = 0 + (1 + (-0.8756)B^1)εt (1 − B)^2yt = (1 - 0.8756B)εt
#c
#ARIMA(0,1,0) with drift
fit2 <- aus |> model(ARIMA(Passengers ~ 1 + pdq(0,1,0)))
fit2 |> forecast(h=10) |> autoplot(aus_airpassengers)
fit2 |> gg_tsresiduals()
Compared to part (a), the forecast trend is less steep and the prediction intervals are narrower, since we have d=1 here versus d=2 in part (a). Higher d leads to wider prediction intervals that expand more rapidly. This forecast also predicted under the actual values, while the one from (a) predicted above the actual values.
#d
#ARIMA(2,1,2) with drift
fit3 <- aus |> model(ARIMA(Passengers ~ 1 + pdq(2,1,2)))
fit3 |> forecast(h=10) |> autoplot(aus_airpassengers)
fit3 |> gg_tsresiduals()
The predicted values appear closer to the actual values than the other two forecasts, and the trend is a bit steeper.
#without constant
fit4 <- aus |> model(ARIMA(Passengers ~ 0 + pdq(2,1,2)))
## Warning: 1 error encountered for ARIMA(Passengers ~ 0 + pdq(2, 1, 2))
## [1] non-stationary AR part from CSS
fit4 |> forecast(h=10) |> autoplot(aus_airpassengers)
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning: Removed 10 rows containing missing values or values outside the scale range
## (`geom_line()`).
Removing the constant causes the model to produce a non-stationary AR component, which makes it impossible to fit reliably.
#e
#ARIMA(0,2,1) with constant
fit5 <- aus |> model(ARIMA(Passengers ~ 1 + pdq(0,2,1)))
## Warning: Model specification induces a quadratic or higher order polynomial trend. This
## is generally discouraged, consider removing the constant or reducing the number
## of differences.
fit5 |> forecast(h=10) |> autoplot(aus_airpassengers)
fit5 |> gg_tsresiduals()
Adding a constant to this model induces a quadratic trend in the forecasts, which is generally discouraged as it leads to unrealistic long-term forecasts that grow explosively. We can see that the intervals are expanding here and that the slope is becoming much steeper. The ARIMA(0,2,1) model without the constant from (a) looked more realistic.
usa <- global_economy |>
filter(Country == "United States")
#a
usa |> autoplot(GDP)
#Calculate lambda
lambda5 <- usa |>
features(GDP, features = guerrero) |>
pull(lambda_guerrero)
lambda5
## [1] 0.2819443
A Box-Cox transformation will be appropriate here.
#b
#Box-Cox and Arima
fit6 <- usa |> model(ARIMA(box_cox(GDP, lambda5)))
report(fit6)
## Series: GDP
## Model: ARIMA(1,1,0) w/ drift
## Transformation: box_cox(GDP, lambda5)
##
## Coefficients:
## ar1 constant
## 0.4586 118.1822
## s.e. 0.1198 9.5047
##
## sigma^2 estimated as 5479: log likelihood=-325.32
## AIC=656.65 AICc=657.1 BIC=662.78
#c
fit_alt <- usa |>
model(
arima010 = ARIMA(box_cox(GDP, lambda5) ~ pdq(0,1,0)),
arima110 = ARIMA(box_cox(GDP, lambda5) ~ pdq(1,1,0)),
arima120 = ARIMA(box_cox(GDP, lambda5) ~ pdq(1,2,0)),
arima011 = ARIMA(box_cox(GDP, lambda5) ~ pdq(0,1,1)),
arima111 = ARIMA(box_cox(GDP, lambda5) ~ pdq(1,1,1)),
arima210 = ARIMA(box_cox(GDP, lambda5) ~ pdq(2,1,0))
)
glance(fit_alt) |> arrange(AICc)
## # A tibble: 6 × 9
## Country .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 United States arima120 6780. -326. 656. 656. 660. <cpl [1]> <cpl [0]>
## 2 United States arima110 5479. -325. 657. 657. 663. <cpl [1]> <cpl [0]>
## 3 United States arima011 5689. -326. 659. 659. 665. <cpl [0]> <cpl [1]>
## 4 United States arima111 5580. -325. 659. 659. 667. <cpl [1]> <cpl [1]>
## 5 United States arima210 5580. -325. 659. 659. 667. <cpl [2]> <cpl [0]>
## 6 United States arima010 6774. -332. 668. 668. 672. <cpl [0]> <cpl [0]>
The top-ranked model is ARIMA(1,2,0), as it has the lowest AICc so I will choose that one.
d=1, q=0, dof=1
#d
fit_best <- usa |>
model(ARIMA(box_cox(GDP, lambda5) ~ pdq(1,2,0)))
fit_best |> gg_tsresiduals()
fit_best |> augment() |>
features(.innov, ljung_box, lag=10, dof=1)
## # A tibble: 1 × 4
## Country .model lb_stat lb_pvalue
## <fct> <chr> <dbl> <dbl>
## 1 United States ARIMA(box_cox(GDP, lambda5) ~ pdq(1, 2, 0)) 10.9 0.283
The resdiuals are white noise, and the Ljung-Box test confirms it too.
#e
#ARIMA forecast
fit_best |> forecast(h=10) |> autoplot(usa)
The forecast seems relatively reasonable, the trend line appears neither overly-conservative nor extremely optimistic, though the intervals fan out significantly. Interestingly, if we plot the second-lowest AICc model, the original ARIMA(1,1,0) model, that forecast looks significantly more confident. However, a narrower interval doesn’t mean a better forecast because the ARIMA(1,2,0) has the lower AICc and its wider intervals may simply be more honest about the true uncertainty.
fit6 |> forecast(h=10) |> autoplot(usa)
# f
#ETS forecast
fit_ets <- usa |> model(ETS(GDP))
report(fit_ets)
## Series: GDP
## Model: ETS(M,A,N)
## Smoothing parameters:
## alpha = 0.9990876
## beta = 0.5011949
##
## Initial states:
## l[0] b[0]
## 448093333334 64917355687
##
## sigma^2: 7e-04
##
## AIC AICc BIC
## 3190.787 3191.941 3201.089
fit_ets |> forecast(h=10) |> autoplot(usa)
We can’t use the AICc to compare the ARIMA and ETS models, since they are in different model classes. The ARIMA forecast is slightly more confident, while the ETS one has wider prediction intervals that fan out dramatically. The initial forecasts are similar in direction, but the ETS model is more uncertain about future values. Overall, I would say that the ARIMA model produces more reasonable looking forecasts based on that.