Half-hourly electricity demand for Victoria, Australia is contained in vic_elec. Extract the January 2014 electricity demand, and aggregate this data to daily with daily total demands and maximum temperatures.
jan_vic_elec <- vic_elec |>
filter(yearmonth(Time) == yearmonth("2014 Jan")) |>
index_by(Date = as_date(Time)) |>
summarise(Demand = sum(Demand), Temperature = max(Temperature))
jan_vic_elec |>
pivot_longer(2:3, names_to="key", values_to="value")|>
autoplot(.vars = value) +
facet_grid(vars(key), scales = "free_y")
jan_vic_elec |>
ggplot(aes(x = Temperature, y = Demand)) +
geom_point()
fit <- jan_vic_elec |>
model(TSLM(Demand ~ Temperature))
fit |> report()
## Series: Demand
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -49978.2 -10218.9 -121.3 18533.2 35440.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59083.9 17424.8 3.391 0.00203 **
## Temperature 6154.3 601.3 10.235 3.89e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared: 0.7832, Adjusted R-squared: 0.7757
## F-statistic: 104.7 on 1 and 29 DF, p-value: 3.8897e-11
It is clear that a positive relationship exists for this data. It is largely driven by days with high temperatures, (there are 3 asterisks), which is resulting in more electricity being demanded (presumably to keep things cool).
Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
fit |> gg_tsresiduals()
The residuals from this section of the data suggest that the model is lacking a trend (although looking at a larger window of the data, this is clearly not a real pattern). Some large variability suggests that there are some outliers in this data.
Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15∘C and compare it with the forecast if the with maximum temperature was 35∘C. Do you believe these forecasts?
next_day <- scenarios(
`Cold` = new_data(jan_vic_elec, 1) |> mutate(Temperature = 15),
`Hot` = new_data(jan_vic_elec, 1) |> mutate(Temperature = 35)
)
fc <- fit |> forecast(new_data = next_day)
autoplot(jan_vic_elec, Demand) + autolayer(fc)
The forecasts seem reasonable. However we should be aware that there is not much data to support the forecasts at these temperature extremes, especially in that no daily maximum below 20∘C is observed during January (a summer month in Victoria).
Give prediction intervals for your forecasts.
fc |> hilo() |> select(-.model)
## # A tsibble: 2 x 7 [1D]
## # Key: .scenario [2]
## .scenario Date Demand .mean Temperature
## <chr> <date> <dist> <dbl> <dbl>
## 1 Cold 2014-02-01 N(151398, 6.8e+08) 151398. 15
## 2 Hot 2014-02-01 N(274484, 6.4e+08) 274484. 35
## # ℹ 2 more variables: `80%` <hilo>, `95%` <hilo>
that’s weird, how can I get my interval out? no solution for me now I find a solution from: https://stackoverflow.com/questions/54894435/fable-from-distribution-to-confidence-interval, I need to unpack …
but crazy, I cannot unpack two together, need to be done one by one
fc %>%
hilo(level = c(80, 95)) %>%
unpack_hilo("80%") %>% select("80%_lower", "80%_upper")
## # A tsibble: 2 x 5 [?]
## # Key: .scenario, .model [2]
## `80%_lower` `80%_upper` Date .scenario .model
## <dbl> <dbl> <date> <chr> <chr>
## 1 117908. 184889. 2014-02-01 Cold TSLM(Demand ~ Temperature)
## 2 242088. 306880. 2014-02-01 Hot TSLM(Demand ~ Temperature)
fc %>%
hilo(level = c(80, 95)) %>%
unpack_hilo("95%") %>% select("95%_lower", "95%_upper")
## # A tsibble: 2 x 5 [?]
## # Key: .scenario, .model [2]
## `95%_lower` `95%_upper` Date .scenario .model
## <dbl> <dbl> <date> <chr> <chr>
## 1 100179. 202617. 2014-02-01 Cold TSLM(Demand ~ Temperature)
## 2 224939. 324029. 2014-02-01 Hot TSLM(Demand ~ Temperature)
probably Rob has a solution to this stupid error
Plot Demand vs Temperature for all of the available data in vic_elec aggregated to daily total demand and maximum temperature. What does this say about your model?
vic_elec |> index_by(Date = as_date(Time)) |>
summarise(Demand = sum(Demand), Temperature = max(Temperature)) |>
ggplot(aes(x = Temperature, y = Demand)) +
geom_point()
The “U” shaped scatterplot suggests that there is a non-linear relationship between electricity demand and daily maximum temperature. The model assumption that electricity demand is linearly predicted by temperature is incorrect.
The data set souvenirs concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
souvenirs |> autoplot(Sales)
Features of the data:
Explain why it is necessary to take logarithms of these data before fitting a model.
souvenirs |> autoplot(log(Sales))
Killing the heteroscedasticity in my statistical nerd’s view. After taking the logs, the trend looks more linear and the seasonal variation is roughly constant.
Fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.
fit <- souvenirs |>
mutate(festival = month(Month) == 3 & year(Month) != 1987) |>
model(reg = TSLM(log(Sales) ~ trend() + season() + festival))
souvenirs |>
autoplot(Sales, col = "black") +
geom_line(data = augment(fit), aes(y = .fitted), col = "blue")
Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
fit |> gg_tsresiduals()
The residuals are serially correlated. they are out of the critical line. This is both visible from the time plot but also from the ACF. The residuals reveal non linearity in the trend.
augment(fit) |>
ggplot(aes(x = .fitted, y = .innov)) +
geom_point() +
scale_x_log10()
This is far more better now. The plot of residuals against fitted values looks fine - no notable patterns emerge. We take logarithms of fitted values because we took logs in the model.
Do boxplots of the residuals for each month. Does this reveal any problems with the model?
augment(fit) |>
mutate(month = lubridate::month(Month, label = TRUE)) |>
ggplot(aes(x = month, y = .innov)) +
geom_boxplot()
using the lubridate::month !!! and this is crazily heteroscedasticity #### explanation The boxplots show differences in variation across the months revealing some potential heteroscedasticity.
What do the values of the coefficients tell you about each variable?
tidy(fit) |> mutate(pceffect = (exp(estimate) - 1) * 100)
## # A tibble: 14 × 7
## .model term estimate std.error statistic p.value pceffect
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 reg (Intercept) 7.62 0.0742 103. 4.67e-78 203688.
## 2 reg trend() 0.0220 0.000827 26.6 2.32e-38 2.23
## 3 reg season()year2 0.251 0.0957 2.63 1.06e- 2 28.6
## 4 reg season()year3 0.266 0.193 1.38 1.73e- 1 30.5
## 5 reg season()year4 0.384 0.0957 4.01 1.48e- 4 46.8
## 6 reg season()year5 0.409 0.0957 4.28 5.88e- 5 50.6
## 7 reg season()year6 0.449 0.0958 4.69 1.33e- 5 56.6
## 8 reg season()year7 0.610 0.0958 6.37 1.71e- 8 84.1
## 9 reg season()year8 0.588 0.0959 6.13 4.53e- 8 80.0
## 10 reg season()year9 0.669 0.0959 6.98 1.36e- 9 95.3
## 11 reg season()year10 0.747 0.0960 7.79 4.48e-11 111.
## 12 reg season()year11 1.21 0.0960 12.6 1.29e-19 234.
## 13 reg season()year12 1.96 0.0961 20.4 3.39e-31 612.
## 14 reg festivalTRUE 0.502 0.196 2.55 1.29e- 2 65.1
What does the Ljung-Box test tell you about your model?
augment(fit) |>
features(.innov, ljung_box, lag = 24)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 reg 112. 2.15e-13
The serial correlation in the residuals is significant.
Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
future_souvenirs <- new_data(souvenirs, n = 36) |>
mutate(festival = month(Month) == 3)
fit |>
forecast(new_data = future_souvenirs) |>
autoplot(souvenirs)
How could you improve these predictions by modifying the model?
The model can be improved by taking into account 1. nonlinearity of the trend. 2. dealing with the heteroscedasticity
The us_gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “thousand barrels per day”. Consider only the data to the end of 2004.
Fit a harmonic regression with trend to the data. Experiment with changing the number Fourier terms. Plot the observed gasoline and fitted values and comment on what you see.
gas <- us_gasoline |> filter(year(Week) <= 2004)
gas |> autoplot(Barrels)
Select the appropriate number of Fourier terms to include by minimising the AICc or CV value.
fit <- gas |>
model(
fourier1 = TSLM(Barrels ~ trend() + fourier(K = 1)),
fourier2 = TSLM(Barrels ~ trend() + fourier(K = 2)),
fourier3 = TSLM(Barrels ~ trend() + fourier(K = 3)),
fourier4 = TSLM(Barrels ~ trend() + fourier(K = 4)),
fourier5 = TSLM(Barrels ~ trend() + fourier(K = 5)),
fourier6 = TSLM(Barrels ~ trend() + fourier(K = 6)),
fourier7 = TSLM(Barrels ~ trend() + fourier(K = 7)),
fourier8 = TSLM(Barrels ~ trend() + fourier(K = 8)),
fourier9 = TSLM(Barrels ~ trend() + fourier(K = 9)),
fourier10 = TSLM(Barrels ~ trend() + fourier(K = 10)),
fourier11 = TSLM(Barrels ~ trend() + fourier(K = 11)),
fourier12 = TSLM(Barrels ~ trend() + fourier(K = 12)),
fourier13 = TSLM(Barrels ~ trend() + fourier(K = 13)),
fourier14 = TSLM(Barrels ~ trend() + fourier(K = 14)),
fourier15 = TSLM(Barrels ~ trend() + fourier(K = 15))
)
glance(fit) |>
arrange(AICc) |>
select(.model, AICc, CV)
## # A tibble: 15 × 3
## .model AICc CV
## <chr> <dbl> <dbl>
## 1 fourier7 -1887. 0.0740
## 2 fourier11 -1885. 0.0742
## 3 fourier8 -1885. 0.0743
## 4 fourier12 -1884. 0.0742
## 5 fourier6 -1884. 0.0744
## 6 fourier9 -1882. 0.0745
## 7 fourier10 -1881. 0.0746
## 8 fourier13 -1881. 0.0745
## 9 fourier14 -1879. 0.0748
## 10 fourier15 -1876. 0.0750
## 11 fourier5 -1862. 0.0766
## 12 fourier4 -1856. 0.0773
## 13 fourier3 -1853. 0.0777
## 14 fourier2 -1814. 0.0819
## 15 fourier1 -1809. 0.0825
Best model has order 7, with the lowest AICc
gas |>
autoplot(Barrels, col = "gray") +
geom_line(
data = augment(fit) |> filter(.model == "fourier7"),
aes(y = .fitted), col = "blue"
)
fourier transformation is so strong!!! I’m going to find the mathematical proof for this! this will be some part in my thesis. This seems to have captured the annual seasonality quite well.
Check the residuals of the final model using the gg_tsresiduals() function. Use a Ljung-Box test to check for residual autocorrelation.
fit |>
select(fourier7) |>
gg_tsresiduals()
There is a small remaining autocorrelation at lag 1 and 20. then we are doing some Ljung-Box tests to ckeck how strong the autocorrelation is.
augment(fit) |>
filter(.model == "fourier7") |>
features(.innov, ljung_box, lag = 52)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 fourier7 56.9 0.299
can not reject the fact that is significant with type I error. The Ljung-Box test is not significant. Even if the Ljung-Box test had given a significant result here, the correlations are very small, so it would not make much difference to the forecasts and prediction intervals.
Generate forecasts for the next year of data. Plot these along with the actual data for 2005. Comment on the forecasts.
fit |>
select("fourier7") |>
forecast(h = "1 year") |>
autoplot(filter_index(us_gasoline, "2000" ~ "2006"))
The forecasts fit pretty well for the first six months, but not so well after that.
The annual population of Afghanistan is available in the global_economy data set.
Plot the data and comment on its features. Can you observe the effect of the Soviet-Afghan war? #### plotting
global_economy |>
filter(Country == "Afghanistan") |>
autoplot(Population / 1e6) +
labs(y = "Population (millions)") +
geom_ribbon(aes(xmin = 1979.98, xmax = 1989.13), fill = "blue", alpha = 0.4) +
annotate("text", x = 1984.5, y = 10, label = "Soviet-Afghan war", col = "red", size = 6)
The population increases slowly from 1960 to 1980, then decreases during the Soviet-Afghan war (24 Dec 1979 – 15 Feb 1989), and has increased relatively rapidly since then. The last 30 years has shown an almost linear increase in population.
Fit a linear trend model and compare this to a piecewise linear trend model with knots at 1980 and 1989. #### plotting
fit <- global_economy |>
filter(Country == "Afghanistan") |>
model(
linear = TSLM(Population ~ trend()),
piecewise = TSLM(Population ~ trend(knots = c(1980, 1989)))
)
augment(fit) |>
autoplot(.fitted) +
geom_line(aes(y = Population), colour = "black")
The fitted values show that the piecewise linear model has tracked the data very closely, while the linear model is inaccurate.
actually, I’m not quite understand what the piecewise linear model is… dive deep in the future
Generate forecasts from these two models for the five years after the end of the data, and comment on the results.
fc <- fit |> forecast(h = "5 years")
autoplot(fc) +
autolayer(filter(global_economy |> filter(Country == "Afghanistan")), Population)
The linear model is clearly incorrect with prediction intervals too wide, and the point forecasts too low.
The piecewise linear model looks good, but the prediction intervals are probably too narrow. This model assumes that the trend since the last knot will continue unchanged, which is unrealistic.
Forecast the Chinese GDP from the global_economy data set using an ETS model. Experiment with the various options in the ETS() function to see how much the forecasts change with damped trend, or with a Box-Cox transformation. Try to develop an intuition of what each is doing to the forecasts.
[Hint: use a relatively large value of h when forecasting, so you can clearly see the differences between the various options when plotting the forecasts.]
*China GDP ^_^*
china <- global_economy |>
filter(Country == "China")
china |> autoplot(GDP)
It clearly needs a relatively strong transformation due to the increasing variance.
china |> autoplot(box_cox(GDP, 0.2))
china |> features(GDP, guerrero)
## # A tibble: 1 × 2
## Country lambda_guerrero
## <fct> <dbl>
## 1 China -0.0345
The Guerrero method suggests an even stronger transformation. Let’s also try a log.
fit <- china |>
model(
ets = ETS(GDP),
ets_damped = ETS(GDP ~ trend("Ad")),
ets_bc = ETS(box_cox(GDP, 0.2)),
ets_log = ETS(log(GDP))
)
fit
## # A mable: 1 x 5
## # Key: Country [1]
## Country ets ets_damped ets_bc ets_log
## <fct> <model> <model> <model> <model>
## 1 China <ETS(M,A,N)> <ETS(M,Ad,N)> <ETS(A,A,N)> <ETS(A,A,N)>
the box-cox method using the lambda = 0.2 here
augment(fit)
## # A tsibble: 232 x 7 [1Y]
## # Key: Country, .model [4]
## Country .model Year GDP .fitted .resid .innov
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 China ets 1960 59716467625. 49001691297. 10714776328. 0.219
## 2 China ets 1961 50056868958. 66346643194. -16289774236. -0.246
## 3 China ets 1962 47209359006. 51607368186. -4398009180. -0.0852
## 4 China ets 1963 50706799903. 47386494407. 3320305495. 0.0701
## 5 China ets 1964 59708343489. 51919091574. 7789251914. 0.150
## 6 China ets 1965 70436266147. 63350421234. 7085844913. 0.112
## 7 China ets 1966 76720285970. 76289186599. 431099371. 0.00565
## 8 China ets 1967 72881631327. 82708375812. -9826744486. -0.119
## 9 China ets 1968 70846535056. 75804820984. -4958285928. -0.0654
## 10 China ets 1969 79705906247. 72222259470. 7483646777. 0.104
## # ℹ 222 more rows
fit |>
forecast(h = "20 years") |>
autoplot(china, level = NULL)
china |> autoplot(GDP)
the log transformation is crazy… better than the box-cox with lambda 0.2 The transformations have a big effect, with small lambda values creating big increases in the forecasts. The damping has relatively a small effect.
Find an ETS model for the Gas data from aus_production and forecast the next few years. Why is multiplicative seasonality necessary here? Experiment with making the trend damped. Does it improve the forecasts?
aus_production |> autoplot(Gas)
Heteroscedasticity There is a huge increase in variance as the series increases in level. => That makes it necessary to use multiplicative seasonality.
fit <- aus_production |>
model(
hw = ETS(Gas ~ error("M") + trend("A") + season("M")),
hwdamped = ETS(Gas ~ error("M") + trend("Ad") + season("M")),
)
fit |> glance()
## # A tibble: 2 × 9
## .model sigma2 log_lik AIC AICc BIC MSE AMSE MAE
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 hw 0.00324 -831. 1681. 1682. 1711. 21.1 32.2 0.0413
## 2 hwdamped 0.00329 -832. 1684. 1685. 1718. 21.1 32.0 0.0417
The non-damped model seems to be doing slightly better here, probably because the trend is very strong over most of the historical data.
fit |>
select(hw) |>
gg_tsresiduals()
lag 2, 8, 19 is crazily big!!!
fit |> tidy()
## # A tibble: 19 × 3
## .model term estimate
## <chr> <chr> <dbl>
## 1 hw alpha 0.653
## 2 hw beta 0.144
## 3 hw gamma 0.0978
## 4 hw l[0] 5.95
## 5 hw b[0] 0.0706
## 6 hw s[0] 0.931
## 7 hw s[-1] 1.18
## 8 hw s[-2] 1.07
## 9 hw s[-3] 0.816
## 10 hwdamped alpha 0.649
## 11 hwdamped beta 0.155
## 12 hwdamped gamma 0.0937
## 13 hwdamped phi 0.980
## 14 hwdamped l[0] 5.86
## 15 hwdamped b[0] 0.0994
## 16 hwdamped s[0] 0.928
## 17 hwdamped s[-1] 1.18
## 18 hwdamped s[-2] 1.08
## 19 hwdamped s[-3] 0.817
fit |>
augment() |>
filter(.model == "hw") |>
features(.innov, ljung_box, lag = 24)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 hw 57.1 0.000161
There is still some small correlations left in the residuals, showing the model has not fully captured the available information. There also appears to be some heteroskedasticity in the residuals with larger variance in the first half the series.
fit |>
forecast(h = 36) |>
filter(.model == "hw") |>
autoplot(aus_production)
While the point forecasts look ok, the intervals are excessively wide. how to deal with the excessively big interval???? any tip from Rob????? maybe deal with the lag!
Choose a series from us_employment, the total employment in different industries in the United States.
Produce an STL decomposition of the data and describe the trend and seasonality.
leisure <- us_employment |>
filter(Title == "Leisure and Hospitality")
leisure |>
autoplot(Employed)
The sudden change in the seasonal pattern is probably due to some
change in the definition of who is counted in this group. So our STL
decomposition will need to have a small seasonal window to handle that.
In addition, the variation changes a little as the level increases, so
we will also use a square root transformation.
not log transformation here for small increasing in variation
leisure |>
model(STL(sqrt(Employed) ~ season(window=7))) |>
components() |>
autoplot()
Do the data need transforming? If so, find a suitable transformation.
Yes. A square root did ok – the remainder series is relatively homoscedastic. No transformation or log transformations led to the remainder series appearing to be heteroscedastic.
leisure |> features(Employed, guerrero)
## # A tibble: 1 × 2
## Series_ID lambda_guerrero
## <chr> <dbl>
## 1 CEU7000000001 -0.216
The automatically selected transformation is close to logs.
BUT NOT LOG My preference is for something a little larger.
I think the automatic procedure is confusing the changing seasonality
with the increasing variance.
Are the data stationary? If not, find an appropriate differencing which yields stationary data.
leisure |>
autoplot(sqrt(Employed) |> difference(lag=12) |> difference())
## Warning: Removed 13 rows containing missing values (`geom_line()`).
Identify a couple of ARIMA models that might be useful in describing the time series. Which of your models is the best according to their AICc values?
leisure |>
gg_tsdisplay(sqrt(Employed) |> difference(lag=12) |> difference(), plot_type="partial")
## Warning: Removed 13 rows containing missing values (`geom_line()`).
## Warning: Removed 13 rows containing missing values (`geom_point()`).
This suggests that an ARIMA(2,1,0)(0,1,1) would be a good start. An
alternative would be an ARIMA(0,1,2)(0,1,1).
fit <- leisure |>
model(
arima210011 = ARIMA(sqrt(Employed) ~ pdq(2,1,0) + PDQ(0,1,1)),
arima012011 = ARIMA(sqrt(Employed) ~ pdq(0,1,2) + PDQ(0,1,1))
)
glance(fit)
## # A tibble: 2 × 9
## Series_ID .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 CEU7000000001 arima210011 0.0380 207. -406. -406. -386. <cpl [2]> <cpl>
## 2 CEU7000000001 arima012011 0.0381 206. -404. -404. -384. <cpl [0]> <cpl>
The ARIMA(2,1,0)(0,1,1) model is better.
Estimate the parameters of your best model and do diagnostic testing on the residuals. Do the residuals resemble white noise? If not, try to find another ARIMA model which fits better.
fit |>
select(arima210011) |>
gg_tsresiduals()
The tails of the residual distribution are too long, and there is significant autocorrelation at lag 11, as well as some smaller significant spikes elsewhere.
fit <- leisure |>
model(
arima210011 = ARIMA(sqrt(Employed) ~ pdq(2,1,0) + PDQ(0,1,1)),
arima012011 = ARIMA(sqrt(Employed) ~ pdq(0,1,2) + PDQ(0,1,1)),
auto = ARIMA(sqrt(Employed))
)
glance(fit)
## # A tibble: 3 × 9
## Series_ID .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 CEU7000000001 arima210011 0.0380 207. -406. -406. -386. <cpl [2]> <cpl>
## 2 CEU7000000001 arima012011 0.0381 206. -404. -404. -384. <cpl [0]> <cpl>
## 3 CEU7000000001 auto 0.0365 226. -440. -440. -411. <cpl [2]> <cpl>
the auto model with the lowest AICc, that is quite crazy, the analysis of ACF and PACF is meaningless
fit |>
select(auto) |>
report()
## Series: Employed
## Model: ARIMA(2,1,2)(0,1,1)[12]
## Transformation: sqrt(Employed)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## 1.6261 -0.9132 -1.4773 0.7937 -0.5443
## s.e. 0.0400 0.0309 0.0535 0.0352 0.0340
##
## sigma^2 estimated as 0.03655: log likelihood=226.22
## AIC=-440.44 AICc=-440.35 BIC=-411.26
so this model is actually the best with AICc metric Model: ARIMA(2,1,2)(0,1,1)[12]
fit |> select(auto) |> report()
## Series: Employed
## Model: ARIMA(2,1,2)(0,1,1)[12]
## Transformation: sqrt(Employed)
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## 1.6261 -0.9132 -1.4773 0.7937 -0.5443
## s.e. 0.0400 0.0309 0.0535 0.0352 0.0340
##
## sigma^2 estimated as 0.03655: log likelihood=226.22
## AIC=-440.44 AICc=-440.35 BIC=-411.26
fit |>
select(auto) |>
gg_tsresiduals()
The residuals look better, although there is still a significant spike at lag 11.
Forecast the next 3 years of data. Get the latest figures from https://fred.stlouisfed.org/categories/11 to check the accuracy of your forecasts.
fc <- fit |>
forecast(h = "3 years")
fc |>
filter(.model=="auto") |>
autoplot(us_employment |> filter(year(Month) > 2000))
downloading the data…
update <- readr::read_csv("C:/Users/guoyy/Downloads/CEU7000000001.csv") |>
mutate(
Month = yearmonth(DATE),
Employed = CEU7000000001
) |>
select(Month, Employed) |>
as_tsibble(index=Month) |>
filter(Month >= min(fc$Month))
## Rows: 1023 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): CEU7000000001
## date (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fc |> accuracy(update)
## # A tibble: 3 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima012011 Test -2887. 3426. 2887. -22.3 22.3 NaN NaN 0.706
## 2 arima210011 Test -2885. 3425. 2885. -22.3 22.3 NaN NaN 0.706
## 3 auto Test -2931. 3463. 2931. -22.7 22.7 NaN NaN 0.705
fc |>
filter(.model=="auto") |>
autoplot(us_employment |> filter(year(Month) > 2000)) +
geom_line(data=update, aes(x=Month, y=Employed), col='red')
I WILL NEVER BELIEVE FORECASTING ANYMORE The initial
forecasts look great, but then the pandemic led to a huge impact on the
employment in this industry.
Eventually, the prediction intervals are so wide that the forecasts are not particularly useful. How many years of forecasts do you think are sufficiently accurate to be usable?
Given the pandemic, 2 years… but I know the answer now lol, this is a outdated question!
Choose one of the following seasonal time series: the Australian production of electricity, cement, or gas (from aus_production).
Do the data need transforming? If so, find a suitable transformation. #### plotting
aus_production |>
autoplot(Electricity)
this time, box-cox plays a better character
lambda <- aus_production |>
features(Electricity, guerrero) |>
pull(lambda_guerrero)
print(lambda)
## [1] 0.5195182
aus_production |>
autoplot(box_cox(Electricity, lambda))
guerrero() suggests using Box-Cox transformation with parameter
Are the data stationary? If not, find an appropriate differencing which yields stationary data.
aus_production |>
gg_tsdisplay(box_cox(Electricity, lambda) |> difference(4), plot_type = "partial")
## Warning: Removed 4 rows containing missing values (`geom_line()`).
## Warning: Removed 4 rows containing missing values (`geom_point()`).
#### plotting with second order difference
aus_production |>
gg_tsdisplay(box_cox(Electricity, lambda) |> difference(4) |> difference(1), plot_type = "partial")
## Warning: Removed 5 rows containing missing values (`geom_line()`).
## Warning: Removed 5 rows containing missing values (`geom_point()`).
aus_production |>
gg_tsdisplay(box_cox(Electricity, lambda) |> difference(4) |> difference(1) |> difference(1), plot_type = "partial")
## Warning: Removed 6 rows containing missing values (`geom_line()`).
## Warning: Removed 6 rows containing missing values (`geom_point()`).
ehhh, worse with the increasing time difference.
We opt to take a first order difference as well.
Identify a couple of ARIMA models that might be useful in describing the time series. Which of your models is the best according to their AIC values?
From the above graph, an AR(1) or an MA(1) with a seasonal MA(2) might work. So an ARIMA(1,1,0)(0,1,2) model for the transformed data.
fit <- aus_production |>
model(
manual = ARIMA(box_cox(Electricity, lambda) ~ 0 + pdq(1, 1, 0) + PDQ(0, 1, 2)),
auto = ARIMA(box_cox(Electricity, lambda))
)
fit |>
select(auto) |>
report()
## Series: Electricity
## Model: ARIMA(1,1,4)(0,1,1)[4]
## Transformation: box_cox(Electricity, lambda)
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 sma1
## -0.7030 0.2430 -0.4477 -0.1553 -0.2452 -0.5574
## s.e. 0.1739 0.1943 0.0973 0.0931 0.1189 0.1087
##
## sigma^2 estimated as 18.02: log likelihood=-609.08
## AIC=1232.17 AICc=1232.71 BIC=1255.7
glance(fit)
## # A tibble: 2 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 manual 19.7 -620. 1247. 1248. 1261. <cpl [1]> <cpl [8]>
## 2 auto 18.0 -609. 1232. 1233. 1256. <cpl [1]> <cpl [8]>
the best model: Model: ARIMA(1,1,4)(0,1,1)[4]
Automatic model selection with ARIMA() has also taken a first order difference, and so we can compare the AICc values. This is a challenging ARIMA model to select manually and the automatic model is clearly better.
Estimate the parameters of your best model and do diagnostic testing on the residuals. Do the residuals resemble white noise? If not, try to find another ARIMA model which fits better.
fit |>
select(auto) |>
gg_tsresiduals()
fit |> select(auto) |> report()
## Series: Electricity
## Model: ARIMA(1,1,4)(0,1,1)[4]
## Transformation: box_cox(Electricity, lambda)
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 sma1
## -0.7030 0.2430 -0.4477 -0.1553 -0.2452 -0.5574
## s.e. 0.1739 0.1943 0.0973 0.0931 0.1189 0.1087
##
## sigma^2 estimated as 18.02: log likelihood=-609.08
## AIC=1232.17 AICc=1232.71 BIC=1255.7
tidy(fit)
## # A tibble: 9 × 6
## .model term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 manual ar1 -0.346 0.0648 -5.34 2.33e- 7
## 2 manual sma1 -0.731 0.0876 -8.34 9.16e-15
## 3 manual sma2 0.0820 0.0858 0.956 3.40e- 1
## 4 auto ar1 -0.703 0.174 -4.04 7.40e- 5
## 5 auto ma1 0.243 0.194 1.25 2.13e- 1
## 6 auto ma2 -0.448 0.0973 -4.60 7.15e- 6
## 7 auto ma3 -0.155 0.0931 -1.67 9.68e- 2
## 8 auto ma4 -0.245 0.119 -2.06 4.04e- 2
## 9 auto sma1 -0.557 0.109 -5.13 6.52e- 7
fit |>
select(auto) |>
augment() |>
features(.innov, ljung_box, dof = 6, lag = 12)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 auto 8.45 0.207
Residuals look reasonable, they resemble white noise.
Forecast the next 24 months of data using your preferred model.
fit |>
select(auto) |>
forecast(h = "2 years") |>
autoplot(aus_production)
Compare the forecasts obtained using ETS().
aus_production |>
model(ETS(Electricity)) |>
forecast(h = "2 years") |>
autoplot(aus_production)
aus_production |>
model(ETS(Electricity)) |>
forecast(h = "2 years") |>
autoplot(aus_production |> filter(year(Quarter) >= 2000)) +
autolayer(fit |> select(auto) |> forecast(h = "2 years"), colour = "pink", alpha = 0.4)
The point forecasts appear to be quite similar. The ETS forecasts have a wider forecast interval than the ARIMA forecasts. then ARIMA looks better