package setting

chapter 7

question 1

Half-hourly electricity demand for Victoria, Australia is contained in vic_elec. Extract the January 2014 electricity demand, and aggregate this data to daily with daily total demands and maximum temperatures.

question 1.a

plotting

jan_vic_elec <- vic_elec |>
    filter(yearmonth(Time) == yearmonth("2014 Jan")) |>
    index_by(Date = as_date(Time)) |>
    summarise(Demand = sum(Demand), Temperature = max(Temperature))

jan_vic_elec |>
    pivot_longer(2:3, names_to="key", values_to="value")|>
    autoplot(.vars = value) +
    facet_grid(vars(key), scales = "free_y")

regression

jan_vic_elec |>
  ggplot(aes(x = Temperature, y = Demand)) +
  geom_point()

fit <- jan_vic_elec |>
  model(TSLM(Demand ~ Temperature))
fit |> report()
## Series: Demand 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -49978.2 -10218.9   -121.3  18533.2  35440.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59083.9    17424.8   3.391  0.00203 ** 
## Temperature   6154.3      601.3  10.235 3.89e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared: 0.7832,  Adjusted R-squared: 0.7757
## F-statistic: 104.7 on 1 and 29 DF, p-value: 3.8897e-11

reporting

It is clear that a positive relationship exists for this data. It is largely driven by days with high temperatures, (there are 3 asterisks), which is resulting in more electricity being demanded (presumably to keep things cool).

question 1.b

Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?

plotting

fit |> gg_tsresiduals()

explanation

The residuals from this section of the data suggest that the model is lacking a trend (although looking at a larger window of the data, this is clearly not a real pattern). Some large variability suggests that there are some outliers in this data.

question 1.c

Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15∘C and compare it with the forecast if the with maximum temperature was 35∘C. Do you believe these forecasts?

forecasting

next_day <- scenarios(
  `Cold` = new_data(jan_vic_elec, 1) |> mutate(Temperature = 15),
  `Hot` =  new_data(jan_vic_elec, 1) |> mutate(Temperature = 35)
)
fc <- fit |> forecast(new_data = next_day)
autoplot(jan_vic_elec, Demand) +  autolayer(fc)

explanation

The forecasts seem reasonable. However we should be aware that there is not much data to support the forecasts at these temperature extremes, especially in that no daily maximum below 20∘C is observed during January (a summer month in Victoria).

question 1.d

Give prediction intervals for your forecasts.

prediction

fc |>  hilo() |>  select(-.model) 
## # A tsibble: 2 x 7 [1D]
## # Key:       .scenario [2]
##   .scenario Date                   Demand   .mean Temperature
##   <chr>     <date>                 <dist>   <dbl>       <dbl>
## 1 Cold      2014-02-01 N(151398, 6.8e+08) 151398.          15
## 2 Hot       2014-02-01 N(274484, 6.4e+08) 274484.          35
## # ℹ 2 more variables: `80%` <hilo>, `95%` <hilo>

that’s weird, how can I get my interval out? no solution for me now I find a solution from: https://stackoverflow.com/questions/54894435/fable-from-distribution-to-confidence-interval, I need to unpack …

but crazy, I cannot unpack two together, need to be done one by one

fc %>% 
  hilo(level = c(80, 95)) %>% 
  unpack_hilo("80%") %>% select("80%_lower", "80%_upper")
## # A tsibble: 2 x 5 [?]
## # Key:       .scenario, .model [2]
##   `80%_lower` `80%_upper` Date       .scenario .model                    
##         <dbl>       <dbl> <date>     <chr>     <chr>                     
## 1     117908.     184889. 2014-02-01 Cold      TSLM(Demand ~ Temperature)
## 2     242088.     306880. 2014-02-01 Hot       TSLM(Demand ~ Temperature)
fc %>% 
  hilo(level = c(80, 95)) %>% 
  unpack_hilo("95%") %>% select("95%_lower", "95%_upper")
## # A tsibble: 2 x 5 [?]
## # Key:       .scenario, .model [2]
##   `95%_lower` `95%_upper` Date       .scenario .model                    
##         <dbl>       <dbl> <date>     <chr>     <chr>                     
## 1     100179.     202617. 2014-02-01 Cold      TSLM(Demand ~ Temperature)
## 2     224939.     324029. 2014-02-01 Hot       TSLM(Demand ~ Temperature)

probably Rob has a solution to this stupid error

question 1.e

Plot Demand vs Temperature for all of the available data in vic_elec aggregated to daily total demand and maximum temperature. What does this say about your model?

plotting

vic_elec |> index_by(Date = as_date(Time)) |>
    summarise(Demand = sum(Demand), Temperature = max(Temperature)) |>
    ggplot(aes(x = Temperature, y = Demand)) +
    geom_point()

explanation

The “U” shaped scatterplot suggests that there is a non-linear relationship between electricity demand and daily maximum temperature. The model assumption that electricity demand is linearly predicted by temperature is incorrect.

question 4

The data set souvenirs concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

question 4.a

Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.

plotting

souvenirs |> autoplot(Sales)

explanation

Features of the data:

  • Seasonal data – similar scaled pattern repeats every year
  • A spike every March (except for year 1987) is the influence of the surfing festival
  • The size of the pattern increases proportionally to the level of sales

question 4.b

Explain why it is necessary to take logarithms of these data before fitting a model.

plotting

souvenirs |> autoplot(log(Sales))

explanation

Killing the heteroscedasticity in my statistical nerd’s view. After taking the logs, the trend looks more linear and the seasonal variation is roughly constant.

question 4.c

Fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.

plotting

fit <- souvenirs |>
    mutate(festival = month(Month) == 3 & year(Month) != 1987) |>
    model(reg = TSLM(log(Sales) ~ trend() + season() + festival))
souvenirs |>
    autoplot(Sales, col = "black") +
    geom_line(data = augment(fit), aes(y = .fitted), col = "blue")

question 4.d

Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?

plotting against the time

fit |> gg_tsresiduals()

The residuals are serially correlated. they are out of the critical line. This is both visible from the time plot but also from the ACF. The residuals reveal non linearity in the trend.

plotting against the fitted value

augment(fit) |>
  ggplot(aes(x = .fitted, y = .innov)) +
  geom_point() +
  scale_x_log10()

explanation

This is far more better now. The plot of residuals against fitted values looks fine - no notable patterns emerge. We take logarithms of fitted values because we took logs in the model.

question 4.e

Do boxplots of the residuals for each month. Does this reveal any problems with the model?

plotting

augment(fit) |>
  mutate(month = lubridate::month(Month, label = TRUE)) |>
  ggplot(aes(x = month, y = .innov)) +
  geom_boxplot()

using the lubridate::month !!! and this is crazily heteroscedasticity #### explanation The boxplots show differences in variation across the months revealing some potential heteroscedasticity.

question 4.f

What do the values of the coefficients tell you about each variable?

regression

tidy(fit) |> mutate(pceffect = (exp(estimate) - 1) * 100)
## # A tibble: 14 × 7
##    .model term           estimate std.error statistic  p.value  pceffect
##    <chr>  <chr>             <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 reg    (Intercept)      7.62    0.0742      103.   4.67e-78 203688.  
##  2 reg    trend()          0.0220  0.000827     26.6  2.32e-38      2.23
##  3 reg    season()year2    0.251   0.0957        2.63 1.06e- 2     28.6 
##  4 reg    season()year3    0.266   0.193         1.38 1.73e- 1     30.5 
##  5 reg    season()year4    0.384   0.0957        4.01 1.48e- 4     46.8 
##  6 reg    season()year5    0.409   0.0957        4.28 5.88e- 5     50.6 
##  7 reg    season()year6    0.449   0.0958        4.69 1.33e- 5     56.6 
##  8 reg    season()year7    0.610   0.0958        6.37 1.71e- 8     84.1 
##  9 reg    season()year8    0.588   0.0959        6.13 4.53e- 8     80.0 
## 10 reg    season()year9    0.669   0.0959        6.98 1.36e- 9     95.3 
## 11 reg    season()year10   0.747   0.0960        7.79 4.48e-11    111.  
## 12 reg    season()year11   1.21    0.0960       12.6  1.29e-19    234.  
## 13 reg    season()year12   1.96    0.0961       20.4  3.39e-31    612.  
## 14 reg    festivalTRUE     0.502   0.196         2.55 1.29e- 2     65.1

explanation

  1. (Intercept) is not interpretable. just think it as an affine transformation
  2. trend coefficient shows that with every month sales increases on average by 2.2%.
  3. season2 coefficient shows that February sales are greater than January on average by 28.6%, after allowing for the trend.
  4. season12 coefficient shows that December sales are greater than January on average by 611.5%, after allowing for the trend.
  5. festivalTRUE coefficient shows that for months that include the surfing festival, sales increases on average by 65.1% compared to months without the festival, after allowing for the trend and seasonality.

question 4.g

What does the Ljung-Box test tell you about your model?

explanation

augment(fit) |>
  features(.innov, ljung_box, lag = 24)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 reg       112.  2.15e-13

The serial correlation in the residuals is significant.

question 4.h

Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.

plotting

future_souvenirs <- new_data(souvenirs, n = 36) |>
  mutate(festival = month(Month) == 3)
fit |>
  forecast(new_data = future_souvenirs) |>
  autoplot(souvenirs)

question 4.i

How could you improve these predictions by modifying the model?

explanation

The model can be improved by taking into account 1. nonlinearity of the trend. 2. dealing with the heteroscedasticity

question 5

The us_gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “thousand barrels per day”. Consider only the data to the end of 2004.

5.a

Fit a harmonic regression with trend to the data. Experiment with changing the number Fourier terms. Plot the observed gasoline and fitted values and comment on what you see.

plotting

gas <- us_gasoline |> filter(year(Week) <= 2004)
gas |> autoplot(Barrels)

5.b

Select the appropriate number of Fourier terms to include by minimising the AICc or CV value.

adding fourier terms

fit <- gas |>
  model(
    fourier1 = TSLM(Barrels ~ trend() + fourier(K = 1)),
    fourier2 = TSLM(Barrels ~ trend() + fourier(K = 2)),
    fourier3 = TSLM(Barrels ~ trend() + fourier(K = 3)),
    fourier4 = TSLM(Barrels ~ trend() + fourier(K = 4)),
    fourier5 = TSLM(Barrels ~ trend() + fourier(K = 5)),
    fourier6 = TSLM(Barrels ~ trend() + fourier(K = 6)),
    fourier7 = TSLM(Barrels ~ trend() + fourier(K = 7)),
    fourier8 = TSLM(Barrels ~ trend() + fourier(K = 8)),
    fourier9 = TSLM(Barrels ~ trend() + fourier(K = 9)),
    fourier10 = TSLM(Barrels ~ trend() + fourier(K = 10)),
    fourier11 = TSLM(Barrels ~ trend() + fourier(K = 11)),
    fourier12 = TSLM(Barrels ~ trend() + fourier(K = 12)),
    fourier13 = TSLM(Barrels ~ trend() + fourier(K = 13)),
    fourier14 = TSLM(Barrels ~ trend() + fourier(K = 14)),
    fourier15 = TSLM(Barrels ~ trend() + fourier(K = 15))
  )
glance(fit) |>
  arrange(AICc) |>
  select(.model, AICc, CV)
## # A tibble: 15 × 3
##    .model      AICc     CV
##    <chr>      <dbl>  <dbl>
##  1 fourier7  -1887. 0.0740
##  2 fourier11 -1885. 0.0742
##  3 fourier8  -1885. 0.0743
##  4 fourier12 -1884. 0.0742
##  5 fourier6  -1884. 0.0744
##  6 fourier9  -1882. 0.0745
##  7 fourier10 -1881. 0.0746
##  8 fourier13 -1881. 0.0745
##  9 fourier14 -1879. 0.0748
## 10 fourier15 -1876. 0.0750
## 11 fourier5  -1862. 0.0766
## 12 fourier4  -1856. 0.0773
## 13 fourier3  -1853. 0.0777
## 14 fourier2  -1814. 0.0819
## 15 fourier1  -1809. 0.0825

Best model has order 7, with the lowest AICc

gas |>
  autoplot(Barrels, col = "gray") +
  geom_line(
    data = augment(fit) |> filter(.model == "fourier7"),
    aes(y = .fitted), col = "blue"
  )

explanation

fourier transformation is so strong!!! I’m going to find the mathematical proof for this! this will be some part in my thesis. This seems to have captured the annual seasonality quite well.

5.c

Check the residuals of the final model using the gg_tsresiduals() function. Use a Ljung-Box test to check for residual autocorrelation.

plotting

fit |>
  select(fourier7) |>
  gg_tsresiduals()

explanation

There is a small remaining autocorrelation at lag 1 and 20. then we are doing some Ljung-Box tests to ckeck how strong the autocorrelation is.

Ljung-Box test

augment(fit) |>
  filter(.model == "fourier7") |>
  features(.innov, ljung_box, lag = 52)
## # A tibble: 1 × 3
##   .model   lb_stat lb_pvalue
##   <chr>      <dbl>     <dbl>
## 1 fourier7    56.9     0.299

explanation

can not reject the fact that is significant with type I error. The Ljung-Box test is not significant. Even if the Ljung-Box test had given a significant result here, the correlations are very small, so it would not make much difference to the forecasts and prediction intervals.

5.c

Generate forecasts for the next year of data. Plot these along with the actual data for 2005. Comment on the forecasts.

fitting

fit |>
  select("fourier7") |>
  forecast(h = "1 year") |>
  autoplot(filter_index(us_gasoline, "2000" ~ "2006"))

explanation

The forecasts fit pretty well for the first six months, but not so well after that.

question 6

The annual population of Afghanistan is available in the global_economy data set.

6.a

Plot the data and comment on its features. Can you observe the effect of the Soviet-Afghan war? #### plotting

global_economy |>
  filter(Country == "Afghanistan") |>
  autoplot(Population / 1e6) +
  labs(y = "Population (millions)") +
  geom_ribbon(aes(xmin = 1979.98, xmax = 1989.13), fill = "blue", alpha = 0.4) +
  annotate("text", x = 1984.5, y = 10, label = "Soviet-Afghan war", col = "red", size = 6)

explanation

The population increases slowly from 1960 to 1980, then decreases during the Soviet-Afghan war (24 Dec 1979 – 15 Feb 1989), and has increased relatively rapidly since then. The last 30 years has shown an almost linear increase in population.

6.b

Fit a linear trend model and compare this to a piecewise linear trend model with knots at 1980 and 1989. #### plotting

fit <- global_economy |>
  filter(Country == "Afghanistan") |>
  model(
    linear = TSLM(Population ~ trend()),
    piecewise = TSLM(Population ~ trend(knots = c(1980, 1989)))
  )
augment(fit) |>
  autoplot(.fitted) +
  geom_line(aes(y = Population), colour = "black")

The fitted values show that the piecewise linear model has tracked the data very closely, while the linear model is inaccurate.

actually, I’m not quite understand what the piecewise linear model is… dive deep in the future

6.c

Generate forecasts from these two models for the five years after the end of the data, and comment on the results.

plotting

fc <- fit |> forecast(h = "5 years")
autoplot(fc) +
  autolayer(filter(global_economy |> filter(Country == "Afghanistan")), Population)

explanation

The linear model is clearly incorrect with prediction intervals too wide, and the point forecasts too low.

The piecewise linear model looks good, but the prediction intervals are probably too narrow. This model assumes that the trend since the last knot will continue unchanged, which is unrealistic.

chapter8

question 6

Forecast the Chinese GDP from the global_economy data set using an ETS model. Experiment with the various options in the ETS() function to see how much the forecasts change with damped trend, or with a Box-Cox transformation. Try to develop an intuition of what each is doing to the forecasts.

[Hint: use a relatively large value of h when forecasting, so you can clearly see the differences between the various options when plotting the forecasts.]

*China GDP ^_^*

plotting

china <- global_economy |>
  filter(Country == "China")
china |> autoplot(GDP)

It clearly needs a relatively strong transformation due to the increasing variance.

china |> autoplot(box_cox(GDP, 0.2))

china |> features(GDP, guerrero)
## # A tibble: 1 × 2
##   Country lambda_guerrero
##   <fct>             <dbl>
## 1 China           -0.0345

The Guerrero method suggests an even stronger transformation. Let’s also try a log.

fit <- china |>
  model(
    ets = ETS(GDP),
    ets_damped = ETS(GDP ~ trend("Ad")),
    ets_bc = ETS(box_cox(GDP, 0.2)),
    ets_log = ETS(log(GDP))
  )

fit
## # A mable: 1 x 5
## # Key:     Country [1]
##   Country          ets    ets_damped       ets_bc      ets_log
##   <fct>        <model>       <model>      <model>      <model>
## 1 China   <ETS(M,A,N)> <ETS(M,Ad,N)> <ETS(A,A,N)> <ETS(A,A,N)>

the box-cox method using the lambda = 0.2 here

augment(fit)
## # A tsibble: 232 x 7 [1Y]
## # Key:       Country, .model [4]
##    Country .model  Year          GDP      .fitted        .resid   .innov
##    <fct>   <chr>  <dbl>        <dbl>        <dbl>         <dbl>    <dbl>
##  1 China   ets     1960 59716467625. 49001691297.  10714776328.  0.219  
##  2 China   ets     1961 50056868958. 66346643194. -16289774236. -0.246  
##  3 China   ets     1962 47209359006. 51607368186.  -4398009180. -0.0852 
##  4 China   ets     1963 50706799903. 47386494407.   3320305495.  0.0701 
##  5 China   ets     1964 59708343489. 51919091574.   7789251914.  0.150  
##  6 China   ets     1965 70436266147. 63350421234.   7085844913.  0.112  
##  7 China   ets     1966 76720285970. 76289186599.    431099371.  0.00565
##  8 China   ets     1967 72881631327. 82708375812.  -9826744486. -0.119  
##  9 China   ets     1968 70846535056. 75804820984.  -4958285928. -0.0654 
## 10 China   ets     1969 79705906247. 72222259470.   7483646777.  0.104  
## # ℹ 222 more rows

comparaing the prediction

fit |>
  forecast(h = "20 years") |>
  autoplot(china, level = NULL)

china |> autoplot(GDP)

the log transformation is crazy… better than the box-cox with lambda 0.2 The transformations have a big effect, with small lambda values creating big increases in the forecasts. The damping has relatively a small effect.

question 7

Find an ETS model for the Gas data from aus_production and forecast the next few years. Why is multiplicative seasonality necessary here? Experiment with making the trend damped. Does it improve the forecasts?

plotting

aus_production |> autoplot(Gas)

Heteroscedasticity There is a huge increase in variance as the series increases in level. => That makes it necessary to use multiplicative seasonality.

regression

fit <- aus_production |>
  model(
    hw = ETS(Gas ~ error("M") + trend("A") + season("M")),
    hwdamped = ETS(Gas ~ error("M") + trend("Ad") + season("M")),
  )

fit |> glance()
## # A tibble: 2 × 9
##   .model    sigma2 log_lik   AIC  AICc   BIC   MSE  AMSE    MAE
##   <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 hw       0.00324   -831. 1681. 1682. 1711.  21.1  32.2 0.0413
## 2 hwdamped 0.00329   -832. 1684. 1685. 1718.  21.1  32.0 0.0417

The non-damped model seems to be doing slightly better here, probably because the trend is very strong over most of the historical data.

fit |>
  select(hw) |>
  gg_tsresiduals()

lag 2, 8, 19 is crazily big!!!

extracting the information

fit |> tidy()
## # A tibble: 19 × 3
##    .model   term  estimate
##    <chr>    <chr>    <dbl>
##  1 hw       alpha   0.653 
##  2 hw       beta    0.144 
##  3 hw       gamma   0.0978
##  4 hw       l[0]    5.95  
##  5 hw       b[0]    0.0706
##  6 hw       s[0]    0.931 
##  7 hw       s[-1]   1.18  
##  8 hw       s[-2]   1.07  
##  9 hw       s[-3]   0.816 
## 10 hwdamped alpha   0.649 
## 11 hwdamped beta    0.155 
## 12 hwdamped gamma   0.0937
## 13 hwdamped phi     0.980 
## 14 hwdamped l[0]    5.86  
## 15 hwdamped b[0]    0.0994
## 16 hwdamped s[0]    0.928 
## 17 hwdamped s[-1]   1.18  
## 18 hwdamped s[-2]   1.08  
## 19 hwdamped s[-3]   0.817
fit |>
  augment() |>
  filter(.model == "hw") |>
  features(.innov, ljung_box, lag = 24)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 hw        57.1  0.000161

There is still some small correlations left in the residuals, showing the model has not fully captured the available information. There also appears to be some heteroskedasticity in the residuals with larger variance in the first half the series.

fit |>
  forecast(h = 36) |>
  filter(.model == "hw") |>
  autoplot(aus_production)

While the point forecasts look ok, the intervals are excessively wide. how to deal with the excessively big interval???? any tip from Rob????? maybe deal with the lag!

Chpater 9

question 10

Choose a series from us_employment, the total employment in different industries in the United States.

10.a

Produce an STL decomposition of the data and describe the trend and seasonality.

plotting

leisure <- us_employment |>
  filter(Title == "Leisure and Hospitality")
leisure |>
  autoplot(Employed)

The sudden change in the seasonal pattern is probably due to some change in the definition of who is counted in this group. So our STL decomposition will need to have a small seasonal window to handle that. In addition, the variation changes a little as the level increases, so we will also use a square root transformation. not log transformation here for small increasing in variation

STL Decomp

leisure |>
  model(STL(sqrt(Employed) ~ season(window=7))) |>
  components() |>
  autoplot()

  1. With such a long series, it is not surprising to see the seasonality change a lot over time.
  2. The seasonal pattern changed in the 1990s to what is is now.
  3. The period of change was rapid, and the seasonal component hasn’t fully captured the change, leading to some seasonality ending up in the remainder series.
  4. The trend is increasing.

10.b

Do the data need transforming? If so, find a suitable transformation.

Yes. A square root did ok – the remainder series is relatively homoscedastic. No transformation or log transformations led to the remainder series appearing to be heteroscedastic.

guerrrero test

leisure |> features(Employed, guerrero)
## # A tibble: 1 × 2
##   Series_ID     lambda_guerrero
##   <chr>                   <dbl>
## 1 CEU7000000001          -0.216

The automatically selected transformation is close to logs. BUT NOT LOG My preference is for something a little larger. I think the automatic procedure is confusing the changing seasonality with the increasing variance.

10.c

Are the data stationary? If not, find an appropriate differencing which yields stationary data.

DID

leisure |>
  autoplot(sqrt(Employed) |> difference(lag=12) |> difference())
## Warning: Removed 13 rows containing missing values (`geom_line()`).

10.d

Identify a couple of ARIMA models that might be useful in describing the time series. Which of your models is the best according to their AICc values?

plotting

leisure |>
  gg_tsdisplay(sqrt(Employed) |> difference(lag=12) |> difference(), plot_type="partial")
## Warning: Removed 13 rows containing missing values (`geom_line()`).
## Warning: Removed 13 rows containing missing values (`geom_point()`).

This suggests that an ARIMA(2,1,0)(0,1,1) would be a good start. An alternative would be an ARIMA(0,1,2)(0,1,1).

comparaing 2 ARIMA models

fit <- leisure |>
  model(
    arima210011 = ARIMA(sqrt(Employed) ~ pdq(2,1,0) + PDQ(0,1,1)),
    arima012011 = ARIMA(sqrt(Employed) ~ pdq(0,1,2) + PDQ(0,1,1))
  )
glance(fit)
## # A tibble: 2 × 9
##   Series_ID     .model      sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots
##   <chr>         <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>  
## 1 CEU7000000001 arima210011 0.0380    207. -406. -406. -386. <cpl [2]> <cpl>   
## 2 CEU7000000001 arima012011 0.0381    206. -404. -404. -384. <cpl [0]> <cpl>

The ARIMA(2,1,0)(0,1,1) model is better.

10.e

Estimate the parameters of your best model and do diagnostic testing on the residuals. Do the residuals resemble white noise? If not, try to find another ARIMA model which fits better.

ggresiduals

fit |>
  select(arima210011) |>
  gg_tsresiduals()

The tails of the residual distribution are too long, and there is significant autocorrelation at lag 11, as well as some smaller significant spikes elsewhere.

fit <- leisure |>
  model(
    arima210011 = ARIMA(sqrt(Employed) ~ pdq(2,1,0) + PDQ(0,1,1)),
    arima012011 = ARIMA(sqrt(Employed) ~ pdq(0,1,2) + PDQ(0,1,1)),
    auto = ARIMA(sqrt(Employed))
  )
glance(fit)
## # A tibble: 3 × 9
##   Series_ID     .model      sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots
##   <chr>         <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>  
## 1 CEU7000000001 arima210011 0.0380    207. -406. -406. -386. <cpl [2]> <cpl>   
## 2 CEU7000000001 arima012011 0.0381    206. -404. -404. -384. <cpl [0]> <cpl>   
## 3 CEU7000000001 auto        0.0365    226. -440. -440. -411. <cpl [2]> <cpl>

the auto model with the lowest AICc, that is quite crazy, the analysis of ACF and PACF is meaningless

fit |>
  select(auto) |>
  report()
## Series: Employed 
## Model: ARIMA(2,1,2)(0,1,1)[12] 
## Transformation: sqrt(Employed) 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2     sma1
##       1.6261  -0.9132  -1.4773  0.7937  -0.5443
## s.e.  0.0400   0.0309   0.0535  0.0352   0.0340
## 
## sigma^2 estimated as 0.03655:  log likelihood=226.22
## AIC=-440.44   AICc=-440.35   BIC=-411.26

so this model is actually the best with AICc metric Model: ARIMA(2,1,2)(0,1,1)[12]

fit |> select(auto) |> report()
## Series: Employed 
## Model: ARIMA(2,1,2)(0,1,1)[12] 
## Transformation: sqrt(Employed) 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2     sma1
##       1.6261  -0.9132  -1.4773  0.7937  -0.5443
## s.e.  0.0400   0.0309   0.0535  0.0352   0.0340
## 
## sigma^2 estimated as 0.03655:  log likelihood=226.22
## AIC=-440.44   AICc=-440.35   BIC=-411.26
fit |>
  select(auto) |>
  gg_tsresiduals()

The residuals look better, although there is still a significant spike at lag 11.

10.f

Forecast the next 3 years of data. Get the latest figures from https://fred.stlouisfed.org/categories/11 to check the accuracy of your forecasts.

fc <- fit |>
  forecast(h = "3 years")
fc |>
  filter(.model=="auto") |>
  autoplot(us_employment |> filter(year(Month) > 2000))

downloading the data…

update <- readr::read_csv("C:/Users/guoyy/Downloads/CEU7000000001.csv") |>
  mutate(
    Month = yearmonth(DATE),
    Employed = CEU7000000001
  ) |>
  select(Month, Employed) |>
  as_tsibble(index=Month) |>
  filter(Month >= min(fc$Month))
## Rows: 1023 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (1): CEU7000000001
## date (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fc |> accuracy(update)
## # A tibble: 3 × 10
##   .model      .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>       <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima012011 Test  -2887. 3426. 2887. -22.3  22.3   NaN   NaN 0.706
## 2 arima210011 Test  -2885. 3425. 2885. -22.3  22.3   NaN   NaN 0.706
## 3 auto        Test  -2931. 3463. 2931. -22.7  22.7   NaN   NaN 0.705
fc |>
  filter(.model=="auto") |>
  autoplot(us_employment |> filter(year(Month) > 2000)) +
  geom_line(data=update, aes(x=Month, y=Employed), col='red')

I WILL NEVER BELIEVE FORECASTING ANYMORE The initial forecasts look great, but then the pandemic led to a huge impact on the employment in this industry.

10.g

Eventually, the prediction intervals are so wide that the forecasts are not particularly useful. How many years of forecasts do you think are sufficiently accurate to be usable?

Given the pandemic, 2 years… but I know the answer now lol, this is a outdated question!

question 11

Choose one of the following seasonal time series: the Australian production of electricity, cement, or gas (from aus_production).

11.a

Do the data need transforming? If so, find a suitable transformation. #### plotting

aus_production |>
  autoplot(Electricity)

this time, box-cox plays a better character

lambda <- aus_production |>
  features(Electricity, guerrero) |>
  pull(lambda_guerrero)
print(lambda)
## [1] 0.5195182
aus_production |>
  autoplot(box_cox(Electricity, lambda))

guerrero() suggests using Box-Cox transformation with parameter

11.b

Are the data stationary? If not, find an appropriate differencing which yields stationary data.

plotting with first order difference

aus_production |>
  gg_tsdisplay(box_cox(Electricity, lambda) |> difference(4), plot_type = "partial")
## Warning: Removed 4 rows containing missing values (`geom_line()`).
## Warning: Removed 4 rows containing missing values (`geom_point()`).

#### plotting with second order difference

aus_production |>
  gg_tsdisplay(box_cox(Electricity, lambda) |> difference(4) |> difference(1), plot_type = "partial")
## Warning: Removed 5 rows containing missing values (`geom_line()`).
## Warning: Removed 5 rows containing missing values (`geom_point()`).

plotting with third order difference

aus_production |>
  gg_tsdisplay(box_cox(Electricity, lambda) |> difference(4) |> difference(1) |> difference(1), plot_type = "partial")
## Warning: Removed 6 rows containing missing values (`geom_line()`).
## Warning: Removed 6 rows containing missing values (`geom_point()`).

ehhh, worse with the increasing time difference. We opt to take a first order difference as well.

11.c

Identify a couple of ARIMA models that might be useful in describing the time series. Which of your models is the best according to their AIC values?

From the above graph, an AR(1) or an MA(1) with a seasonal MA(2) might work. So an ARIMA(1,1,0)(0,1,2) model for the transformed data.

fit <- aus_production |>
  model(
    manual = ARIMA(box_cox(Electricity, lambda) ~ 0 + pdq(1, 1, 0) + PDQ(0, 1, 2)),
    auto = ARIMA(box_cox(Electricity, lambda))
  )
fit |>
  select(auto) |>
  report()
## Series: Electricity 
## Model: ARIMA(1,1,4)(0,1,1)[4] 
## Transformation: box_cox(Electricity, lambda) 
## 
## Coefficients:
##           ar1     ma1      ma2      ma3      ma4     sma1
##       -0.7030  0.2430  -0.4477  -0.1553  -0.2452  -0.5574
## s.e.   0.1739  0.1943   0.0973   0.0931   0.1189   0.1087
## 
## sigma^2 estimated as 18.02:  log likelihood=-609.08
## AIC=1232.17   AICc=1232.71   BIC=1255.7
glance(fit)
## # A tibble: 2 × 8
##   .model sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots 
##   <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>   
## 1 manual   19.7   -620. 1247. 1248. 1261. <cpl [1]> <cpl [8]>
## 2 auto     18.0   -609. 1232. 1233. 1256. <cpl [1]> <cpl [8]>

the best model: Model: ARIMA(1,1,4)(0,1,1)[4]

Automatic model selection with ARIMA() has also taken a first order difference, and so we can compare the AICc values. This is a challenging ARIMA model to select manually and the automatic model is clearly better.

11.d

Estimate the parameters of your best model and do diagnostic testing on the residuals. Do the residuals resemble white noise? If not, try to find another ARIMA model which fits better.

fit |>
  select(auto) |>
  gg_tsresiduals()

fit |> select(auto) |> report()
## Series: Electricity 
## Model: ARIMA(1,1,4)(0,1,1)[4] 
## Transformation: box_cox(Electricity, lambda) 
## 
## Coefficients:
##           ar1     ma1      ma2      ma3      ma4     sma1
##       -0.7030  0.2430  -0.4477  -0.1553  -0.2452  -0.5574
## s.e.   0.1739  0.1943   0.0973   0.0931   0.1189   0.1087
## 
## sigma^2 estimated as 18.02:  log likelihood=-609.08
## AIC=1232.17   AICc=1232.71   BIC=1255.7
tidy(fit)
## # A tibble: 9 × 6
##   .model term  estimate std.error statistic  p.value
##   <chr>  <chr>    <dbl>     <dbl>     <dbl>    <dbl>
## 1 manual ar1    -0.346     0.0648    -5.34  2.33e- 7
## 2 manual sma1   -0.731     0.0876    -8.34  9.16e-15
## 3 manual sma2    0.0820    0.0858     0.956 3.40e- 1
## 4 auto   ar1    -0.703     0.174     -4.04  7.40e- 5
## 5 auto   ma1     0.243     0.194      1.25  2.13e- 1
## 6 auto   ma2    -0.448     0.0973    -4.60  7.15e- 6
## 7 auto   ma3    -0.155     0.0931    -1.67  9.68e- 2
## 8 auto   ma4    -0.245     0.119     -2.06  4.04e- 2
## 9 auto   sma1   -0.557     0.109     -5.13  6.52e- 7

ljung_box test

fit |>
  select(auto) |>
  augment() |>
  features(.innov, ljung_box, dof = 6, lag = 12)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 auto      8.45     0.207

Residuals look reasonable, they resemble white noise.

11.e

Forecast the next 24 months of data using your preferred model.

fit |>
  select(auto) |>
  forecast(h = "2 years") |>
  autoplot(aus_production)

11.f

Compare the forecasts obtained using ETS().

plotting

aus_production |>
  model(ETS(Electricity)) |>
  forecast(h = "2 years") |>
  autoplot(aus_production)

combining the graphs together

aus_production |>
  model(ETS(Electricity)) |>
  forecast(h = "2 years") |>
  autoplot(aus_production |> filter(year(Quarter) >= 2000)) +
  autolayer(fit |> select(auto) |> forecast(h = "2 years"), colour = "pink", alpha = 0.4)

explanation

The point forecasts appear to be quite similar. The ETS forecasts have a wider forecast interval than the ARIMA forecasts. then ARIMA looks better