Author: Farhod Ibragimov
library(fpp3)
NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is
more appropriate in each case:global_economy)aus_population <- global_economy |>
filter(Code == 'AUS') |>
select(Population)
aus_population
## # A tsibble: 58 x 2 [1Y]
## Population Year
## <dbl> <dbl>
## 1 10276477 1960
## 2 10483000 1961
## 3 10742000 1962
## 4 10950000 1963
## 5 11167000 1964
## 6 11388000 1965
## 7 11651000 1966
## 8 11799000 1967
## 9 12009000 1968
## 10 12263000 1969
## # ℹ 48 more rows
aus_population |>
autoplot(Population)+
scale_y_continuous(labels = scales::comma)+
labs(title = 'Australian population time series')
aus_population_train <- aus_population |>
filter(Year < 2000)
aus_population_fit <- aus_population_train |>
model(RW_drift = RW(Population ~ drift()))
aus_population_fc <- aus_population_fit |> forecast(h = 20)
aus_population_fc |>
autoplot(aus_population_train, level = NULL) +
autolayer(
filter_index(aus_population, '2000' ~.),
colour = 'black'
)+
scale_y_continuous(labels = scales::comma)+
guides(colour = guide_legend(title = 'Forecast'))+
labs(title = 'Forecast for Australian population from year of 2000')
## Plot variable not specified, automatically selected `.vars = Population`
This is yearly time series of Australian population. Since it has
yearly period without seasonality and steady rising trend over years the
appropriate benchmark (and probably the best) forecast model is Random
Walk with Drift RW(y ~ drift())
aus_production)aus_bricks <- aus_production |>
select(Bricks)
aus_bricks
## # A tsibble: 218 x 2 [1Q]
## Bricks Quarter
## <dbl> <qtr>
## 1 189 1956 Q1
## 2 204 1956 Q2
## 3 208 1956 Q3
## 4 197 1956 Q4
## 5 187 1957 Q1
## 6 214 1957 Q2
## 7 227 1957 Q3
## 8 222 1957 Q4
## 9 199 1958 Q1
## 10 229 1958 Q2
## # ℹ 208 more rows
aus_bricks |>
ACF() |>
autoplot()
## Response variable not specified, automatically selected `var = Bricks`
From ACF plot we can see that this time series has seasonality pattern, and because autocorrelation coefficients are positive we can say that there is a presence of the trend pattern
aus_bricks |>
autoplot(Bricks)
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
As we see the time series has increasing trend till 1980’s, and
declining after.
There is a huge drop of production around 1982-1983, possibly related to
conditions of economy.
aus_bricks_train <- aus_bricks |>
filter(year(Quarter) <2000)
aus_bricks_fit <- aus_bricks_train |>
model(Snaive = SNAIVE(Bricks ~ lag('year')),
RWdrift = RW(Bricks~drift()))
aus_bricks_fc <- aus_bricks_fit |>
forecast(h = 30)
aus_bricks_fc |>
autoplot(aus_bricks_train, level = NULL)
After applying Seasonal Naive and Random walk with the drift models, and considering the presence of seasonal pattern I can say that Seasonal Naive is better benchmark method for this series.
aus_livestock)lambs <- aus_livestock |>
filter(Animal == 'Lambs',
State == 'New South Wales',
!is.na(Count))
lambs
## # A tsibble: 558 x 4 [1M]
## # Key: Animal, State [1]
## Month Animal State Count
## <mth> <fct> <fct> <dbl>
## 1 1972 Jul Lambs New South Wales 587600
## 2 1972 Aug Lambs New South Wales 553700
## 3 1972 Sep Lambs New South Wales 494900
## 4 1972 Oct Lambs New South Wales 533500
## 5 1972 Nov Lambs New South Wales 574300
## 6 1972 Dec Lambs New South Wales 517500
## 7 1973 Jan Lambs New South Wales 562600
## 8 1973 Feb Lambs New South Wales 426900
## 9 1973 Mar Lambs New South Wales 496300
## 10 1973 Apr Lambs New South Wales 496000
## # ℹ 548 more rows
lambs |>
autoplot(Count)
lambs |>
ACF() |>
autoplot()
## Response variable not specified, automatically selected `var = Count`
In this plot i wanted to see if seasonality is present in this
series. There are some spikes at lags 1, 12, 23 – showing spikes every
11 months. That’s very unusual and not enough evidence to make a
conclusion that seasonality present in this series.
Also the trend of the series can overlap seasonality pattern to be
detected in ACF. I think after decomposition it may show the seasonal
pattern
lamb_train <- lambs |>
filter_index('1973 Jan' ~ '2005 Jan')
lamb_fit <- lamb_train |>
model(
Mean = MEAN(Count),
Naive = NAIVE(Count),
Seas_naive = SNAIVE(Count),
Rand_walk = RW(Count ~ drift())
)
lamb_fc <- lamb_fit |>
forecast(h = 24)
lamb_fc |>
autoplot(lamb_train, level = NULL)
In ACF plot i wanted to see if seasonality is present in this series.
There are some spikes at lags 1, 12, 23 – showing spikes every 11
months. That’s very unusual and not enough evidence to make a conclusion
that seasonality present in this series.
Also the trend of the series can overlap seasonality pattern to be
detected in ACF. I think after decomposition it may show the seasonal
pattern.
Seasonal Naive benchmark method is most appropriate for this series.
hh_budget).?hh_budget
## starting httpd help server ... done
wealth <- hh_budget |>
filter(Country == 'Australia') |>
select(Wealth)
wealth
## # A tsibble: 22 x 2 [1Y]
## Wealth Year
## <dbl> <dbl>
## 1 315. 1995
## 2 315. 1996
## 3 323. 1997
## 4 339. 1998
## 5 354. 1999
## 6 350. 2000
## 7 348. 2001
## 8 349. 2002
## 9 360. 2003
## 10 379. 2004
## # ℹ 12 more rows
wealth |> autoplot(Wealth)
wealth_fit <- wealth |>
model(RW(Wealth~drift()))
wealth_fit |>
forecast(h = 9) |>
autoplot(wealth, level= NULL)
The series apper to have overall increasing trend and I think the appropriate benchmark model for this seies is Random Walk with the Drift.
aus_retail).food_takeaway <- aus_retail |>
filter(Industry == 'Takeaway food services') |>
summarise(Turnover = sum(Turnover, na.rm = TRUE))
food_takeaway
## # A tsibble: 441 x 2 [1M]
## Month Turnover
## <mth> <dbl>
## 1 1982 Apr 194.
## 2 1982 May 194.
## 3 1982 Jun 186.
## 4 1982 Jul 190.
## 5 1982 Aug 190.
## 6 1982 Sep 195.
## 7 1982 Oct 209
## 8 1982 Nov 212.
## 9 1982 Dec 238.
## 10 1983 Jan 225.
## # ℹ 431 more rows
food_takeaway |>
autoplot(Turnover)
The plot of this series shows overall rising trend, and presence of
seasonality.
There is also the variance increases with the level of the trend over
the time, indicating Heteroscedasticity in this data.
To stabilize the variance I will apply Box-Cox and log
transformations.
lambda_food <- food_takeaway |>
features(Turnover, features = guerrero) |>
pull(lambda_guerrero)
food_takeaway <- food_takeaway |>
mutate(Turnover_log = log(Turnover)) |>
mutate(Turnover_bxcx = box_cox(Turnover, lambda_food))
lambda_food
## [1] 0.04226665
food_takeaway |>
select(Month, Turnover, Turnover_log, Turnover_bxcx) |>
pivot_longer(-Month,
names_to = 'Transformation',
values_to = 'Value') |>
autoplot(Value)+
facet_wrap(~Transformation,
nrow = 3,
scales = 'free_y')
#food_takeaway |>
# ACF(Turnover_log) |>
# autoplot()
As we see Box-Cox (lambda = 0.042) and Log transformation are almost identical, I will use Log transformation to apply a model.
ft_dcmp <- food_takeaway |>
model(stlf = decomposition_model(
STL(log(Turnover) ~ trend(window = 7), robust = TRUE),
NAIVE(season_adjust))
)
ft_dcmp |>
forecast(h = 36) |>
autoplot(food_takeaway, level = NULL)
The series appear to have a strong seasonal pattern, therefore Seasonal Naive benchmark method applied to variance stabilized data is appropriate.
gafa_stock) to do the following:fb_stock <- gafa_stock |>
filter(Symbol == 'FB') |>
mutate(day = row_number()) |>
update_tsibble(index = day, regular = TRUE)
fb_stock
## # A tsibble: 1,258 x 9 [1]
## # Key: Symbol [1]
## Symbol Date Open High Low Close Adj_Close Volume day
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 FB 2014-01-02 54.8 55.2 54.2 54.7 54.7 43195500 1
## 2 FB 2014-01-03 55.0 55.7 54.5 54.6 54.6 38246200 2
## 3 FB 2014-01-06 54.4 57.3 54.0 57.2 57.2 68852600 3
## 4 FB 2014-01-07 57.7 58.5 57.2 57.9 57.9 77207400 4
## 5 FB 2014-01-08 57.6 58.4 57.2 58.2 58.2 56682400 5
## 6 FB 2014-01-09 58.7 59.0 56.7 57.2 57.2 92253300 6
## 7 FB 2014-01-10 57.1 58.3 57.1 57.9 57.9 42449500 7
## 8 FB 2014-01-13 57.9 58.2 55.4 55.9 55.9 63010900 8
## 9 FB 2014-01-14 56.5 57.8 56.1 57.7 57.7 37503600 9
## 10 FB 2014-01-15 58.0 58.6 57.3 57.6 57.6 33663400 10
## # ℹ 1,248 more rows
fb_stock |>
autoplot(Close)+
labs(title = "Facebook closing stock prices")
fb_stock_fit <- fb_stock |>
model(Drift = RW(Close~drift()))
fb_stock_fc <- fb_stock_fit |>
forecast(h= 150) # forecasted for 90days since index = day
fb_stock_fc |>
autoplot(fb_stock, level = NULL) +
labs(title = "Facebook closing stock prices 150 days forecast",
subtitle = 'Drift forecast model in blue color')
line_data <- tibble(
Date = c(first(fb_stock$day), last(fb_stock$day)),
Close = c(first(fb_stock$Close), last(fb_stock$Close))
)
fb_stock_fc |>
autoplot(fb_stock, level = NULL) +
geom_line(data = line_data, aes(x = Date, y = Close),
colour = "blue", linetype = 2) +
labs(title = "Facebook closing stock prices 150 days forecast",
subtitle = 'Drift forecast model in blue color line\nIdentical dash line drawn between the first and last actual observations ')
In this plot dotted line is between first and last observations of the data, and we can see that RW Drift forecast’s solid blue line is extending that line.
fb_stock_before_2016 <- fb_stock |>
filter(year(Date)<=2015)
fb_stock_fit_all <- fb_stock_before_2016 |>
model(
Mean = MEAN(Close),
Naive = NAIVE(Close),
RWDrift = RW(Close~drift())
)
fb_stock_2016_jan_feb <- fb_stock |>
filter(yearmonth(Date) %in% yearmonth(c('2016 Jan', '2016 Feb')))
fb_stock_fc_all <- fb_stock_fit_all |>
forecast(new_data = fb_stock_2016_jan_feb)
fb_stock_fc_all |>
autoplot(fb_stock_before_2016, level = NULL) +
autolayer(fb_stock_2016_jan_feb, Close, color = 'black')+
guides(colour = guide_legend(title = 'Forecast'))
From this plot I can say that the MEAN model is inappropriate,
because stock prices are not fluctuating around it over long
period.
Random Walk with the drift model assumes that previous growth will
project into future forecast, but usually stock prices are not
guaranteed to have a constant growth.
The best and reasonable for short-term forecast is Naive model. The reason is that usually today’s price is the best to approximate tomorrow’s price, and Naive model is best option for stock prices benchmark forecast model.
# Extract data of interest
recent_production <- aus_production |>
filter(year(Quarter) >= 1992) |>
select(Beer)
recent_production |>
autoplot(Beer)+
labs(title = 'Australian beer production time series')
There is a slightly declining overall trend, and strong evidence of seasonal pattern in this series.
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()
Residual plot shows that residuals fluctuate around zero with no constant variance. There are couple of residuals with significant negative values.
Most of the lags on ACF plot are below confidence bounds, however there is noticeable spike at seasonal lag 4 on and suggests that Naive model didn’t fully captured seasonal pattern.
The histogram of residuals looks fairly symmetric and centered around zero. There is slight left-skewness.
Overall, these plots does not suggest that residuals looks like white noise. While the model captures much of the seasonal pattern, there is still some remaining seasonal pattern in the data.
aug <- augment(fit)
aug
## # A tsibble: 74 x 6 [1Q]
## # Key: .model [1]
## .model Quarter Beer .fitted .resid .innov
## <chr> <qtr> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(Beer) 1992 Q1 443 NA NA NA
## 2 SNAIVE(Beer) 1992 Q2 410 NA NA NA
## 3 SNAIVE(Beer) 1992 Q3 420 NA NA NA
## 4 SNAIVE(Beer) 1992 Q4 532 NA NA NA
## 5 SNAIVE(Beer) 1993 Q1 433 443 -10 -10
## 6 SNAIVE(Beer) 1993 Q2 421 410 11 11
## 7 SNAIVE(Beer) 1993 Q3 410 420 -10 -10
## 8 SNAIVE(Beer) 1993 Q4 512 532 -20 -20
## 9 SNAIVE(Beer) 1994 Q1 449 433 16 16
## 10 SNAIVE(Beer) 1994 Q2 381 421 -40 -40
## # ℹ 64 more rows
aug |>
features(.innov, ljung_box, lag = 8) |>
mutate(
lb_stat = round(lb_stat, 2),
lb_pvalue = format(lb_pvalue, scientific = FALSE),
innov_mean = mean(aug$.innov, na.rm = TRUE)
)
## # A tibble: 1 × 4
## .model lb_stat lb_pvalue innov_mean
## <chr> <dbl> <chr> <dbl>
## 1 SNAIVE(Beer) 32.3 0.00008335611 -1.57
Ljung-Box test reveals large Q* = 32.27 and extremely low
p-value = 0.000083.
The p-value is less than 0.05 therefore we reject hypothesis that
residuals are white noise.
Ljung-Box test shows significant remaining autocorrelation in residuals
(as we saw at lag 4 on CAF plot), so we can suggest that SNAIVE model
does not perfectly capture underlying time series structure.
# Look a some forecasts
fit |> forecast(h=20) |> autoplot(recent_production)
I think because the series appear to have declining trend over the years it affects the residuals of SNAIVE model:
in this case Snaive model forecasts current quarter based on value of the same quarter from previous year, in another words \(\hat{y}_t = y_{t-4}\). Therefore it effectively removes trend in the forecast
Because declining trend appears in series, the value in each quarter is usually lower than the same quarter of the previous year, and SNAIVE forecast method using \(\hat{y}_t = y_{t-4}\) can cause forecast errors and increase autocorrelations of residuals.
global_economy and the Bricks series
from aus_production. Use whichever of NAIVE()
or SNAIVE() is more appropriate in each
case.4.1Australian Exports series from
global_economy
?global_economy
aus_export <- global_economy |>
filter(Code == 'AUS') |>
select(Exports)
aus_export
## # A tsibble: 58 x 2 [1Y]
## Exports Year
## <dbl> <dbl>
## 1 13.0 1960
## 2 12.4 1961
## 3 13.9 1962
## 4 13.0 1963
## 5 14.9 1964
## 6 13.2 1965
## 7 12.9 1966
## 8 12.9 1967
## 9 12.3 1968
## 10 12.0 1969
## # ℹ 48 more rows
aus_export |>
autoplot(Exports) +
labs(title = 'Australian exports',
y = 'Exports of goods and services (% of GDP)'
)
aus_export_fit <- aus_export |>
model(RW(Exports ~ drift()))
aus_export_fit |>
forecast(h = 5) |>
autoplot(aus_export)
aus_export_fit |> gg_tsresiduals()
The residual series plot shows residuals are fluctuating around zero,
and it does not appear to have constant variance.
ACF plot shows that most of lags are below confidence boundaries, except
at lag 1 showing slightly higher autocorrelation.
Residuals distribution plot appears to be approximately normal, with
mild left-skewness.
Overall it appears that residual looks like a white noise. Let’s see if
Ljung-Box test confirms it.
aus_export_aug <- augment(aus_export_fit)
#aus_export_aug
aus_export_aug |>
features(.innov, ljung_box, lag = 10) |>
mutate(
lb_stat = round(lb_stat, 2),
lb_pvalue = format(lb_pvalue, scientific = FALSE),
mean = mean(aus_export_aug$.innov, na.rm = TRUE)
)
## # A tibble: 1 × 4
## .model lb_stat lb_pvalue mean
## <chr> <dbl> <chr> <dbl>
## 1 RW(Exports ~ drift()) 16.4 0.08963678 -8.41e-16
This test produces a larger p-value = 0.09, which means we fail to reject the null hypothesis of no autocorrelation in the residuals. This suggests that residuals looks like white noise and that the model has captured the underlying structure of the time series reasonably well.
4.2 Bricks series
from aus_production
?aus_production
bricks <- aus_production |>
select(Bricks) |>
filter(year(Quarter) >=1992)
bricks
## # A tsibble: 74 x 2 [1Q]
## Bricks Quarter
## <dbl> <qtr>
## 1 383 1992 Q1
## 2 404 1992 Q2
## 3 446 1992 Q3
## 4 420 1992 Q4
## 5 394 1993 Q1
## 6 462 1993 Q2
## 7 475 1993 Q3
## 8 443 1993 Q4
## 9 421 1994 Q1
## 10 475 1994 Q2
## # ℹ 64 more rows
bricks |>
autoplot(Bricks)+
labs(title = 'Australian bricks production',
y = 'Bricks (millions)')
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
bricks_fit <- bricks |>
model(SNAIVE(Bricks))
bricks_fit |> gg_tsresiduals()
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 24 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_rug()`).
The residual plot strongly suggests that residuals do not have zero
mean, there is no stable variation with couple of extreme negative
residuals around 1997 and 2003 possibly resulting from recessions in
economy. From this I already can say residuals are not like a white
noise.
AFC plot has a significantly high autocorrelation at lag 1, with several
other lags also above confidence bounds. If residuals were white noise,
most spikes would lie inside the confidence limits.
The distribution on histogram plot does not appear to be normal, it
looks bi-modal with the strong left skew.
Overall these residual diagnostic plots strongly suggest that residuals
are not white noise.
bricks_aug <- augment(bricks_fit)
bricks_aug
## # A tsibble: 74 x 6 [1Q]
## # Key: .model [1]
## .model Quarter Bricks .fitted .resid .innov
## <chr> <qtr> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(Bricks) 1992 Q1 383 NA NA NA
## 2 SNAIVE(Bricks) 1992 Q2 404 NA NA NA
## 3 SNAIVE(Bricks) 1992 Q3 446 NA NA NA
## 4 SNAIVE(Bricks) 1992 Q4 420 NA NA NA
## 5 SNAIVE(Bricks) 1993 Q1 394 383 11 11
## 6 SNAIVE(Bricks) 1993 Q2 462 404 58 58
## 7 SNAIVE(Bricks) 1993 Q3 475 446 29 29
## 8 SNAIVE(Bricks) 1993 Q4 443 420 23 23
## 9 SNAIVE(Bricks) 1994 Q1 421 394 27 27
## 10 SNAIVE(Bricks) 1994 Q2 475 462 13 13
## # ℹ 64 more rows
bricks_aug |>
features(.innov, ljung_box, lag = 8) |>
mutate(
lb_stat = round(lb_stat, 2),
lb_pvalue = format(lb_pvalue, scientific = FALSE),
mean = mean(bricks_aug$.innov, na.rm = TRUE)
)
## # A tibble: 1 × 4
## .model lb_stat lb_pvalue mean
## <chr> <dbl> <chr> <dbl>
## 1 SNAIVE(Bricks) 62.2 0.0000000001735133 -0.76
As we see p-value from Ljung-Box test is extremely low and almost
equals to zero, and confirms that residuals are not like a white
noise.
More advanced approaches and models likely perform better than SNAIVE
model in this series.
For this exercise I would like to choose “Supermarket and grocery stores” time series.
supermarket <- aus_retail |>
filter(Industry == 'Supermarket and grocery stores') |>
summarise(Turnover = sum(Turnover))
supermarket
## # A tsibble: 441 x 2 [1M]
## Month Turnover
## <mth> <dbl>
## 1 1982 Apr 919.
## 2 1982 May 906.
## 3 1982 Jun 918.
## 4 1982 Jul 957.
## 5 1982 Aug 909.
## 6 1982 Sep 940.
## 7 1982 Oct 984.
## 8 1982 Nov 1017.
## 9 1982 Dec 1174.
## 10 1983 Jan 945.
## # ℹ 431 more rows
This series have 441 monthly observations of overall Australian supermarket and grocery stores sales turnovers in a time range of April 1982 to November 2018.
supermarket |>
autoplot(Turnover)+
labs(title = 'Australian Supermarket and grocery stores sales',
y = 'Turnover ($ million AUD)')
supermarket_train <- supermarket |>
filter(year(Month) < 2011)
supermarket |>
autoplot(Turnover)+
autolayer(supermarket_train, Turnover, color = 'red')+
labs(title = 'Time series splitted into train and test sets',
subtitle = 'Train set before 2011 in red color',
y = 'Turnover ($ million AUD)' )
SNAIVE() applied to your training data
(myseries_train).The variance of the data proportionally increases with the level of
the trend over time, therefore it is appropriate to apply a Log
transformation.
There also rising trend with strong evidence of seasonality in this
series.
I will fit following models to see how residuals and accuracy behave in
these models:
Seasonal Naive model applied to original data
Seasonal Naive model applied to Log transformed data
Use STL decomposition on log-transformed data and apply Naive model to seasonal adjusted component.
supermarket_fit_decomposed <- supermarket_train |>
model(
stl_sadj = decomposition_model(
STL(log(Turnover)~ trend(window = 15)+
season(window = 'periodic'), robust = TRUE
),
NAIVE(season_adjust))
)
supermarket_fit <- supermarket_train |>
model(Snaive_log = SNAIVE(log(Turnover)),
Snaive = SNAIVE(Turnover),
STL_LOG_Naive= decomposition_model(
STL(log(Turnover)~ trend(window = 15)+
season(window = 'periodic'), robust = TRUE
),
NAIVE(season_adjust)))
supermarket_fit |>
select(Snaive) |>
gg_tsresiduals()+
labs(title = 'Residual diagnostic plots for SNAIVE model on original data')
ACF plot reveals that all lags have significantly high
autocorrelations, strongly suggesting residuals are not a white
noise.
Also residual plot shows that almost all residuals are above zero, which
means residuals do not have zero mean. The variance of residuals do not
appear to be stable and increasing over time, because of the strong
rising trend in the series.
Histogram plot confirms majority of residuals are above zero and there
is strong right skeweness.
Overall, these residuals diagnostics strongly suggest that residuals
from SNAIVE model on original data are not white noise.
supermarket_fit |>
select(Snaive_log) |>
gg_tsresiduals()+
labs(title = 'Residual diagnostic plots for SNAIVE model on log-transformed data')
Residuals variance slightly stabilized compared to previous
diagnostics, but still almost all of the residuals are above zero and do
not have zero mean.
Log transformation reduced amount of lags with significantly high
autocorrelations compared to residuals from model using original data.
But there are still several lags well above confidence boundaries
confirming that residuals are not white noise.
Distribution of residuals appears to be close to be normal, however it’s
centered far right from zero.
Overall log transformation mildly improved SNAIVE model to capture structure of the series, but residual are still not a white noise.
supermarket_fit |>
select(STL_LOG_Naive) |>
gg_tsresiduals()+
labs(title = 'Residual diagnostic plots for decomposed Log transformed data')
Residuals appear to have zero mean and more stable variance compared
to previous models. However there is a significant spike around
1983.
ACF plot is much improved, but still has several lags with significant
autocorrelations.
Histogram of residuals distribution appears to be close to normal
centered close to zero.
Overall, NAIVE model fitted with decomposed component showed much improved residual diagnostic results compared to previous models. However, it did not fully captured structure of the series and residual are not a white noise.
supermarket_fc <- supermarket_fit |>
forecast(new_data = anti_join(supermarket,
supermarket_train))
## Joining with `by = join_by(Month, Turnover)`
supermarket_fc |>
autoplot(supermarket) +
facet_wrap(~.model, nrow = 3, scales = 'free_y')
As we see from this plots, prediction intervals of the model fitted
on original almost entirely missed actual test data.
Prediction intervals of the model trained on Log transformed data
captures actual values well below 95% probability bound, but with most
of actual values are above 80% probability bound.
As we see in the last plot STL_LOG_Naive 80% prediction interval
contains majority of the actual values.
Overall, the STL_LOG_Naive model appears to provide the most reasonable
forecasts. The actual data points are largely contained within its
prediction intervals, and the spread of those intervals looks consistent
with the variability of the series. The simple Snaive model on the
original data seems too restrictive for a series with strong growth and
multiplicative seasonality.
supermarket_fc |>
accuracy(supermarket, list(skill = skill_score(CRPS)))
## # A tibble: 3 × 3
## .model .type skill
## <chr> <chr> <dbl>
## 1 STL_LOG_Naive Test 0.273
## 2 Snaive Test 0
## 3 Snaive_log Test 0.212
As we see from comparisons using skill scores SNAIVE method fitted on
log transformed train set is 21.2% better than simple SNAIVE model
fitted on original scale data.
NAIVE model applied to STL decomposed component and Log transformed data
shows best 27.27% improvement in forecast interval accuracy compared to
SNAIVE fitted on original data.
This is consistent with what we observed visually. The original SNAIVE
model does not handle the increasing variance and evolving trend very
well. Log transformation improves stability, and combining STL
decomposition with log transformation gives the most accurate and
flexible forecasts intervals.
Overall, the results suggest that using Log or Box-Cox transformations for multiplicative structure (variance proportional change along trend level) and separating trend from seasonality improves forecasting performance for this series.
supermarket_fc |> accuracy(supermarket) |>
select(.model, RMSE, MAE, MAPE, MASE)
## # A tibble: 3 × 5
## .model RMSE MAE MAPE MASE
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 STL_LOG_Naive 1353. 1195. 14.2 5.82
## 2 Snaive 1564. 1380. 16.5 6.72
## 3 Snaive_log 1467. 1294. 15.4 6.30
From point accuracy measurements we can see that SNAIVE method with
original scale data has largest errors.
The SNAIVE with log-transformed data model improves all error measures
compared to original SNAIVE. RMSE and MAE are lower, and MAPE drops from
about 16.47% to 15.44%.
The STL_LOG_Naive model performs best. It has the lowest RMSE
(1353.280), lowest MAE (1194.703), lowest MAPE (14.23%).
So based on these results, STL_LOG_Naive is the most accurate model among the three. The improvement suggest that variance stabilization using log transformation, and separating trend and seasonality with STL decomposition helps to capture the underlying structure of the series better.