Exercises 5.1, 5.2, 5.3, 5.4 and 5.7
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.3.3
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.6
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.1
## ✔ lubridate 1.9.4 ✔ fable 0.4.1
## ✔ ggplot2 3.5.1
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tsibble' was built under R version 4.3.3
## Warning: package 'tsibbledata' was built under R version 4.3.3
## Warning: package 'feasts' was built under R version 4.3.3
## Warning: package 'fabletools' was built under R version 4.3.3
## Warning: package 'fable' was built under R version 4.3.3
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(dplyr)
library(ggplot2)
library(tsibble)
library(fable)
[Australian Population (global_economy)]
aus_population <- global_economy %>%
filter(Code == 'AUS') %>%
select(Year, Population)
autoplot(aus_population, Population) +
labs(title = "Total Population in Australia")
# Drift
aus_population %>%
model(RW(Population ~ drift())) %>%
forecast(h = 10) %>%
autoplot(aus_population) +
labs(title = "Total Population in Australia", subtitle = "Drift Forecast")
# Checking accuracy
fit_mable <- aus_population %>%
model(naive = NAIVE(Population),
drift = RW(Population ~ drift()),
mean = MEAN(Population))
fc <- fit_mable %>%
forecast(h = 10)
accuracy(fit_mable) %>%
arrange(.model) %>%
select(.model, .type, RMSE, MAE, MAPE, MASE, RMSSE)
## # A tibble: 3 × 7
## .model .type RMSE MAE MAPE MASE RMSSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 drift Training 76870. 59124. 0.347 0.235 0.293
## 2 mean Training 3979165. 3388690. 21.7 13.5 15.1
## 3 naive Training 262766. 251271. 1.52 1 1
The naive method doesn't work well here, as it simply carries forward the last observed value, while this series shows near-exponential growth over time. I used the drift method instead, which allows the forecasts to increase or decrease over time at the average historical rate of change. The accuracy table confirms this choice: drift has the lowest errors on every measure, making it the most accurate of the three benchmarks.
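For reference, the drift method extrapolates the average historical change, which is equivalent to drawing a line from the first to the last observation, as given in Chapter 5.2:
\[
\hat{y}_{T+h|T} = y_T + \frac{h}{T-1}\sum_{t=2}^{T}(y_t - y_{t-1}) = y_T + h\left(\frac{y_T - y_1}{T-1}\right)
\]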
[Bricks (aus_production)]
bricks_data <- aus_production %>%
select(Quarter, Bricks) %>%
drop_na()
autoplot(bricks_data, Bricks) +
labs(title = "Brick Production")
# SNAIVE
bricks_data %>%
model(SNAIVE(Bricks)) %>%
forecast(h = 10) %>%
autoplot(bricks_data) +
labs(title = "Brick Production", subtitle = "SNAIVE Forecast")
We can see regular seasonal fluctuations, so we use the SNAIVE method, which repeats the most recent observation from the same quarter of the previous year.
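The seasonal naive forecast, as defined in Chapter 5.2, is
\[
\hat{y}_{T+h|T} = y_{T+h-m(k+1)},
\]
where \(m\) is the seasonal period (\(m = 4\) for quarterly data) and \(k\) is the integer part of \((h-1)/m\).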
[NSW Lambs (aus_livestock)]
nsw_lambs <- aus_livestock %>%
filter(Animal == "Lambs") %>%
filter(State == "New South Wales")
autoplot(nsw_lambs, Count) +
labs(title = "Lambs slaughtered in New South Wales")
nsw_lambs %>%
model(NAIVE(Count)) %>%
forecast(h = 50) %>%
autoplot(nsw_lambs) +
labs(title = "Lambs slaughtered in New South Wales", subtitle = "NAIVE Forecast")
There is no seasonality, nor a consistent increase or decrease, so we'll use the naive method.
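For comparison, the naive method simply carries the last observation forward for all horizons:
\[
\hat{y}_{T+h|T} = y_T
\]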
[Household wealth (hh_budget)]
hh_wealth <- hh_budget %>%
select(Year, Wealth)
autoplot(hh_wealth, Wealth) +
labs(title = "Wealth as a Percentage of Net Disposable Income")
# Check which model to use
fit_mable <- hh_wealth %>%
model(naive = NAIVE(Wealth),
drift = RW(Wealth ~ drift()),
mean = MEAN(Wealth))
fc <- fit_mable %>%
forecast(h = 10)
accuracy(fit_mable) %>%
arrange(.model) %>%
select(.model, .type, RMSE, MAE, MAPE, MASE, RMSSE)
## # A tibble: 12 × 7
## .model .type RMSE MAE MAPE MASE RMSSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 drift Training 22.6 15.6 4.44 0.923 0.975
## 2 drift Training 22.6 17.7 3.90 0.903 0.946
## 3 drift Training 16.4 12.1 2.28 0.761 0.877
## 4 drift Training 32.7 24.7 4.75 0.927 0.981
## 5 mean Training 30.5 24.6 6.84 1.46 1.32
## 6 mean Training 42.7 32.7 6.98 1.67 1.79
## 7 mean Training 57.9 47.6 9.54 2.99 3.10
## 8 mean Training 42.1 35.8 6.73 1.34 1.26
## 9 naive Training 23.1 16.9 4.75 1 1
## 10 naive Training 23.9 19.6 4.28 1 1
## 11 naive Training 18.6 15.9 3.03 1 1
## 12 naive Training 33.4 26.6 5.10 1 1
hh_wealth %>%
model(RW(Wealth ~ drift())) %>%
forecast(h = 10) %>%
autoplot(hh_wealth) +
labs(title = "Wealth as a Percentage of Net Disposable Income", subtitle = "Drift Forecast")
There appears to be a dip for the USA and Canada around 2008, but this is likely the effect of the 2008 recession rather than seasonality, since the data are annual. Overall, every country's wealth shifts up and down over time with a general upward drift, so I'll use the drift method.
[Australian takeaway food turnover (aus_retail)]
takeaway <- aus_retail %>%
filter(Industry == 'Takeaway food services') %>%
filter(State == 'Australian Capital Territory')
autoplot(takeaway, Turnover) +
labs(title = "Australian Takeaway Food Turnover")
takeaway %>%
model(RW(Turnover ~ drift())) %>%
forecast(h = 10) %>%
autoplot(takeaway) +
labs(title = "Australian Takeaway Food Turnover", subtitle = "Drift Forecast")
I see an upward trend with little apparent seasonality, hence a drift forecast.
[Facebook stock price (gafa_stock)]
a) Produce a time plot of the series.
fb_stock <- gafa_stock %>%
filter(Symbol == 'FB')%>%
mutate(day = row_number()) %>%
update_tsibble(index = day, regular = TRUE)
autoplot(fb_stock, Close) +
labs(title = "Facebook Closing Stock Price")
This dataset has irregular trading days, so the index had to be changed to only use trading days, similar to the Google example in Chapter 5.2.
b) Produce forecasts using the drift method and plot them.
fb_stock %>%
model(RW(Close ~ drift())) %>%
forecast(h = 100) %>%
autoplot(fb_stock) +
labs(title = "Facebook Closing Stock Price", subtitle = "Drift Forecast")
c) Show that the forecasts are identical to extending the line drawn between the first and last observations.
# Calculate slope and intercept for the line between observations
first_close <- fb_stock$Close[1]
last_close <- fb_stock$Close[nrow(fb_stock)]
first_day <- fb_stock$day[1]
last_day <- fb_stock$day[nrow(fb_stock)]
slope <- (last_close - first_close) / (last_day - first_day)
intercept <- first_close - slope * first_day
# Data frame for the line through the first and last observations,
# extended through the 100-day forecast horizon so it overlays the forecast
extended_line <- tibble(day = seq(first_day, last_day + 100),
Close = slope * day + intercept)
# Drift forecast
fb_stock_forecast <- fb_stock %>%
model(RW(Close ~ drift())) %>%
forecast(h = 100)
# Plot
autoplot(fb_stock_forecast, fb_stock) +
geom_line(data = extended_line, aes(x = day, y = Close), color = "green") +
labs(title = "Facebook Closing Stock Price", subtitle = "Drift Forecast") +
geom_text(data = tail(extended_line, 1), # last close of the extended line
aes(x = day, y = Close, label = sprintf("%.2f", Close)),
vjust = 5, hjust = 0.5, color = "black", size = 4)
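As a quick numeric check (using the slope, intercept, and forecast objects defined above), the drift point forecasts can be compared with the straight line extended h = 1, ..., 100 days past the last observation; the two should agree up to floating-point tolerance:
# Drift point forecasts vs. the line extrapolated past the last observation
line_ahead <- slope * (last_day + 1:100) + intercept
all.equal(fb_stock_forecast$.mean, line_ahead) # should return TRUE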
d) Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
fb_stock %>%
model(Mean = MEAN(Close),
Naive = NAIVE(Close),
Snaive = SNAIVE(Close),
Drift = RW(Close ~ drift())) %>%
forecast(h = 100) %>%
autoplot(fb_stock) +
labs(title = "Facebook Closing Stock Price", subtitle = "Various Forecasts")
## Warning: 1 error encountered for Snaive
## [1] Non-seasonal model specification provided, use RW() or provide a different lag specification.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning: Removed 100 rows containing missing values or values outside the scale range
## (`geom_line()`).
Of these benchmark methods, SNAIVE can be discounted immediately: the re-indexed daily series has no seasonal period, so the model errors out, as the warning above shows. The naive forecast is just a flat line at the last closing price, leaving the mean and drift methods as the serious candidates. The mean is the worse of the two: the data rose steadily, hitting $217.50 on 2018-07-25, before beginning to drop, so the long-run average sits well below recent prices. The drop is likely due to the scandals that caused prices to fall 40%, and it's uncertain where they would go next. Still, I believe it wisest to lean on the most recent values and the overall direction of the series, so I'll stick with the drift model.
[Quarterly beer production (aus_production)]
# Extract data of interest
recent_production <- aus_production %>%
filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production %>%
model(SNAIVE(Beer))
# residuals
fit %>%
gg_tsresiduals()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
# forecast
fit %>% forecast() %>%
autoplot(recent_production)
#Ljung-Box Test, lag=2m for seasonal data, m=4 for our quarterly data
augment(fit) %>%
features(.innov, ljung_box, lag = 8)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 SNAIVE(Beer) 32.3 0.0000834
The autocorrelation function (ACF) plot shows that the innovation residuals are somewhat centered around 0, although there is a large negative autocorrelation at lag 4, which indicates the residuals are not completely random at that lag. The histogram of the residuals appears roughly normal, and the time plot shows seasonality with a small overall decrease in the trend. If the residuals were uncorrelated with a mean of 0, we could conclude the model had extracted all the available information. However, the Ljung-Box test, discussed in Chapter 5.4, checks whether the autocorrelations \(r_k\) are collectively small; here the p-value of 0.0000834 is far below 0.05, so we reject the null hypothesis and conclude the residuals are distinguishable from white noise. The seasonal naive model therefore leaves some information in the data unused.
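For reference, the Ljung-Box statistic from Chapter 5.4 is
\[
Q^* = T(T+2)\sum_{k=1}^{\ell}\frac{r_k^2}{T-k},
\]
where \(T\) is the number of observations and \(\ell\) is the number of lags tested (here \(\ell = 8\)); large values of \(Q^*\) indicate the autocorrelations are too big to have come from a white noise series.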
[Australian Exports (global_economy)]
aus_exports <- global_economy %>%
filter(Code == "AUS") %>%
select(Exports)
autoplot(aus_exports, Exports) +
labs(title = "Australian Exports")
# No seasonal data
fit <- aus_exports %>%
model(NAIVE(Exports))
# residuals
fit %>%
gg_tsresiduals()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
# forecast
fit %>% forecast() %>% autoplot(aus_exports) +
labs(title = "Australian Exports", subtitle = "Naive Forecast")
# Ljung-Box test; the exports series is annual (non-seasonal), so the
# lag = 2m rule for seasonal data does not apply; lag = 8 is used here
augment(fit) %>%
features(.innov, ljung_box, lag = 8)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 NAIVE(Exports) 15.5 0.0508
The autocorrelation function (ACF) plot doesn't appear to follow any pattern in the innovation residuals, with small negative correlations at lags 1, 4, 7, and 12 and larger positive autocorrelations at lags 5, 8, 13, and 16. The mean of the residuals is centered around 0, apart from the large outlier near the start of the series. The histogram of the innovation residuals is approximately normally distributed, and the Ljung-Box test has a p-value of 0.0508, just above 0.05, so we cannot reject the hypothesis that the residuals are white noise. I believe this naive forecast model is a good fit.
[Bricks series from aus_production]
bricks_data <- aus_production %>%
select(Quarter, Bricks) %>%
drop_na()
autoplot(bricks_data, Bricks) +
labs(title = "Brick Production")
fit <- bricks_data %>%
model(SNAIVE(Bricks))
# residuals
fit %>%
gg_tsresiduals()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
# forecasts
fit %>% forecast() %>% autoplot(bricks_data) +
labs(title = "Brick Production", subtitle = "SNaive Forecast")
#Ljung-Box Test, lag=2m for seasonal data, m=4 for our quarterly data
augment(fit) %>%
features(.innov, ljung_box, lag = 8)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 SNAIVE(Bricks) 274. 0
The autocorrelation function (ACF) plot confirms that the data are seasonal, given the wave-like fluctuations in the autocorrelations at fixed intervals. The histogram of residuals appears roughly normally distributed around a mean of 0, though with a left skew caused by some large negative outliers; these outliers, appearing around 1975 and 1983, are visible in both the quarterly autoplot and the innovation residuals plot. The Ljung-Box statistic of 274 with a p-value of essentially 0 confirms the residuals are strongly autocorrelated rather than white noise, so the seasonal naive model does not capture all the information in this series.
[Retail time series (aus_retail)]
set.seed(22397)
myseries <- aus_retail %>%
filter(`Series ID` == sample(aus_retail$`Series ID`, 1))
autoplot(myseries, Turnover) +
labs(title = "Retail Turnover")
a) Create a training dataset consisting of observations before 2011 using filter().
myseries_train <- myseries %>%
filter(year(Month) < 2011)
b) Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
autolayer(myseries_train, Turnover, colour = "red")
c) Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).
fit <- myseries_train %>%
model(SNAIVE(Turnover))
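Parts (d) and (e) are not reproduced here, but the fc object used in (f) below must come from forecasting the fitted model over the test period. A minimal sketch following the train/test pattern of Chapter 5.8, where anti_join() keeps only the post-2010 observations:
# d) Residual diagnostics for the training fit
fit %>%
gg_tsresiduals()
# e) Forecast the test period and plot against the full series
fc <- fit %>%
forecast(new_data = anti_join(myseries, myseries_train))
fc %>% autoplot(myseries)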
f) Compare the accuracy of your forecasts against the actual values.
# actual value
fit %>%
accuracy()
## # A tibble: 1 × 12
## State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Western… Newspap… SNAIV… Trai… 0.838 5.67 4.09 2.49 16.0 1 1 0.849
#forecast
fc %>%
accuracy(myseries)
## # A tibble: 1 × 12
## .model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… West… Newspap… Test 3.04 9.11 7.34 6.06 19.9 1.79 1.61 0.513
Interestingly, the forecast plot in (e) expects the downward trend to continue: the prediction intervals slowly shift downwards from January 2011 onward, reflecting the slight decrease at the end of the training data. The actual data, however, begin to increase towards 2020, with seasonality. Comparing the ME, RMSE, MAE, MPE, and MAPE of the training fit against the test-set forecasts, the forecasts are clearly less accurate. Chapter 5.8 discusses these metrics, saying 'When comparing forecast methods applied to a single time series, or to several time series with the same units, the MAE is popular as it is easy to both understand and compute. A forecast method that minimizes the MAE will lead to forecasts of the median, while minimizing the RMSE will lead to forecasts of the mean.' The training ACF1 value of 0.849 is quite high, indicating the model leaves substantial autocorrelation in the innovation residuals, although this value is lower for the test forecasts, at 0.513. Overall, the model performs worse on the test data than on the training data.
g) How sensitive are the accuracy measures to the amount of training data used?
The accuracy measures depend on the amount of training data used; here the training set was 345 of 441 records, about 78% of the original data. The RMSE and MAE are smaller on the training data than on the test data. Because the model is a seasonal naive method, each prediction simply repeats the value from the same month of the previous year, so the model is sensitive mainly to the final year of training data rather than to the total amount. The training period captured the seasonality and the rising trend-cycle, as well as the slight drop at the end, which is why the forecasts continue downward. The model still shows high autocorrelation in the innovation residuals, along with a fairly large mean absolute percentage error (MAPE). For seasonal naive models, the variation within the most recent seasons matters more than the overall size of the training set.
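To probe this sensitivity directly, one could refit the model on training sets of different lengths and compare test accuracy. A minimal sketch, using purrr (which fpp3 does not attach) and cutoff years that are illustrative choices rather than part of the exercise:
# Refit SNAIVE with several training cutoffs and compare test-set accuracy
# (cutoff years are illustrative, not from the original exercise)
library(purrr)
cutoffs <- c(2000, 2005, 2011, 2015)
map_dfr(cutoffs, function(y) {
train <- myseries %>% filter(year(Month) < y)
train %>%
model(SNAIVE(Turnover)) %>%
forecast(new_data = anti_join(myseries, train)) %>%
accuracy(myseries) %>%
mutate(cutoff = y, .before = 1)
})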