Exercises 5.1, 5.2, 5.3, 5.4 and 5.7
library(fpp3)
## Warning: package 'fpp3' was built under R version 4.3.3
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──
## ✔ tibble 3.2.1 ✔ tsibble 1.1.6
## ✔ dplyr 1.1.4 ✔ tsibbledata 0.4.1
## ✔ tidyr 1.3.1 ✔ feasts 0.4.1
## ✔ lubridate 1.9.4 ✔ fable 0.4.1
## ✔ ggplot2 3.5.1
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tsibble' was built under R version 4.3.3
## Warning: package 'tsibbledata' was built under R version 4.3.3
## Warning: package 'feasts' was built under R version 4.3.3
## Warning: package 'fabletools' was built under R version 4.3.3
## Warning: package 'fable' was built under R version 4.3.3
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date() masks base::date()
## ✖ dplyr::filter() masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tsibble::setdiff() masks base::setdiff()
## ✖ tsibble::union() masks base::union()
library(dplyr)
library(ggplot2)
library(tsibble)
library(fable)
[Australian Population (global_economy)]
aus_population <- global_economy %>%
filter(Code == 'AUS') %>%
select(Year, Population)
autoplot(aus_population, Population) +
labs(title = "Total Population in Australia")
# Drift
aus_population %>%
model(RW(Population ~ drift())) %>%
forecast(h = 10) %>%
autoplot(aus_population) +
labs(title = "Total Population in Australia", subtitle = "Drift Forecast")
# Checking accuracy
fit_mable <- aus_population %>%
model(naive = NAIVE(Population),
drift = RW(Population ~ drift()),
mean = MEAN(Population))
fc <- fit_mable %>%
forecast(h = 10)
accuracy(fit_mable) %>%
arrange(.model) %>%
select(.model, .type, RMSE, MAE, MAPE, MASE, RMSSE)
## # A tibble: 3 × 7
## .model .type RMSE MAE MAPE MASE RMSSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 drift Training 76870. 59124. 0.347 0.235 0.293
## 2 mean Training 3979165. 3388690. 21.7 13.5 15.1
## 3 naive Training 262766. 251271. 1.52 1 1
The naive method doesn't work well here, as it simply carries forward the last observed value, while this series shows near-exponential growth over time. I used the drift method instead, which allows the forecasts to increase or decrease over time at the average historical rate of change. The accuracy table confirms this choice: drift has the lowest errors on every measure, making it the most accurate of the three benchmarks.
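For reference, the drift method extrapolates the average historical change, which is equivalent to drawing a line from the first to the last observation, as given in Chapter 5.2:
\[
\hat{y}_{T+h|T} = y_T + \frac{h}{T-1}\sum_{t=2}^{T}(y_t - y_{t-1}) = y_T + h\left(\frac{y_T - y_1}{T-1}\right)
\]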
[Bricks (aus_production)]
bricks_data <- aus_production %>%
select(Quarter, Bricks) %>%
drop_na()
autoplot(bricks_data, Bricks) +
labs(title = "Brick Production")
# SNAIVE
bricks_data %>%
model(SNAIVE(Bricks)) %>%
forecast(h = 10) %>%
autoplot(bricks_data) +
labs(title = "Brick Production", subtitle = "SNAIVE Forecast")
We can see regular seasonal fluctuations, so we use the SNAIVE method, which repeats the most recent observation from the same quarter of the previous year.
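The seasonal naive forecast, as defined in Chapter 5.2, is
\[
\hat{y}_{T+h|T} = y_{T+h-m(k+1)},
\]
where \(m\) is the seasonal period (\(m = 4\) for quarterly data) and \(k\) is the integer part of \((h-1)/m\).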
[NSW Lambs (aus_livestock)]
nsw_lambs <- aus_livestock %>%
filter(Animal == "Lambs") %>%
filter(State == "New South Wales")
autoplot(nsw_lambs, Count) +
labs(title = "Lambs slaughtered in New South Wales")
nsw_lambs %>%
model(NAIVE(Count)) %>%
forecast(h = 50) %>%
autoplot(nsw_lambs) +
labs(title = "Lambs slaughtered in New South Wales", subtitle = "NAIVE Forecast")
There is no seasonality, nor a consistent increase or decrease, so we'll use the naive method.
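For comparison, the naive method simply carries the last observation forward for all horizons:
\[
\hat{y}_{T+h|T} = y_T
\]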
[Household wealth (hh_budget)]
hh_wealth <- hh_budget %>%
select(Year, Wealth)
autoplot(hh_wealth, Wealth) +
labs(title = "Wealth as a Percentage of Net Disposable Income")
# Check which model to use
fit_mable <- hh_wealth %>%
model(naive = NAIVE(Wealth),
drift = RW(Wealth ~ drift()),
mean = MEAN(Wealth))
fc <- fit_mable %>%
forecast(h = 10)
accuracy(fit_mable) %>%
arrange(.model) %>%
select(.model, .type, RMSE, MAE, MAPE, MASE, RMSSE)
## # A tibble: 12 × 7
## .model .type RMSE MAE MAPE MASE RMSSE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 drift Training 22.6 15.6 4.44 0.923 0.975
## 2 drift Training 22.6 17.7 3.90 0.903 0.946
## 3 drift Training 16.4 12.1 2.28 0.761 0.877
## 4 drift Training 32.7 24.7 4.75 0.927 0.981
## 5 mean Training 30.5 24.6 6.84 1.46 1.32
## 6 mean Training 42.7 32.7 6.98 1.67 1.79
## 7 mean Training 57.9 47.6 9.54 2.99 3.10
## 8 mean Training 42.1 35.8 6.73 1.34 1.26
## 9 naive Training 23.1 16.9 4.75 1 1
## 10 naive Training 23.9 19.6 4.28 1 1
## 11 naive Training 18.6 15.9 3.03 1 1
## 12 naive Training 33.4 26.6 5.10 1 1
hh_wealth %>%
model(RW(Wealth ~ drift())) %>%
forecast(h = 10) %>%
autoplot(hh_wealth) +
labs(title = "Wealth as a Percentage of Net Disposable Income", subtitle = "Drift Forecast")
There appears to be a dip for the USA and Canada around 2008, but this is likely the effect of the 2008 recession rather than seasonality, since the data are annual. Overall, every country's wealth shifts up and down over time with a general upward drift, so I'll use the drift method.
[Australian takeaway food turnover (aus_retail)]
takeaway <- aus_retail %>%
filter(Industry == 'Takeaway food services') %>%
filter(State == 'Australian Capital Territory')
autoplot(takeaway, Turnover) +
labs(title = "Australian Takeaway Food Turnover")
takeaway %>%
model(RW(Turnover ~ drift())) %>%
forecast(h = 10) %>%
autoplot(takeaway) +
labs(title = "Australian Takeaway Food Turnover", subtitle = "Drift Forecast")
I see an upward trend with little apparent seasonality, hence a drift forecast.
[Facebook stock price (gafa_stock)]
a) Produce a time plot of the series.
fb_stock <- gafa_stock %>%
filter(Symbol == 'FB')%>%
mutate(day = row_number()) %>%
update_tsibble(index = day, regular = TRUE)
autoplot(fb_stock, Close) +
labs(title = "Facebook Closing Stock Price")
This dataset has irregular trading days, so the index had to be changed to only use trading days, similar to the Google example in Chapter 5.2.
b) Produce forecasts using the drift method and plot them.
fb_stock %>%
model(RW(Close ~ drift())) %>%
forecast(h = 100) %>%
autoplot(fb_stock) +
labs(title = "Facebook Closing Stock Price", subtitle = "Drift Forecast")
c) Show that the forecasts are identical to extending the line drawn between the first and last observations.
# Calculate slope and intercept for the line between observations
first_close <- fb_stock$Close[1]
last_close <- fb_stock$Close[nrow(fb_stock)]
first_day <- fb_stock$day[1]
last_day <- fb_stock$day[nrow(fb_stock)]
slope <- (last_close - first_close) / (last_day - first_day)
intercept <- first_close - slope * first_day
# Data frame for the line through the first and last observations,
# extended through the 100-day forecast horizon so it overlays the forecast
extended_line <- tibble(day = seq(first_day, last_day + 100),
Close = slope * day + intercept)
# Drift forecast
fb_stock_forecast <- fb_stock %>%
model(RW(Close ~ drift())) %>%
forecast(h = 100)
# Plot
autoplot(fb_stock_forecast, fb_stock) +
geom_line(data = extended_line, aes(x = day, y = Close), color = "green") +
labs(title = "Facebook Closing Stock Price", subtitle = "Drift Forecast") +
geom_text(data = tail(extended_line, 1), # last close of the extended line
aes(x = day, y = Close, label = sprintf("%.2f", Close)),
vjust = 5, hjust = 0.5, color = "black", size = 4)
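As a quick numeric check (using the slope, intercept, and forecast objects defined above), the drift point forecasts can be compared with the straight line extended h = 1, ..., 100 days past the last observation; the two should agree up to floating-point tolerance:
# Drift point forecasts vs. the line extrapolated past the last observation
line_ahead <- slope * (last_day + 1:100) + intercept
all.equal(fb_stock_forecast$.mean, line_ahead) # should return TRUE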
d) Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
fb_stock %>%
model(Mean = MEAN(Close),
Naive = NAIVE(Close),
Snaive = SNAIVE(Close),
Drift = RW(Close ~ drift())) %>%
forecast(h = 100) %>%
autoplot(fb_stock) +
labs(title = "Facebook Closing Stock Price", subtitle = "Various Forecasts")
## Warning: 1 error encountered for Snaive
## [1] Non-seasonal model specification provided, use RW() or provide a different lag specification.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
## Warning: Removed 100 rows containing missing values or values outside the scale range
## (`geom_line()`).
Of these benchmark methods, SNAIVE can be discounted immediately: the re-indexed daily series has no seasonal period, so the model errors out, as the warning above shows. The naive forecast is just a flat line at the last closing price, leaving the mean and drift methods as the serious candidates. The mean is the worse of the two: the data rose steadily, hitting $217.50 on 2018-07-25, before beginning to drop, so the long-run average sits well below recent prices. The drop is likely due to the scandals that caused prices to fall 40%, and it's uncertain where they would go next. Still, I believe it wisest to lean on the most recent values and the overall direction of the series, so I'll stick with the drift model.
[Quarterly beer production (aus_production)]
# Extract data of interest
recent_production <- aus_production %>%
filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production %>%
model(SNAIVE(Beer))
# residuals
fit %>%
gg_tsresiduals()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
# forecast
fit %>% forecast() %>%
autoplot(recent_production)
#Ljung-Box Test, lag=2m for seasonal data, m=4 for our quarterly data
augment(fit) %>%
features(.innov, ljung_box, lag = 8)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 SNAIVE(Beer) 32.3 0.0000834
The autocorrelation function (ACF) plot shows that the innovation residuals are somewhat centered around 0, although there is a large negative autocorrelation at lag 4, which indicates the residuals are not completely random at that lag. The histogram of the residuals appears roughly normal, and the time plot shows seasonality with a small overall decrease in the trend. If the residuals were uncorrelated with a mean of 0, we could conclude the model had extracted all the available information. However, the Ljung-Box test, discussed in Chapter 5.4, checks whether the autocorrelations \(r_k\) are collectively small; here the p-value of 0.0000834 is far below 0.05, so we reject the null hypothesis and conclude the residuals are distinguishable from white noise. The seasonal naive model therefore leaves some information in the data unused.
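For reference, the Ljung-Box statistic from Chapter 5.4 is
\[
Q^* = T(T+2)\sum_{k=1}^{\ell}\frac{r_k^2}{T-k},
\]
where \(T\) is the number of observations and \(\ell\) is the number of lags tested (here \(\ell = 8\)); large values of \(Q^*\) indicate the autocorrelations are too big to have come from a white noise series.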
[Australian Exports (global_economy)]
aus_exports <- global_economy %>%
filter(Code == "AUS") %>%
select(Exports)
autoplot(aus_exports, Exports) +
labs(title = "Australian Exports")
# No seasonal data
fit <- aus_exports %>%
model(NAIVE(Exports))
# residuals
fit %>%
gg_tsresiduals()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
# forecast
fit %>% forecast() %>% autoplot(aus_exports) +
labs(title = "Australian Exports", subtitle = "Naive Forecast")
# Ljung-Box test; the exports series is annual (non-seasonal), so the
# lag = 2m rule for seasonal data does not apply; lag = 8 is used here
augment(fit) %>%
features(.innov, ljung_box, lag = 8)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 NAIVE(Exports) 15.5 0.0508
The autocorrelation function (ACF) plot doesn't appear to follow any pattern in the innovation residuals, with small negative correlations at lags 1, 4, 7, and 12 and larger positive autocorrelations at lags 5, 8, 13, and 16. The mean of the residuals is centered around 0, apart from the large outlier near the start of the series. The histogram of the innovation residuals is approximately normally distributed, and the Ljung-Box test has a p-value of 0.0508, just above 0.05, so we cannot reject the hypothesis that the residuals are white noise. I believe this naive forecast model is a good fit.
[Bricks series from aus_production]
bricks_data <- aus_production %>%
select(Quarter, Bricks) %>%
drop_na()
autoplot(bricks_data, Bricks) +
labs(title = "Brick Production")
fit <- bricks_data %>%
model(SNAIVE(Bricks))
# residuals
fit %>%
gg_tsresiduals()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
# forecasts
fit %>% forecast() %>% autoplot(bricks_data) +
labs(title = "Brick Production", subtitle = "SNaive Forecast")
#Ljung-Box Test, lag=2m for seasonal data, m=4 for our quarterly data
augment(fit) %>%
features(.innov, ljung_box, lag = 8)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 SNAIVE(Bricks) 274. 0
The autocorrelation function (ACF) plot confirms that the data are seasonal, given the wave-like fluctuations in the autocorrelations at fixed intervals. The histogram of residuals appears roughly normally distributed around a mean of 0, though with a left skew caused by some large negative outliers; these outliers, appearing around 1975 and 1983, are visible in both the quarterly autoplot and the innovation residuals plot. The Ljung-Box statistic of 274 with a p-value of essentially 0 confirms the residuals are strongly autocorrelated rather than white noise, so the seasonal naive model does not capture all the information in this series.
[Retail time series (aus_retail)]
set.seed(22397)
myseries <- aus_retail %>%
filter(`Series ID` == sample(aus_retail$`Series ID`, 1))
autoplot(myseries, Turnover) +
labs(title = "Retail Turnover")
a) Create a training dataset consisting of observations before 2011 using filter().
myseries_train <- myseries %>%
filter(year(Month) < 2011)
b) Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
autolayer(myseries_train, Turnover, colour = "red")
c) Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).
fit <- myseries_train %>%
model(SNAIVE(Turnover))
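Parts (d) and (e) are not reproduced here, but the fc object used in (f) below must come from forecasting the fitted model over the test period. A minimal sketch following the train/test pattern of Chapter 5.8, where anti_join() keeps only the post-2010 observations:
# d) Residual diagnostics for the training fit
fit %>%
gg_tsresiduals()
# e) Forecast the test period and plot against the full series
fc <- fit %>%
forecast(new_data = anti_join(myseries, myseries_train))
fc %>% autoplot(myseries)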
f) Compare the accuracy of your forecasts against the actual values.
# actual value
fit %>%
accuracy()
## # A tibble: 1 × 12
## State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Western… Newspap… SNAIV… Trai… 0.838 5.67 4.09 2.49 16.0 1 1 0.849
#forecast
fc %>%
accuracy(myseries)
## # A tibble: 1 × 12
## .model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… West… Newspap… Test 3.04 9.11 7.34 6.06 19.9 1.79 1.61 0.513
Interestingly, the forecast plot in (e) expects the downward trend to continue: the prediction intervals slowly shift downwards from January 2011 onward, reflecting the slight decrease at the end of the training data. The actual data, however, begin to increase towards 2020, with seasonality. Comparing the ME, RMSE, MAE, MPE, and MAPE of the training fit against the test-set forecasts, the forecasts are clearly less accurate. Chapter 5.8 discusses these metrics, saying 'When comparing forecast methods applied to a single time series, or to several time series with the same units, the MAE is popular as it is easy to both understand and compute. A forecast method that minimizes the MAE will lead to forecasts of the median, while minimizing the RMSE will lead to forecasts of the mean.' The training ACF1 value of 0.849 is quite high, indicating the model leaves substantial autocorrelation in the innovation residuals, although this value is lower for the test forecasts, at 0.513. Overall, the model performs worse on the test data than on the training data.
g) How sensitive are the accuracy measures to the amount of training data used?
The accuracy measures depend on the amount of training data used; here the training set was 345 of 441 records, about 78% of the original data. The RMSE and MAE are smaller on the training data than on the test data. Because the model is a seasonal naive method, each prediction simply repeats the value from the same month of the previous year, so the model is sensitive mainly to the final year of training data rather than to the total amount. The training period captured the seasonality and the rising trend-cycle, as well as the slight drop at the end, which is why the forecasts continue downward. The model still shows high autocorrelation in the innovation residuals, along with a fairly large mean absolute percentage error (MAPE). For seasonal naive models, the variation within the most recent seasons matters more than the overall size of the training set.
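To probe this sensitivity directly, one could refit the model on training sets of different lengths and compare test accuracy. A minimal sketch, using purrr (which fpp3 does not attach) and cutoff years that are illustrative choices rather than part of the exercise:
# Refit SNAIVE with several training cutoffs and compare test-set accuracy
# (cutoff years are illustrative, not from the original exercise)
library(purrr)
cutoffs <- c(2000, 2005, 2011, 2015)
map_dfr(cutoffs, function(y) {
train <- myseries %>% filter(year(Month) < y)
train %>%
model(SNAIVE(Turnover)) %>%
forecast(new_data = anti_join(myseries, train)) %>%
accuracy(myseries) %>%
mutate(cutoff = y, .before = 1)
})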