Data-624 Homework 3

#Loading required R package and datasets:
library(fpp3)

## Warning: package 'fpp3' was built under R version 4.3.2

## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──

## ✔ tibble      3.2.1     ✔ tsibble     1.1.4
## ✔ dplyr       1.1.3     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.0     ✔ feasts      0.3.1
## ✔ lubridate   1.9.3     ✔ fable       0.3.3
## ✔ ggplot2     3.4.4     ✔ fabletools  0.3.4

## Warning: package 'tsibble' was built under R version 4.3.2

## Warning: package 'tsibbledata' was built under R version 4.3.2

## Warning: package 'feasts' was built under R version 4.3.2

## Warning: package 'fabletools' was built under R version 4.3.2

## Warning: package 'fable' was built under R version 4.3.2

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

library(ggplot2)
library(seasonal)

## Warning: package 'seasonal' was built under R version 4.3.2

## 
## Attaching package: 'seasonal'

## The following object is masked from 'package:tibble':
## 
##     view

data("global_economy")

Question 5.1:

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Australian Population (global_economy)

I have used RW(y ~ drift()) method, to show growth in the forecast.

global_economy %>% 
  filter(Country == "Australia") %>%
  model(RW(Population ~ drift())) %>%
  forecast(h = 5) %>%
  autoplot(global_economy) +
  labs(title = "Australia Population",
       subtitle = "1960 - 2017, Forecasted until 2022")

The chart suggests that the population of Australia has been growing steadily from 1960 to 2017 and is expected to continue to grow until 2022, with the forecasted values expressed with a certain level of confidence.

Bricks (aus_production)

I used SNAIVE() method, to find the forecast data from aus_production.

aus_production %>% 
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks)) %>%
  forecast(h = 14) %>%
  autoplot(aus_production) +
  labs(title = "Australian Bricks Production",
       subtitle = "1956 - 2005 Q2, Forecasted until 2008 Q4")

## Warning: Removed 20 rows containing missing values (`geom_line()`).

The forecast suggests increasing uncertainty in brick production as time moves forward, shown by the widening of the prediction intervals. There is an overall increasing trend in brick production from 1956 to around the late 1980s, after which the production seems to fluctuate more significantly.

NSW Lambs (aus_livestock)

Here I’m using NAIVE() method.

aus_livestock %>%
  filter(State == "New South Wales", 
         Animal == "Lambs") %>%
  model(NAIVE(Count)) %>%
  forecast(h = 24) %>%
  autoplot(aus_livestock) +
  labs(title = "Lambs in New South Wales",
       subtitle = "July 1976 - Dec 2018, Forecasted until Dec 2020")

The chart is a time series plot that depicts the count of lambs in New South Wales from July 1976 to December 2018, with a forecast extending to December 2020. The forecast suggests a wide range of possible outcomes for the future lamb count, with the prediction intervals widening as time progresses. This widening indicates increasing uncertainty in the forecast as we move further from the last historical data point.

Household wealth (hh_budget)

I am using RW(y ~ drift()) to find the data of Household wealth.

hh_budget %>%
  model(RW(Wealth ~ drift())) %>%
  forecast(h = 5) %>%
  autoplot(hh_budget) +
  labs(title = "Household Wealth",
       subtitle = "1996 - Dec 2016, Forecasted until 2021")

For all countries, the forecasted wealth is shown with two shades of blue, indicating the 80% and 95% prediction intervals. These intervals suggest the range within which the forecasted values are expected to lie with a certain level of confidence. The lighter shade represents a larger interval (95%), meaning there is less certainty that the wealth will fall within this range compared to the darker shade (80%).

Australian takeaway food turnover (aus_retail)

Using RW(y ~ drift()) method.

aus_retail %>%
  filter(Industry == "Cafes, restaurants and takeaway food services") %>%
  model(RW(Turnover ~ drift())) %>%
  forecast(h = 36) %>%
  autoplot(aus_retail) +
  labs(title = "Australian Takeaway Food Turnover",
       subtitle = "Apr 1982 - Dec 2018, Forecasted until Dec 2021") +
  facet_wrap(~State, scales = "free")

Across all regions, there is an overall positive trend in takeaway food turnover. The forecasted periods generally indicate expected growth, with varying degrees of uncertainty as shown by the width of the prediction intervals. The 80% prediction interval (darker shade) and 95% prediction interval (lighter shade) suggest that forecasters are more confident about the predictions closer to the historical data, with uncertainty growing as the forecast extends further into the future.

Question 5.2:

Use the Facebook stock price (data set gafa_stock) to do the following:

Produce a time plot of the series.

fb_stock <- gafa_stock %>%
  filter(Symbol == "FB") %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)

fb_stock%>%
  autoplot(Open) +
  labs(title= "Facebook Daily Open Price", y = "USD")

The stock price shows a general upward trend from the beginning of the period until around day 800, indicating a period of growth where the stock value increased. The chart shows a period of growth in Facebook’s opening stock price followed by a notable decline, with the time period covered likely spanning several years given the number of days indicated.

Produce forecasts using the drift method and plot them.

fb_stock %>%
  model(RW(Open ~ drift())) %>%
  forecast(h = 63) %>%
  autoplot(fb_stock) +
  labs(title = "Facebook Daily Open Price", y = "USD")

The data shows a general upward trend in the stock price, with some volatility, until a peak is reached. After the peak, there is a sharp decline. The forecast suggests that the price of Facebook’s stock could either stabilize or continue to decline, but the exact trajectory is uncertain, with a wide range of possible outcomes.

Show that the forecasts are identical to extending the line drawn between the first and last observations.

fb_stock %>%
  model(RW(Open ~ drift())) %>%
  forecast(h = 63) %>%
  autoplot(fb_stock) +
  labs(title = "Daily Open Price of Facebook", y = "USD") +
  geom_segment(aes(x = 1, y = 54.83, xend = 1258, yend = 134.45),
               colour = "blue", linetype = "dashed")

The chart shows the daily opening price of Facebook’s stock with an overall increasing trend in the opening price of the stock from the initial days up to a peak just before the day 1000. The forecast indicates that the model expects a possible continuation of the downward trend in the short term but with a significant level of uncertainty.

Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

I am using NAIVE() method instead of SNAIVE() method, as the data is not seasonal.

fb_stock %>%
  model(Mean = MEAN(Open),
        `Naïve` = NAIVE(Open),
        Drift = NAIVE(Open ~ drift())) %>%
  forecast(h = 63) %>%
  autoplot(fb_stock, level = NULL) +
  labs(title = "Daily Open Price of Facebook", y = "USD")

The daily open price of Facebook’s shares is shown on the chart for a period of time that looks to be approximately 1200 trading days. According to historical statistics, stock prices generally rise over time until they peak and then sharply decrease.

Question 5.3:

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help. What do you conclude?

# Extract data of interest
recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)

# Define and estimate a model
fit <- recent_production %>% model(SNAIVE(Beer))

# Look at the residuals
fit %>% gg_tsresiduals() +
  ggtitle("Residual Plots for Australian Beer Production")

## Warning: Removed 4 rows containing missing values (`geom_line()`).

## Warning: Removed 4 rows containing missing values (`geom_point()`).

## Warning: Removed 4 rows containing non-finite values (`stat_bin()`).

# Look at some forecasts
fit %>% forecast() %>% autoplot(recent_production)+
  ggtitle("Australian Beer Production")

#Box-Pierce test, L=2m for seasonal data, m=4
fit %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 8, dof = 0)

## # A tibble: 1 × 3
##   .model       bp_stat bp_pvalue
##   <chr>          <dbl>     <dbl>
## 1 SNAIVE(Beer)    29.7  0.000234

Performing Ljung-Box test:

fit %>%
  augment()%>% features(.innov, ljung_box, lag = 8, dof = 0)

## # A tibble: 1 × 3
##   .model       lb_stat lb_pvalue
##   <chr>          <dbl>     <dbl>
## 1 SNAIVE(Beer)    32.3 0.0000834

Given the very small p-values, the tests indicate that the findings can be distinguished from a white noise series. The residuals appear to be centered around zero and have a constant variance, indicating that the results are not white noise. Lag 4 is larger than the others, as indicated by the ACF plot, and this can be explained by peaks happening every 4 quarters in Q4 and troughs happening every Q2.

Question 5.4:

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

Extracting global_economy data:

# Extract data of interest
aus_exports <- global_economy %>%
  filter(Country == "Australia")

# Define and estimate a model
fit <- aus_exports %>% model(NAIVE(Exports))

# Look at the residuals
fit %>% gg_tsresiduals() +
  ggtitle("Residual Plots for Australian Exports")

## Warning: Removed 1 row containing missing values (`geom_line()`).

## Warning: Removed 1 rows containing missing values (`geom_point()`).

## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

# Look at some forecasts
fit %>% forecast() %>% autoplot(aus_exports) +
  ggtitle("Annual Australian Exports")

#Box-Pierce test, L=10 for non-seasonal data
fit %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 10, dof = 0)

## # A tibble: 1 × 4
##   Country   .model         bp_stat bp_pvalue
##   <fct>     <chr>            <dbl>     <dbl>
## 1 Australia NAIVE(Exports)    14.6     0.148

Performing Ljung-Box test:

#Ljung-Box test
fit %>%
  augment()%>% features(.innov, ljung_box, lag = 10, dof = 0)

## # A tibble: 1 × 4
##   Country   .model         lb_stat lb_pvalue
##   <fct>     <chr>            <dbl>     <dbl>
## 1 Australia NAIVE(Exports)    16.4    0.0896

The NAIVE() approach would be the most appropriate because the data is yearly. With the exception of 2000 to 2010, the residuals appear to vary constantly and have a mean that is very near to zero. There is some autocorrelation at lag 1, as the ACF diagram demonstrates. At a significance threshold of p=0.05, the Box-Pierce and Ljung-Box tests additionally demonstrate that the findings are not significant. This demonstrates that it is impossible to separate the residuals from white noise.

Extracting aus_production data:

# Define and estimate a model
fit <- aus_production %>% 
  filter(!is.na(Bricks)) %>% 
  model(SNAIVE(Bricks))

# Look at the residuals
fit %>% gg_tsresiduals() +
  ggtitle("Residual Plots for Australian Production of Bricks")

## Warning: Removed 4 rows containing missing values (`geom_line()`).

## Warning: Removed 4 rows containing missing values (`geom_point()`).

## Warning: Removed 4 rows containing non-finite values (`stat_bin()`).

# Look at some forecasts
fit %>% forecast() %>% autoplot(aus_production) +
  ggtitle("Australian Production of Bricks")

## Warning: Removed 20 rows containing missing values (`geom_line()`).

#Box-Pierce test, L=2m for seasonal data, m=4
fit %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 8, dof = 0)

## # A tibble: 1 × 3
##   .model         bp_stat bp_pvalue
##   <chr>            <dbl>     <dbl>
## 1 SNAIVE(Bricks)    267.         0

Performing Ljung-Box test:

#Ljung-Box test
fit %>%
  augment()%>% features(.innov, ljung_box, lag = 8, dof = 0)

## # A tibble: 1 × 3
##   .model         lb_stat lb_pvalue
##   <chr>            <dbl>     <dbl>
## 1 SNAIVE(Bricks)    274.         0

The SNAIVE() approach is the best option because brick manufacturing follows a seasonal cycle. The residuals can be distinguished from a white noise series based on the significant findings of the autocorrelation tests. Additionally, because the residuals are left skewed and not centered at 0, they do not follow a normal distribution. The ACF is intriguing as well because it appears to have waves.

Question 5.7:

For your retail time series (from Exercise 7 in Section 2.10):

Create a training dataset consisting of observations before 2011 using

set.seed(12345678)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1)) 

myseries_train <- myseries %>%
  filter(year(Month) < 2011)

Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train %>%
  model(SNAIVE(Turnover))

Check the residuals. Do the residuals appear to be uncorrelated and normally distributed?

fit %>% gg_tsresiduals()  +
  ggtitle("Residual Plots for Australian Retail Turnover")

## Warning: Removed 12 rows containing missing values (`geom_line()`).

## Warning: Removed 12 rows containing missing values (`geom_point()`).

## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).

As the ACF figure shows that the data has some autocorrelation. The residuals appear to be right skewed and are not centered at 0. They also don’t vary all the time. It doesn’t seem like the residuals are regularly distributed and uncorrelated.

Produce forecasts for the test data

fc <- fit %>%
  forecast(new_data = anti_join(myseries, myseries_train))

## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`

fc %>% autoplot(myseries)

The upward trend in the actual data is not taken into account by the anticipated data. The actual data, however, seems to primarily lie inside the 80% confidence interval.

Compare the accuracy of your forecasts against the actual values.

fit %>% accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Norther… Clothin… SNAIV… Trai… 0.439  1.21 0.915  5.23  12.4     1     1 0.768

fc %>% accuracy(myseries)

## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Nort… Clothin… Test  0.836  1.55  1.24  5.94  9.06  1.36  1.28 0.601

The model did quite well on the training set, according to the accuracy metrics, but its performance drastically declined on the test set. This implies that the model might not be generalizing to new data well enough, potentially as a result of overfitting to the training set or modifications to the underlying process of data generation that the model did not take into consideration.

How sensitive are the accuracy measures to the amount of training data used?

The amount of training data used affects accuracy measures, and how you split the data you used can also have an impact. There is improved accuracy with more training data available. There is a tipping point, though, beyond which an excess of training data will be detrimental. Additionally, it will rely on the model you are utilizing. Finding the shortest computed RMSE and doing a cross validation could be the best course of action in solving this problem.

Data-624 Homework 3

Mohammed Rahman

2024-02-17

Question 5.1:

Question 5.2:

Question 5.3:

Question 5.4:

Question 5.7: