624assignment3.knit

Assignment 3

Daniel DeBonis

Question 1

library(fpp3)

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──

## ✔ tibble      3.2.1     ✔ tsibble     1.1.6
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.1
## ✔ lubridate   1.9.4     ✔ fable       0.4.1
## ✔ ggplot2     3.5.1

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0     ✔ readr   2.1.5
## ✔ purrr   1.0.4     ✔ stringr 1.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()     masks stats::filter()
## ✖ tsibble::interval() masks lubridate::interval()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Australian population

auspop <- global_economy |>
  filter(Country == "Australia") |>
  select(Year, Population)
autoplot(auspop)

## Plot variable not specified, automatically selected `.vars = Population`

With a clear, consistant historical rising trend, a drift model of extrapolation would work best. It is a fair assumption for purposes of forecasting that this rate of change would remain fairly consistent.

auspop |> 
  model((RW(Population ~ drift()))) |>
            forecast(h = 20) |>
autoplot(auspop, level=95)

Australian Brick Production

ausbrick <- aus_production |>
  select(Bricks, Quarter) |>
  na.omit()
autoplot(ausbrick)

## Plot variable not specified, automatically selected `.vars = Bricks`

There appears to be seasonality within each year cycle here, so a seasonal naive model would be ideal for forecasting in this case.

ausbrick |>
  model(SNAIVE(Bricks ~ lag("year"))) |>
  forecast (h = 20) |>
autoplot(ausbrick, level = 95)

Lambs of New South Wales

nsw_lamb <- aus_livestock |>
  filter(Animal == "Lambs", State == "New South Wales") |>
   select(Count)
autoplot(nsw_lamb)

## Plot variable not specified, automatically selected `.vars = Count`

There is seasonality apparent in the data, so again we should use a seasonal naive model for forecasting.

nsw_lamb |>
  model(SNAIVE(Count)) |>
  forecast(h = 30) |>
  autoplot(nsw_lamb, level = 95)

Household wealth

hhwealth <- hh_budget |>
  select(Wealth) |>
  summarize(Wealth = sum(Wealth))
autoplot(hhwealth)

## Plot variable not specified, automatically selected `.vars = Wealth`

There are some cycles in this trend, but they are not regularly spaced and therefore not seasonal in nature. A drift forecast would be the most useful for this data.

hhwealth |>
  model(RW(Wealth ~ drift())) |>
  forecast (h = 5) |>
autoplot(hhwealth, level = 95)

Turnover in Australian takeaway jobs

takeaway_turnover <- aus_retail |>
  filter(Industry == "Cafes, restaurants and takeaway food services") |>
  group_by(Industry) |>
  summarize(Turnover = sum(Turnover))
autoplot(takeaway_turnover)

## Plot variable not specified, automatically selected `.vars = Turnover`

There is a clear seasonality to this data, again the seasonal naive model would produce the strongest prediction of the options.

takeaway_turnover |>
  model(SNAIVE(Turnover)) |>
  forecast(h = 30) |>
  autoplot(takeaway_turnover, level = 95)

Question 2

Time Series Plot of FB Stock

fbstock <- gafa_stock |>
  filter(Symbol == "FB")
# We need to convert to a tsibble to make the model work
fbstockts <- as_tsibble(fbstock, key = "Symbol", index = "Date", regular = TRUE) |>
  fill_gaps()
  autoplot(fbstockts, Close)

It is not specified whether the open, close, high, or low price for each day is what should be analyzed, so I chose the closing price.

Drift Model Predictions

temp <- fbstockts |>
  model(RW(Close ~ drift())) |>
  forecast (h = 100) |>
  autoplot(fbstockts)
temp

Compare to slope of line formed by start and end points

date <- as.Date('2014/02/01')
y0 <- 54.71
date2 <- as.Date('2018/12/31')
y1 <- 131.09
fbstockts |>
  model(RW(Close ~ drift())) |>
  forecast (h = 100) |>
  autoplot(fbstockts) + 
    geom_segment(aes(x=date, y=y0, xend=date2, yend=y1, col = "red", lwd = 3))

## Warning in geom_segment(aes(x = date, y = y0, xend = date2, yend = y1, col = "red", : All aesthetics have length 1, but the data has 1825 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
##   a single row.

As expected, the line touches the first and last points on the graph. #### Looking at other models

fbstockts |>
  model(NAIVE(Close)) |>
  forecast (h = 100) |>
  autoplot(fbstockts)

With the unpredictability due to the lack of observable seasonality or cyclicity, the Naive model would seem appropriate, though I am not sure that the predictions are much better than the drift model.

fbstockts |>
  model(SNAIVE(Close)) |>
  forecast (h = 100) |>
  autoplot(fbstockts)

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

Without much seasonality observed in the data, it is no surprise that the seasonal naive model does not hold up.

Question 3

We start with what the textbook provides

# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

# Look a some forecasts
fit |> forecast() |> autoplot(recent_production)

With seasonality present in our data, it looks like our forecasts are not just white noise. The up-down pattern shown in our plot of residuals is in line with the observed seasonality. Furthermore, with the spike at a lag of 4 quarters in the autocorrelation graph is also strong evidence of year-based seasonality, which should serve well for seasonal naive forecasting.

Question 4

Repeating for Australian exports

aus_exp <- global_economy |>
    filter(Country == "Australia")
aus_expts <- as_tsibble(aus_exp, key = "Country", index = "Year", regular = TRUE) |>
  fill_gaps()
ggplot(data = aus_expts) +
  geom_line(mapping = aes(x = Year, y = Exports))

We can compare the naive and the seasonal naive methods to see which provides a better forecast.

naivem <- aus_expts |>
  model(NAIVE(Exports))
naivem |> 
  gg_tsresiduals()

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

Looking at the autocorrelation graph, the spike crosses the blue line slightly, once, which is not strong evidence for our naive forecasting model.

naivem |>
  forecast(h = 10) |>
  autoplot(aus_expts)

I tried to use the SNAIVE model for comparison, but the model is giving me an error since it is picking up the data as yearly, and nonseasonal, so this model is even less useful.

Repeating for Brick Production Data

We already generated this data set (“ausbrick”) for an earlier part of the assignment. It was determined that there was seasonality present in the data, and therefore the SNAIVE method would be superior.

snm <- ausbrick |>
  model(SNAIVE(Bricks))
snm |>
  gg_tsresiduals()

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

The autocorrelation shows the strength of the seasonal model, with the data often peaking past the lines. The seasonality is also visible in both the acf and residual plots. This data is clearly not white noise.

snm |>
  forecast(h = 10) |>
  autoplot(ausbrick)

Question 7

I will try to use the same data set that I used last week.

 set.seed(464319)
 myseries <-aus_retail |>
 filter(`Series ID` == sample(aus_retail$`Series ID`,1))

After viewing the data, I can confirm it is the same series: footwear in South Australia.Now to proceed with the textbook’s instructions.

myseries_train <- myseries |>
  filter(year(Month) < 2011)
autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

fit <- myseries_train |>
  model(SNAIVE(Turnover))
fit |>
  gg_tsresiduals()

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).

The residuals have a high correlation according to the acf, and the residual plot is close to normal in its appearance.

fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))

## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`

fc |> autoplot(myseries)

fit |> accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 South A… Footwea… SNAIV… Trai… 0.959  2.19  1.62  5.08  8.23     1     1 0.689

fc |> accuracy(myseries)

## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Sout… Footwea… Test   8.60  9.80  8.60  19.4  19.4  5.32  4.48 0.400

The Mean Absolute Scaled Error is 5.32 in comparing the predicted values to the actual values. This is higher than the other model, as well as most other relevant calculations given.

Increasing amount of observations used in training to see if that affects model.

myseries_train2 <- myseries |>
  filter(year(Month) < 2017)

fit <- myseries_train2 |>
  model(SNAIVE(Turnover))
fit |>
  gg_tsresiduals()

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).

fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train2))

## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`

fc |> autoplot(myseries)

I think that this shows that the number of observations used in training can have an impact on the quality of your model. In this case, we see more variance in the later years now included in our training model compared to cutting off at 2011. As a result, our predictions are lower than they should be since they include so much data from earlier, which lower values and less variance (though maintaining seasonality).

fit |> accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 South A… Footwea… SNAIV… Trai…  1.11  3.01  2.12  4.81  8.67     1     1 0.748

fc |> accuracy(myseries)

## # A tibble: 1 × 12
##   .model   State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE   ACF1
##   <chr>    <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 SNAIVE(… Sout… Footwea… Test  0.421  2.81  1.97 0.929  4.51 0.930 0.936 -0.356

Perhaps the model did improve more than I expected. The MASE of the new model is under 1. Our model did not include the extremes seen in the data, but it did cover the rest of the points better.