DATA 624: Homework 3

Exercise 5.1

First, let’s plot the data for the australian population

aus_pop <- global_economy %>% filter(Country == "Australia") %>% select(Population)
autoplot(aus_pop, Population)

This data is not very seasonal, but has a pretty clear trendline. In this case, a drift model could be a good approach:

# Forecast with a drift model
drift_pop <- aus_pop %>% model(RW(Population ~ drift())) 

# Create and plot forecast
fc <- drift_pop |> forecast(h = 14)
fc %>% autoplot(aus_pop, level = NULL) + 
  autolayer(
    aus_pop,
    colour = "black"
  ) + guides(colour = guide_legend(title = "Forecast")) + labs(title="Drift forecast (blue) of Australian Population figures")

## Plot variable not specified, automatically selected `.vars = Population`

Now let’s do a similar exercise for the Bricks series

bricks <- aus_production %>% select(Bricks)

autoplot(bricks, Bricks)

## Warning: Removed 20 rows containing missing values (`geom_line()`).

This data looks more seasonal, so let’s us a seasonal-naive approach. We’ll need to first filter to ensure our model is not passed any NA values present in the original Bricks series

# Forecast with a seasonal-naive model: We need to filter out NA values
brick_fit <- bricks %>% filter_index(. ~ "2005 Q1") %>% model(SNAIVE(Bricks)) 

# Create and plot forecast of Aus brick production
brick_fc <- brick_fit |> forecast(h = 12)
brick_fc %>% autoplot(bricks, level=NULL) + 
  autolayer(
    filter_index(bricks, "2005 Q3" ~ .),
    colour = "black"
  ) + 
  guides(colour = guide_legend(title = "Forecast"))

## Plot variable not specified, automatically selected `.vars = Bricks`

Now let’s look at our Lambs fromt he aus_livestock data

lambs <- aus_livestock %>% filter (State == "New South Wales" & Animal == "Lambs")

lambs %>% autoplot()

## Plot variable not specified, automatically selected `.vars = Count`

Our lamb production data looks like it could be seasonal. However, a naive approach could be effective here

# Forecast with a naive model
lamb_fit <- lambs %>% model(NAIVE(Count)) 

# Create and plot forecast of Aus brick production
lamb_fc <- lamb_fit |> forecast(h = 24)
lamb_fc %>% autoplot(lambs, level=NULL) + 
  guides(colour = guide_legend(title = "Forecast"))

Let’s repeat this process for household wealth (as a percentage of income:

wealth <- hh_budget %>% filter(Country == "Australia") %>% select(Wealth)

wealth |> autoplot()

## Plot variable not specified, automatically selected `.vars = Wealth`

This data fluctuates over the timespan, but isn’t very seasonal. The general trend makes me think that a drift method is apporporiate here:

# Forecast with a drift model
wealth_fit <- wealth %>% model(RW(Wealth ~ drift())) 

# Create and plot forecast
fc <- wealth_fit |> forecast(h = 5)
fc %>% autoplot(wealth, level = NULL) + 
  autolayer(
    wealth,
    colour = "black"
  ) + guides(colour = guide_legend(title = "Forecast")) + labs(title="Drift forecast (blue) of Australian household wealth (as percentage of income)")

Lastly, we can look at takeout food from aus_retail

takeout <- aus_retail %>% filter(Industry == "Takeaway food services") %>% index_by(yearmonth(Month)) %>% summarise(total_turnover=sum(Turnover))

autoplot(takeout, total_turnover)

This data looks to have a seasonal component visually, so we can use a SNAIVE model. However, there’s a strong trend as well. Let’s build both a drift ans SNAIVE model

# Forecast with a seasonal-naive model: We need to filter out NA values
takeout_fit <- takeout |>
  model(
    Naive = SNAIVE(total_turnover),
    Drift = RW(total_turnover ~ drift())
  )

# Create and plot forecast of Aus takeout turnover
takeout_fc <- takeout_fit |> forecast(h = 12)
takeout_fc %>% autoplot(takeout, level=NULL) + 
  autolayer(
    takeout,
    colour = "black"
  ) + 
  guides(colour = guide_legend(title = "Forecast"))

## Plot variable not specified, automatically selected `.vars = total_turnover`

Exercise 5.2

*Use the Facebook stock price (data set gafa_stock) to do the following:

Produce a time plot of the series. Produce forecasts using the drift method and plot them. Show that the forecasts are identical to extending the line drawn between the first and last observations. Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

First we need t select the Facebook-only time series in our gafa_stock dataset. For simplicity, we’ll use the closing price (Close) used in the text.

facebook <- gafa_stock %>% filter(Symbol=="FB") %>% update_tsibble(regular = TRUE) %>% fill_gaps()

# Plot time series
facebook %>% autoplot(Close) + labs(x="Date", y="Stock Price ($)", title="Closing price of facebook stock")

Now let’s create a drift forecast for the facebook stock data. We have an irregular time series with some gaps, so we’ll fill those with the tsibble::fill_gaps method.

fb_fit <- facebook %>% model(Drift = RW(Close ~ drift()))

# now let's forecaste and plot our drift fit for 100 days out
fb_fc <- fb_fit |> forecast(h = 100)
fb_fc %>% autoplot(facebook, level = NULL) + 
  autolayer(
    facebook,
    colour = "black"
  ) + guides(colour = guide_legend(title = "Forecast")) + labs(title="Drift forecast (blue) of Facebook closing stock price")

## Plot variable not specified, automatically selected `.vars = Open`

Now we’d need to show the drift forecast equals

price_min <- min(facebook$Close)
start_date <- min(facebook$Date)
end_date <- max(fb_fc$Date)
price_end <- max(fb_fc$.mean) # Since oru drift is strictly increasing, we can assume the ending price is the max

fb_fc %>%
  autoplot(facebook) +
  labs(title = "Drift forecast (blue) of Facebook closing stock price", y = "Stock Price ($)") +
  geom_segment(aes(x = start_date, y = price_min, xend = end_date, yend = price_end),
               colour = "orange", linetype = "dashed")

## Warning: Removed 1825 rows containing missing values (`geom_segment()`).

First, let’s try the NAIVE method, which can be good for financial data

fb_fit_naive <- facebook %>% model(Naive = NAIVE(Close))

fb_fc_naive <- fb_fit_naive |> forecast(h = 100)
fb_fc_naive %>% autoplot(facebook, level = NULL) + 
  autolayer(
    facebook,
    colour = "black"
  ) + guides(colour = guide_legend(title = "Forecast")) + labs(title="Naive forecast (blue) of Facebook closing stock price")

## Plot variable not specified, automatically selected `.vars = Open`

We’re dealing with stock price data, so seasonality isn’t as much of a concern here. If the data did in fact show regular seasonality, that would likely indicate that something is afoot, as that would violate the efficient markets hypothesis

Exercise 5.3

# From Hyndman text
# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()

# Look a some forecasts
fit |> forecast() |> autoplot(recent_production)

Our distribution of residuals looks roughly normally-shaped, although the two modal peaks make it appear slightly more off-normal than would be ideal. The ACF plot also shows a strong autocorrelation of residuals with a lag \(l = 4\). This makes sense as we’re dealing with quarterly data. In the time plot of residuals, there doesn’t seem to be an overall pattern, so we can feel more confident that the model is not significantly biased.

Exercise 5.4

Australian economy

We can run a seasonal naive model on the same Bricks series for the Asutralian production data

bricks_production <- aus_production %>% select(Bricks)
# Define and estimate a seasonal-naive model
fit <- bricks_production |> model(SNAIVE(Bricks))
fit |> gg_tsresiduals()

## Warning: Removed 24 rows containing missing values (`geom_line()`).

## Warning: Removed 24 rows containing missing values (`geom_point()`).

## Warning: Removed 24 rows containing non-finite values (`stat_bin()`).

# Plot forecasts
fit |> forecast() |> autoplot(bricks_production)

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

## Warning: Removed 8 rows containing missing values (`()`).

## Warning: Removed 20 rows containing missing values (`geom_line()`).

We see a strong seasonal autocorrelation in the ACF plot above. Overall our residuals look reasonably normally-distribted (though with a slight tail on the left), and there doesn’t seem to be a strong pattern (with the exception of the 1980 recession).

Australian exports

First, let’s plot the austrlian exports from our global_economy datase

aus_exports <- global_economy %>% filter(Country=="Australia") %>% select(Exports)

autoplot(aus_exports)

## Plot variable not specified, automatically selected `.vars = Exports`

This dataset doesn’t appear to be strongly seasonal, so we’re better off running a NAIVE method here:

# Define and estimate a naive model
fit <- aus_exports |> model(NAIVE(Exports))
# Look at the residuals
fit |> gg_tsresiduals()

## Warning: Removed 1 row containing missing values (`geom_line()`).

## Warning: Removed 1 rows containing missing values (`geom_point()`).

## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

# Look a some forecasts
fit |> forecast() |> autoplot(aus_exports)

This looks like the residuals are pretty close to white noise. They’re normally distributed, and there doesn’t appear to be any overarching pattern in the innovation plot. In addition, the ACF plot doesn’t show any strong pattern in residuals vs lag.

Exercise 5.7

# Set up my time series
set.seed(12345678)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries_train <- myseries |>
  filter(year(Month) < 2011)
head(myseries_train)

## # A tsibble: 6 x 5 [1M]
## # Key:       State, Industry [1]
##   State              Industry                      `Series ID`    Month Turnover
##   <chr>              <chr>                         <chr>          <mth>    <dbl>
## 1 Northern Territory Clothing, footwear and perso… A3349767W   1988 Apr      2.3
## 2 Northern Territory Clothing, footwear and perso… A3349767W   1988 May      2.9
## 3 Northern Territory Clothing, footwear and perso… A3349767W   1988 Jun      2.6
## 4 Northern Territory Clothing, footwear and perso… A3349767W   1988 Jul      2.8
## 5 Northern Territory Clothing, footwear and perso… A3349767W   1988 Aug      2.9
## 6 Northern Territory Clothing, footwear and perso… A3349767W   1988 Sep      3

Not we can plot turnover over time from the series selected

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

fit <- myseries_train |>
  model(SNAIVE(Turnover))

# Plot residuals
fit |> gg_tsresiduals()

## Warning: Removed 12 rows containing missing values (`geom_line()`).

## Warning: Removed 12 rows containing missing values (`geom_point()`).

## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).

These residuals have some strong correlation, as shown by the pattern of the ACF plot. The residual distribution looks a bit skewed from normal as well. We can produce new forecasts below:

fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))

## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`

fc |> autoplot(myseries)

# Compare accuracy between fit and forecast
fit |> accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Norther… Clothin… SNAIV… Trai… 0.439  1.21 0.915  5.23  12.4     1     1 0.768

fc |> accuracy(myseries)

## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Nort… Clothin… Test  0.836  1.55  1.24  5.94  9.06  1.36  1.28 0.601

How sensitive are the accuracy measures to the amount of training data used? We can try training on a completely different series

# Sample from different training series
set.seed(12345678)
myseries1 <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

# Set up new training series : from 2005 onwards instead of 2011
myseries_train1 <- myseries1 |>
  filter(year(Month) < 2005)
head(myseries_train1)

## # A tsibble: 6 x 5 [1M]
## # Key:       State, Industry [1]
##   State              Industry                      `Series ID`    Month Turnover
##   <chr>              <chr>                         <chr>          <mth>    <dbl>
## 1 Northern Territory Clothing, footwear and perso… A3349767W   1988 Apr      2.3
## 2 Northern Territory Clothing, footwear and perso… A3349767W   1988 May      2.9
## 3 Northern Territory Clothing, footwear and perso… A3349767W   1988 Jun      2.6
## 4 Northern Territory Clothing, footwear and perso… A3349767W   1988 Jul      2.8
## 5 Northern Territory Clothing, footwear and perso… A3349767W   1988 Aug      2.9
## 6 Northern Territory Clothing, footwear and perso… A3349767W   1988 Sep      3

# Fit TS model to new training series
fit1 <- myseries_train1 |>
  model(SNAIVE(Turnover))

# Plot residuals
fit1 |> gg_tsresiduals()

With less training data (2005 onwards instead of 2011, we see much more variance (uncertainty) in our forecast.

fc1 <- fit1 |>
  forecast(new_data = anti_join(myseries, myseries_train1))
fc1 |> autoplot(myseries)

# Compare accuracy between fit and forecast
fit1 |> accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Norther… Clothin… SNAIV… Trai… 0.401  1.28 0.959  5.51  14.4     1     1 0.816

fc1 |> accuracy(myseries)

## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Nort… Clothin… Test   3.18  3.72  3.20  24.0  24.3  3.34  2.90 0.560

With less training data we also see a higher Root-mean squared error as well as Mean Absolute Error. This implies a worse fit, which makes sense as our model is in effect attempting to predict to a further time horizon. Let’s try this same analysis with a curtailed training set on the “front-end” (i.e., train on less data from the start of the time series)

# Set up new training series : from 2005 onwards instead of 2011
cutoff <- 2000
myseries_train2 <- myseries |>
  filter(year(Month) > cutoff  & year(Month) < 2011)

my_series <- myseries |>
  filter(year(Month) > cutoff)


# Fit TS model to new training series
fit2 <- myseries_train2 |>
  model(SNAIVE(Turnover))

# Forecast and plot
fc2 <- fit2 |>
  forecast(new_data = anti_join(my_series, myseries_train2))
fc2 |> autoplot(myseries)

Let’s look at our prediction accuracy metrics as well. We see a comparable RMSE value. This makes sense from a visual inspection of our data, as the decade of the 90s saw less of a pattern in turnover, with a large spike in 1996. This performance against less training data is more specific to this particular dataset.

# Compare accuracy between fit and forecast
fit2 |> accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Norther… Clothin… SNAIV… Trai… 0.386 0.920 0.725  3.21  6.83     1     1 0.568

fc2 |> accuracy(myseries)

## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Nort… Clothin… Test  0.836  1.55  1.24  5.94  9.06  1.36  1.28 0.601