Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:
global_economy_hw <- global_economy |>
  mutate(Country = as.character(Country)) |>
  filter(Country == 'Austria') |>
  select(Year, Country, Population) |>
  as_tsibble(index = Year, key = Country)
#global_economy_hw |> autoplot()
pop_mod <- global_economy_hw |> model(RW(Population ~ drift()))
pop_fc <- pop_mod |> forecast()
pop_fc |> autoplot() +
  autolayer(global_economy_hw, colour = "black")
## Plot variable not specified, automatically selected `.vars = Population`
brc_df <- aus_production |> select(Bricks)
brc_mod <- brc_df |> filter(!is.na(Bricks)) |> model(SNAIVE(Bricks))
brc_fc <- brc_mod |> forecast()
brc_fc |> autoplot() +
  autolayer(brc_df, colour = "black")
## Plot variable not specified, automatically selected `.vars = Bricks`
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
lamb_df <- aus_livestock |> filter(State == 'New South Wales', Animal == 'Lambs')
lamb_mod <- lamb_df |> model(MEAN(Count))
lamb_fc <- lamb_mod |> forecast()
lamb_fc |> autoplot() +
  autolayer(lamb_df, colour = "black")
## Plot variable not specified, automatically selected `.vars = Count`
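Since the exercise limits the choice to NAIVE, SNAIVE or the drift method, and the NSW lamb slaughter series is monthly (and plausibly seasonal), here is a minimal sketch of the seasonal naïve alternative. This is my own addition for comparison; the MEAN fit above is kept as-is.
# Seasonal naive alternative (sketch): monthly livestock data is likely seasonal
lamb_df |>
  model(SNAIVE(Count)) |>
  forecast(h = "2 years") |>
  autoplot(lamb_df)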
house_df <- hh_budget |> select(Year, Wealth) |> as_tsibble(index = Year)
house_mod <- house_df |> model(RW(Wealth ~ drift()) )
house_fc <- house_mod |> forecast()
house_fc |> autoplot() +
  autolayer(house_df, colour = "black")
## Plot variable not specified, automatically selected `.vars = Wealth`
Here I grouped by Month and created a total Australian turnover series called ‘Tot_Turnover’.
retail_df <- aus_retail |>
  filter(Industry == 'Takeaway food services') |>
  as.data.frame() |>
  group_by(Month) |>
  mutate(Tot_Turnover = sum(Turnover)) |>
  select(Month, Tot_Turnover) |>
  ungroup() |>
  distinct() |>
  as_tsibble(index = Month)
retail_mod <- retail_df |> model(RW(Tot_Turnover ~ drift()) )
retail_fc <- retail_mod |> forecast()
retail_fc |> autoplot() +
  autolayer(retail_df, colour = "black")
## Plot variable not specified, automatically selected `.vars = Tot_Turnover`
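As an aside, the same monthly total can be obtained more directly: summarising a tsibble aggregates across the keys (here, the States) within each index value, so the group_by/distinct steps aren't needed. This is just an alternative sketch; retail_df2 is an illustrative name.
# Equivalent aggregation: summarise() on a tsibble sums across all series for each Month
retail_df2 <- aus_retail |>
  filter(Industry == 'Takeaway food services') |>
  summarise(Tot_Turnover = sum(Turnover))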
Use the Facebook stock price (data set gafa_stock) to do the following:
# Drop the irregular Date index and re-index by trading day
fb_df <- gafa_stock |>
  filter(Symbol == 'FB') |>
  as_tibble() |>
  mutate(new_index = row_number()) |>
  select(new_index, Close) |>
  as_tsibble(index = new_index)
fb_df |> autoplot()
## Plot variable not specified, automatically selected `.vars = Close`
fb_mod <- fb_df |> model(RW(Close ~ drift()))
fb_fc <- fb_mod |> forecast(h=50)
fb_fc |> autoplot() +
  autolayer(fb_df, colour = "black")
## Plot variable not specified, automatically selected `.vars = Close`
Note to self: I had to switch the order of the layers to make the geom_segment appear.
fb_df |> autoplot() +
  autolayer(fb_fc, colour = "blue") +
  geom_segment(
    x = min(fb_df$new_index),
    y = fb_df$Close[which.min(fb_df$new_index)],
    xend = max(fb_df$new_index),
    yend = fb_df$Close[which.max(fb_df$new_index)],
    colour = "red"
  )
## Plot variable not specified, automatically selected `.vars = Close`
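To back up the picture numerically, the fitted drift coefficient should equal the slope of that red segment, i.e. the rise from the first to the last closing price divided by the number of steps. This is a small check I added, reusing fb_mod and fb_df from above.
# The drift term reported by tidy() should match this hand-computed slope
fb_mod |> tidy()
(last(fb_df$Close) - first(fb_df$Close)) / (nrow(fb_df) - 1)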
This is an interesting question. Based on the efficient market hypothesis, the last observed price is the best estimate of what the stock is worth, so the naïve forecast might be the best choice.
Then again, the stock is in a downward trend, so maybe the historic mean, which is lower, is a better estimate.
Or possibly the trend will persist and the random walk with drift is the best forecast? I don't think there is a clear answer without some forecast accuracy testing (a quick sketch of that follows after the plot below).
fb_mod2 <- fb_df |> model(RW(Close ~ drift()), NAIVE(Close), MEAN(Close))
fb_fc2 <- fb_mod2 |> forecast(h=150)
fb_fc2 |> autoplot() +
  autolayer(fb_df, colour = "black")
## Plot variable not specified, automatically selected `.vars = Close`
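Here is a rough sketch of that accuracy testing (my own addition, not part of the exercise code): hold out the last 100 trading days, refit the three benchmarks on the rest, and score them on the held-out data. The 100-day holdout and the model labels are arbitrary choices.
# Compare the benchmarks on a simple holdout of the last 100 observations
fb_train <- fb_df |> filter(new_index <= max(new_index) - 100)
fb_train |>
  model(Drift = RW(Close ~ drift()),
        Naive = NAIVE(Close),
        Mean  = MEAN(Close)) |>
  forecast(h = 100) |>
  accuracy(fb_df)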
Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.
What do you conclude?
The residuals appear to be uncorrelated and have a mean of zero. In the ACF plot, only one spike exceeds the significance bounds (around 0.2), further supporting the claim that the residuals are uncorrelated. The histogram of the residuals is slightly bimodal but not skewed, and it is centered on zero.
# Extract data of interest
recent_production <- aus_production |>
filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Look at some forecasts
fit |> forecast() |> autoplot(recent_production)
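To back up the visual check with a formal test, a Ljung-Box test on the innovation residuals can be added (my addition; lag = 8 is twice the seasonal period for quarterly data, and a large p-value is consistent with white noise).
# Portmanteau test: a large lb_pvalue means no evidence against white-noise residuals
augment(fit) |>
  features(.innov, ljung_box, lag = 8)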
Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.
Wow, the Bricks time series has some issues: the residuals are not centered on zero and, based on the ACF plot, they are also correlated. I would recommend a transformation (a sketch follows below).
# Extract data of interest
brc_df <- aus_production |> select(Bricks)
# Define and estimate a model
brc_mod <- brc_df |> filter(!is.na(Bricks)) |> model(SNAIVE(Bricks))
# Look at the residuals
brc_mod |> gg_tsresiduals()
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).
# Look at some forecasts
brc_mod |> forecast() |> autoplot(brc_df)
## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).
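A sketch of the transformation idea (my addition): pick a Box-Cox lambda with the Guerrero feature, refit the seasonal naïve model on the transformed series, and re-check the residuals. The lambda value is whatever Guerrero selects; this may not remove the autocorrelation, but it should stabilise the variance.
# Choose a Box-Cox lambda, then refit SNAIVE on the transformed series
lambda <- brc_df |>
  filter(!is.na(Bricks)) |>
  features(Bricks, features = guerrero) |>
  pull(lambda_guerrero)
brc_df |>
  filter(!is.na(Bricks)) |>
  model(SNAIVE(box_cox(Bricks, lambda))) |>
  gg_tsresiduals()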
The Exports residuals appear to be uncorrelated and centered around zero. The ACF plot shows no significant correlation, and the histogram is approximately normal.
# Extract data of interest
global_economy_hw <- global_economy |>
  mutate(Country = as.character(Country)) |>
  filter(Country == 'Austria') |>
  select(Year, Country, Exports) |>
  as_tsibble(index = Year, key = Country)
# Define and estimate a model
exprt_mod <- global_economy_hw |> model(NAIVE(Exports))
# Look at the residuals
exprt_mod |> gg_tsresiduals()
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).
# Look at some forecasts
exprt_mod |> forecast() |> autoplot(global_economy_hw)
For your retail time series (from Exercise 7 in Section 2.10):
set.seed(555)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries_train <- myseries |>
filter(year(Month) < 2011)
autoplot(myseries, Turnover) +
autolayer(myseries_train, Turnover, colour = "red")
fit <- myseries_train |>
model(SNAIVE(Turnover))
The residuals are NOT uncorrelated and are NOT normally distributed.
fit |> gg_tsresiduals()
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).
fc <- fit |>
forecast(new_data = anti_join(myseries, myseries_train))
## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
fc |> autoplot(myseries)
The error measures on the test data are actually a little lower than on the training data (for example, RMSE 7.97 vs 8.10 and MAPE 3.57 vs 6.25). Even so, I would be cautious about this model, given that the residuals are correlated and not centered around zero.
fit |> accuracy()
## # A tibble: 1 × 12
## State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Wester… Departm… SNAIV… Trai… 4.76 8.10 6.26 4.82 6.25 1 1 0.0518
fc |> accuracy(myseries)
## # A tibble: 1 × 12
## .model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… West… Departm… Test 1.73 7.97 6.24 0.746 3.57 0.996 0.983 0.107
I’m going to reduce the amount of training data (cutting the training set off at the year 2000) and see how that changes the accuracy measures.
myseries_train2 <- myseries |>
filter(year(Month) < 2000)
fit2 <- myseries_train2 |>
model(SNAIVE(Turnover))
fc2 <- fit2 |>
forecast(new_data = anti_join(myseries, myseries_train2))
## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
fc2 |> autoplot(myseries)
The accuracy measures are sensitive to the quality of the model. Reducing the training data to the point that it misses an important trend meant the model was way off, resulting in a massive drop in accuracy: the training RMSE is 6.12, but the test RMSE is 58.7, compared with a test RMSE of 7.97 when the longer training set was used.
fit2 |> accuracy()
## # A tibble: 1 × 12
## State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Weste… Departm… SNAIV… Trai… 3.87 6.12 4.94 5.20 6.54 1 1 -0.0404
fc2 |> accuracy(myseries)
## # A tibble: 1 × 12
## .model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… West… Departm… Test 51.8 58.7 52.0 31.1 31.3 10.5 9.60 0.726
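To get a less split-dependent picture of how sensitive these measures are, here is a time series cross-validation sketch (my addition): stretch_tsibble() re-fits the seasonal naïve model over a growing sequence of training sets, and accuracy() then averages the forecast errors across all of them. The .init and .step values are arbitrary choices.
# TSCV: each training set is 12 months longer than the last, starting from a 4-year base
myseries |>
  stretch_tsibble(.init = 48, .step = 12) |>
  model(SNAIVE(Turnover)) |>
  forecast(h = 12) |>
  accuracy(myseries)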