HW3DATA624

Author

Semyon Toybis

Assignment

We are required to complete questions 5.1, 5.2, 5.3, 5.4, and 5.7 from chapter 5 of “Forecasting: Principles and Practice” Third Edition by Rob Hyndman and George Athanasopoulos.

5.1

We are tasked with using NAIVE(y)SNAIVE(y) or RW(y ~ drift()), which ever is most appropriate, for a variety of time series

Australian Population

First, I plot the data

global_economy |> filter(Country == 'Australia') |> autoplot(Population) + ggtitle('Australian Population')

The time series doesn’t have visible seasonality and has a positive upwards trend. The Drift method, which is equivalent to drawing a line between the first and last observation and extending it forward, may be most appropriate.

aus_pop <- global_economy |> filter(Country == 'Australia') 

aus_pop_fit <- aus_pop |>
  model(Drift = NAIVE(Population ~ drift()))

aus_pop_forecast <- aus_pop_fit |>
  forecast(h = '5 years')

aus_pop_forecast |> autoplot(aus_pop) + ggtitle('Australian Population plus five year forecast with confidence interval')

Bricks

First, I plot the data

aus_production |> select(Bricks) |> autoplot()
Plot variable not specified, automatically selected `.vars = Bricks`
Warning: Removed 20 rows containing missing values or values outside the scale range
(`geom_line()`).

The time series has seasonality and both trend and cyclicity. Given the seasonality, the Seasonal Naive method seems most appropriate, though capturing a potential change in trend would be difficult.

bricksData <- aus_production |> select(Bricks) |> drop_na()

bricks_fit <- bricksData |>
  model(Seaonal_Naive = SNAIVE(Bricks))

bricks_forecast <- bricks_fit |>
  forecast(h = '2 years')

bricks_forecast |> autoplot(bricksData) + ggtitle('Australian Brick Production plus two year forecast with confidence interval')

NSW Lambs

First, I plot the data

aus_livestock |> filter(Animal == 'Lambs', State == 'New South Wales') |> autoplot()
Plot variable not specified, automatically selected `.vars = Count`

There is both trend-cyclicity and seasonality in the data, so I will try the seasonal naive approach

lambData <- aus_livestock |> filter(Animal == 'Lambs', State == 'New South Wales')

lamb_fit <- lambData |>
  model(Seaonal_Naive = SNAIVE(Count))

lamb_forecast <- lamb_fit |>
  forecast(h = '2 years')

lamb_forecast |> autoplot(lambData) + ggtitle('New South Wales Lamb Count plus two year forecast with confidence interval')

Household wealth

First, I plot the data. I will choose to focus on USA household wealth

hh_budget |> filter(Country == 'USA') |> autoplot(Wealth)

There is an upward trend with cyclicity and no evident seasonality since the data is annual. The random walk drift method seems most appropriate.

us_wealth <- hh_budget |> filter(Country == 'USA')

us_wealth_fit <- us_wealth |>
  model(Drift = NAIVE(Wealth ~ drift()))

us_wealth_forecast <- us_wealth_fit |>
  forecast(h = '5 years')

us_wealth_forecast |> autoplot(us_wealth) + ggtitle('US household wealth plus five year forecast with confidence interval')

Australian takeaway food turnover

First, I plot the data. I choose to focus on Western Australia

aus_retail |> filter(State == 'Western Australia', Industry == 'Takeaway food services') |> autoplot(Turnover)

I check to see whether a log transformation makes the variance somewhat more consistent

aus_retail |> filter(State == 'Western Australia', Industry == 'Takeaway food services') |> autoplot(log(Turnover))

The variance here seems more consistent, so I will go with the log data. I will go with a naive seasonal model.

west_aus_takeaway <- aus_retail |> filter(State == 'Western Australia', Industry == 'Takeaway food services')

west_aus_takeaway$logTakeaway <- log(west_aus_takeaway$Turnover)

west_aus_takeaway_fit <- west_aus_takeaway |>
  model(Seaonal_Naive = SNAIVE(logTakeaway))

west_aus_takeaway_forecast <- west_aus_takeaway_fit |>
  forecast(h = '2 years')

west_aus_takeaway_forecast |> autoplot(west_aus_takeaway) + ggtitle('Western Australia Takeaway Food Services turnover plus two year forecast with confidence interval')

5.2

We are tasked on analyzing data on Facebook stock price data from the gafa_stock data set

A

Below I produce a plot of the time series. I will plot the Adj_Close price

gafa_stock |> filter(Symbol=='FB') |> autoplot(Adj_Close)

As discussed in the book, the Index for the time series needs to be adjusted as the data is not daily since there is no data for days when the market is closed.

fb_data <- gafa_stock |> filter(Symbol=='FB') |> mutate(day = row_number()) |>
  update_tsibble(index = day, regular = TRUE)

fb_data |> autoplot(Adj_Close)

B

Below I produce a forecast for the next 20 days using the drift method

fb_fit <- fb_data |>
  model(Drift = RW(Adj_Close~drift()))

fb_forecast <- fb_fit |> forecast(h = 20)

fb_forecast |> autoplot(fb_data) + ggtitle('FB Adj_Close price plus twenty forecast via drift method')

C

We are tasked with showing that the forecast above is equivalent to drawing a line between the first and last points in the time series.

subset_FB_Data <- fb_data[c(1,nrow(fb_data)),]

fb_forecast |> autoplot(fb_data) +
  geom_polygon(data = subset_FB_Data, aes(x= day, y = Adj_Close), color = 'red') +
  ggtitle('FB Adj_Close price plus twenty forecast via drift method')

As we can see in the above plot, the red line is the line between the first and last values of Adj_Close for FB and it connects to the blue line, which is the forecast of Adj_Close.

D

I will try the naive method to forecast the Adj_Close for FB

fb_fit_NAIVE <- fb_data |>
  model(naive = NAIVE(Adj_Close ))

fb_forecast_NAIVE <- fb_fit_NAIVE |> forecast(h = 20)

fb_forecast_NAIVE |> autoplot(fb_data, , level = F) + ggtitle('FB Adj_Close price plus twenty forecast via NAIVE method')
Warning: Plot argument `level` should be a numeric vector of levels to display.
Setting `level = NULL` will remove the intervals from the plot.

The Naive method takes the last value in the time series and assumes that future values will be the same as the last observed value. The Drift method is equal to the last value plus the average change in value that has been seen in the time series. While it is possible that over a smaller window, such as one day, that a stock price remains unchanged, it is unlikely that a stock price will remain unchanged over multiple days into the future. Thus, the Drift method would be more appropriate for a longer term forecast because it incorporates some change in the price of the stock.

5.3

We are asked to use a seasonal naive method on quarterly Australian beer production data and check whether the residuals look like white noise.

# Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_bin()`).

# Look a some forecasts
fit |> forecast() |> autoplot(recent_production)

Based on the above residual diagnostic plot, we can see there is statistically significant negative correlation between the residuals at lag 4. Furthermore, the variance of the residuals is not constant. Based on this, the residuals do not seem to be white noise.

5.4

We are tasked with analyzing the residuals for two different time series

Australian exports

First, I plot the data to determine whether to use the NAIVE or SNAIVE model

global_economy |> filter(Country=='Australia') |> autoplot(Exports)

Based on the above, I will go with the NAIVE model, as the data is annual and thus there is no seasonality.

# Extract data of interest
aus_exports <- global_economy |> filter(Country=='Australia') 
# Define and estimate a model
aus_exports_fit <- aus_exports |> model(NAIVE(Exports))

# Look a some forecasts
aus_exports_fit |> forecast() |> autoplot(aus_exports)

# Look at the residuals
aus_exports_fit |> gg_tsresiduals()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).

Based on the residual diagnostics plot we can see that there is statically significant negative correlation at lag 1 and that the variance of the residuals is not constant. Thus, it seems like the residuals are not white noise.

Australian Bricks production

First, I plot the data to determine whether to use the NAIVE or SNAIVE model

aus_production |> select(Bricks) |> drop_na() |> autoplot(Bricks)

There does appear to be seasonality in the time series so I will go with the SNAIVE model

# Extract data of interest
aus_bricks <- aus_production |> select(Bricks) |> drop_na() 
# Define and estimate a model
aus_bricks_fit <- aus_bricks |> model(SNAIVE(Bricks))

# Look a some forecasts
aus_bricks_fit |> forecast() |> autoplot(aus_bricks)

# Look at the residuals
aus_bricks_fit |> gg_tsresiduals()
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_bin()`).

Based on the residual diagnostic plot we see that there are several statistically significant correlations at various lags and that variance among the residuals is not constant. Thus, it seems like the residuals are not white noise.

5.7

We are tasked with creating a forecast model on one of the time series of retail data from aus_retail. I will select ‘Liquor retailing’ for ‘New South Wales’

aus_liquor_NSW <- aus_retail |> filter(Industry=='Liquor retailing', State=='New South Wales') 

aus_liquor_NSW |> autoplot(Turnover)

A - training data set

aus_liquor_NSW_train <- aus_liquor_NSW |>
  filter(year(Month) < 2011)

B - checking that data was split

autoplot(aus_liquor_NSW_train, Turnover) +
  autolayer(aus_liquor_NSW_train, Turnover, colour = "red")

The data does indeed end before 2011.

tail(aus_liquor_NSW_train)
# A tsibble: 6 x 5 [1M]
# Key:       State, Industry [1]
  State           Industry         `Series ID`    Month Turnover
  <chr>           <chr>            <chr>          <mth>    <dbl>
1 New South Wales Liquor retailing A3349627V   2010 Jul     207.
2 New South Wales Liquor retailing A3349627V   2010 Aug     213.
3 New South Wales Liquor retailing A3349627V   2010 Sep     224.
4 New South Wales Liquor retailing A3349627V   2010 Oct     239.
5 New South Wales Liquor retailing A3349627V   2010 Nov     251.
6 New South Wales Liquor retailing A3349627V   2010 Dec     380 

C - fitting an SNAIVE model

aus_liquor_NSW_fit <- aus_liquor_NSW_train |>
  model(SNAIVE(Turnover))

aus_liquor_NSW_fit
# A mable: 1 x 3
# Key:     State, Industry [1]
  State           Industry         `SNAIVE(Turnover)`
  <chr>           <chr>                       <model>
1 New South Wales Liquor retailing           <SNAIVE>

D - checking the residuals

aus_liquor_NSW_fit |> gg_tsresiduals()
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 12 rows containing non-finite outside the scale range
(`stat_bin()`).

There is statistically significant correlation at several lags and the variance of the residuals is not constant - the residuals are not white noise.

E - producing forecasts

aus_liquor_NSW_forecast <- aus_liquor_NSW_fit |>
  forecast(new_data = anti_join(aus_liquor_NSW, aus_liquor_NSW_train))
Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
aus_liquor_NSW_forecast |> autoplot(aus_liquor_NSW)

The Forecasts are lower than the actual values - the forecast does not capture the increasing trend in the time series

F - comparing accuracy of forecasts vs actual values

aus_liquor_NSW_fit |> accuracy() 
# A tibble: 1 × 12
  State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 New Sou… Liquor … SNAIV… Trai…  6.87  12.3  8.89  5.48  7.78     1     1 0.662

Using the MAPE measure, which converts the errors to percentage terms, we see there is a 7.8% error for the forecast vs actual values in the training set. This does not seem like a high error rate, though it is difficult to say in a vacuum as the tolerance for errors is case dependent.

aus_liquor_NSW_forecast |> accuracy(aus_liquor_NSW)
# A tibble: 1 × 12
  .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SNAIVE(T… New … Liquor … Test   47.6  51.3  47.6  16.8  16.8  5.35  4.18 0.741

The MAPE is 16.8% when predicting values in the test set and comparing them to actual values. This is higher than in the training set but still seems somewhat tolerable, though again the tolerance for error is case dependent and difficult to say in a vacuum.

G - How sensitive are the accuracy measures to the amount of training data used?

To test this, I will repeat the above exercise but with a a smaller training set

aus_liquor_NSW_train_small <- aus_liquor_NSW |>
  filter(year(Month) < 2000)

aus_liquor_NSW_fit_small <- aus_liquor_NSW_train_small |>
  model(SNAIVE(Turnover))


aus_liquor_NSW_forecast_small <- aus_liquor_NSW_fit_small |>
  forecast(new_data = anti_join(aus_liquor_NSW, aus_liquor_NSW_train_small))
Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
aus_liquor_NSW_forecast_small |> autoplot(aus_liquor_NSW)

aus_liquor_NSW_fit_small |> accuracy() 
# A tibble: 1 × 12
  State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 New Sou… Liquor … SNAIV… Trai…  3.42  8.75  6.13  4.36  7.80     1     1 0.742
aus_liquor_NSW_forecast_small |> accuracy(aus_liquor_NSW)
# A tibble: 1 × 12
  .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SNAIVE(T… New … Liquor … Test   116.  135.  116.  48.4  48.4  19.0  15.4 0.898

The MAPE for the errors in the test training data increases to 48.3% when using a smaller training set versus a MAPE of 16.8% in the test set when using a larger training set. Furthermore, the plot showing the actual values versus the forecast shows the larger divergence between forecast and actual values - the model is not capturing the upward trend and the less training data, the more significant this lack of trend becomes. Model accuracy is higher with a larger training set; however, we also need to be cautious to avoid over-fitting the model as that can also lead to poor out of sample performance.