HW3

5.1

Australian population

global_economy %>%
  filter(Country == "Australia") %>%
  autoplot(Population)

This plot appears appropriate for the drift method because of the upward trend.

aus_population <- global_economy %>%
  filter(Country == "Australia")

fit <- aus_population %>%
  model(Drift = RW(Population ~ drift()))

aus_population_forecast <- fit %>%
  forecast(h=5)

aus_population_forecast %>%
  autoplot(aus_population)

Brick production

bricks <- aus_production %>%
  # require to remove NAs due to warnings in later autoplot
  filter(!is.na(Bricks)) %>%
  select(Bricks)

autoplot(bricks)

## Plot variable not specified, automatically selected `.vars = Bricks`

There is strong seasonality in the production of Australian bricks so we’ll use SNAIVE.

bricks_fit <- bricks %>%
  model(snaive = SNAIVE(Bricks))

bricks_forecast <- bricks_fit %>%
  forecast(h=20)

bricks_forecast %>%
  autoplot(bricks)

NSW Lambs

lambs <- aus_livestock %>%
  filter(State == "New South Wales", Animal == "Lambs")

lambs %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = Count`

Again there is strong seasonality so we will go for SNAIVE.

lambs_fit <- lambs %>%
  model(snaive = SNAIVE(Count))

lambs_fit %>%
  forecast(h=60) %>%
  autoplot(lambs)

Household wealth

wealth <- hh_budget %>%
  select(Wealth)

wealth %>%
  autoplot()

## Plot variable not specified, automatically selected `.vars = Wealth`

Because this is economic data we will use the naïve method, as recommended by the text.

wealth_fit <- wealth %>%
  model(naive = NAIVE(Wealth))

wealth_fit %>%
  forecast(h=5) %>%
  autoplot(wealth)

I could have selected out the countries but it’s interesting to see the different (albeit very similar) forecasts for each.

Australian takeaway food turnover

takeaway_food <- aus_retail %>%
  filter(Industry == "Takeaway food services") %>%
  summarize(Turnover = sum(Turnover)) %>%
  select(Turnover)

takeaway_food%>%
  autoplot(Turnover)

There is clear seasonality so we’ll use SNAIVE again.

food_fit <- takeaway_food %>%
  model(snaive = SNAIVE(Turnover))

food_forecast <- food_fit %>%
  forecast(h=60)

food_forecast %>%
  autoplot(takeaway_food)

5.2

fb <- gafa_stock %>%
  filter(Symbol == 'FB') %>%
  filter(!is.na(Adj_Close)) %>%
  # using the update_tsibbble and fill_gaps to remove errors in the forecasting step
  tsibble::update_tsibble(regular=TRUE) %>%
  tsibble:: fill_gaps()

autoplot(fb, Adj_Close)

fb_fit <- fb %>%
  model(`Drift` = RW(Adj_Close ~ drift()))

fb_forecast <- fb_fit %>%
  forecast(h=200)

fb_forecast %>%
  autoplot(fb) +
  theme(legend.position = 'none')

fb_new <- fb %>%
  filter_index("2014-01-02", "2018-12-31")
  
ggplot()+
  geom_line(data=fb, aes(x=Date,y=Adj_Close), color = 'black') +
  geom_line(data=fb_new, aes(x=Date,y=Adj_Close), color = 'blue')

We can see that the slope of the line from the first point to the last is the same as the drift forecast line (the slight difference we see is due to the scale of the charts).

fb_other_fit <- fb %>%
  model(
    Mean = MEAN(Adj_Close),
    `Naïve` = NAIVE(Adj_Close),
    `Seasonal naïve` = SNAIVE(Adj_Close)
  )

fb_other_fc <- fb_other_fit %>% forecast(h=100)

fb_other_fc %>%
  autoplot(fb)

## Warning: Removed 1 row(s) containing missing values (geom_path).

I still think drift is the best. The mean takes too many old values into consideration so has an artificially low forecast. Naive is better than mean but doesn’t have the slight upward trend that drift has, so the forecast isn’t considering the gradual growth of adjusted closing price. Seasonal naive is not appropriate as this data does not have seasonality.

5.3

# Extract data of interest
recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production %>% model(SNAIVE(Beer))
# Look at the residuals
fit %>% gg_tsresiduals()

## Warning: Removed 4 row(s) containing missing values (geom_path).

## Warning: Removed 4 rows containing missing values (geom_point).

## Warning: Removed 4 rows containing non-finite values (stat_bin).

# Look a some forecasts
fit %>% forecast() %>% autoplot(recent_production)

The mean of the residuals is close to zero and there does not appear to be significant correlation in the ACF plot. Variation of the residuals stays mostly the same across the data’s interval in the time plot. The residuals seem to have a bimodal distribution so prediction intervals that are computed assuming normal distribution might be wrong. The residuals don’t look like white noise and the forecast seems good.

5.4

aus_exports <- global_economy %>%
  filter(Country == 'Australia')

autoplot(aus_exports,Exports)

The data doesn’t appear seasonal so we’ll use NAIVE.

fit <- aus_exports %>% model(NAIVE(Exports))

fit %>% gg_tsresiduals()

## Warning: Removed 1 row(s) containing missing values (geom_path).

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing non-finite values (stat_bin).

fit %>% forecast() %>% autoplot(aus_exports)

The histogram of the residuals are normally distributed and the variation seems to be very slightly increasing over time in the timeplot. There is no correlation in the ACF plot so I would assume the forecasts are good and that prediction intervals computed assuming a normal distribution would be accurate.

aus_bricks <- aus_production %>%
  select(Bricks)
  
autoplot(aus_bricks)

## Plot variable not specified, automatically selected `.vars = Bricks`

## Warning: Removed 20 row(s) containing missing values (geom_path).

Brick production is seasonal so we’ll use SNAIVE.

fit <- aus_bricks %>%
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks))

fit %>% gg_tsresiduals()

## Warning: Removed 4 row(s) containing missing values (geom_path).

## Warning: Removed 4 rows containing missing values (geom_point).

## Warning: Removed 4 rows containing non-finite values (stat_bin).

fit %>% forecast() %>% autoplot(aus_bricks)

## Warning: Removed 20 row(s) containing missing values (geom_path).

There is a clearer increase in the variation of the residuals as compared to the previous residual time plots at which we’ve looked. Furthermore, the ACF plot is showing high levels of correlation that appears seasonal. The residual histogram has a long left tail – which means assumptions of a normal distribution will lead to incorrect calculations. This appears to be a biased forecast where some of the predictive data have been left in the residuals.

5.7

set.seed(81023948)

myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries_train <- myseries %>%
  filter(year(Month) < 2011)

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

fit <- myseries_train %>%
  model(SNAIVE(Turnover))

fit %>% gg_tsresiduals()

## Warning: Removed 12 row(s) containing missing values (geom_path).

## Warning: Removed 12 rows containing missing values (geom_point).

## Warning: Removed 12 rows containing non-finite values (stat_bin).

The residuals do appear uncorrelated and the distribution is mostly normal (perhaps setting different sized bins would improve that appearance).

fc <- fit %>%
  forecast(new_data = anti_join(myseries, myseries_train))

## Joining, by = c("State", "Industry", "Series ID", "Month", "Turnover")

fc %>% autoplot(myseries)

fit %>% accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tasmania Clothin… SNAIV… Trai… 0.349  1.39  1.07  2.41  9.93     1     1 0.550

fc %>% accuracy(myseries)

## # A tibble: 1 × 12
##   .model    State Industry .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>     <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Tasm… Clothin… Test   8.06  9.29  8.07  30.9  30.9  7.51  6.69 0.713

The forecast data seems to perform significantly worse than the training data with a 30.9% MAPE versus a 9.9% MAPE.

Accuracy measures are sensitive to the amount of data used as well as where the split occurred. There is much more variation post-2011, so the pre-2011 test data turned out to not be as accurate as the training data.