Excercise 5.1:

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Australian Population (global_economy)
Bricks (aus_production)
NSW Lambs (aus_livestock)
Household wealth (hh_budget).
Australian takeaway food turnover (aus_retail).

# Count missing values per column
missing_per_column <- colSums(is.na(global_economy))
print("Missing values per column:")

## [1] "Missing values per column:"

print(missing_per_column)

##    Country       Code       Year        GDP     Growth        CPI    Imports 
##          0          0          0       3322       3756       7480       4554 
##    Exports Population 
##       4563          3

# check Australias population trend
aus_population <- global_economy %>%
    filter(Country == "Australia")

autoplot(aus_population)

## Plot variable not specified, automatically selected `.vars = GDP`

We see increasing trend (population), so using RW(y ~ drift())

aus_economy <- global_economy %>%
    filter(Code == "AUS")

aus_economy %>%
  model(Drift = RW(Population ~ drift())) %>%
  forecast(h = 15) %>%
  autoplot(aus_economy) +
    labs(title = "Australian Population Forecast")

Bricks (aus_production)

data("aus_production")

view(aus_production)

# Count total missing values
total_missing <- sum(is.na(aus_production))
print(paste("Total missing values:", total_missing))

## [1] "Total missing values: 44"

# Count missing values per column
missing_per_column <- colSums(is.na(aus_production))
print("Missing values per column:")

## [1] "Missing values per column:"

print(missing_per_column)

##     Quarter        Beer     Tobacco      Bricks      Cement Electricity 
##           0           0          24          20           0           0 
##         Gas 
##           0

data here is seasonal in quarters so we will the SNAIVE method

autoplot(aus_production, Bricks) +
  labs( title = "Quarterly Production of Bricks in Australia")

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

aus_production %>%
  filter(!is.na(Bricks)) %>%
  model(SNAIVE(Bricks ~ lag("year"))) %>%
  forecast(h = 15) %>%
  autoplot(aus_production) +
    labs(title = "Australian Bricks Production Forecast")

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

NSW Lambs (aus_livestock)

aus_livestock %>%
  filter(State == "New South Wales", 
         Animal == "Lambs") %>%
  model(NAIVE(Count)) %>%
  forecast(h = 24) %>%
  autoplot(aus_livestock) +
  labs(title = "New South Wales Lambs") +
  #scale_y_continuous(labels = label_number(scale = 1e-3, suffix = "K")) +
  theme_minimal()

Household wealth (hh_budget)

hh_budget %>% 
  autoplot(Wealth)

we see has a positive trend, so we’ll use the RW(y ~ drift()) method.

hh_budget %>%
  model(Drift = RW(Wealth ~ drift())) %>%
  forecast(h = 15) %>%
  autoplot(hh_budget) +
    labs(title = "Household wealth Forecast")

We see seasonality per year in the above chart so using SNAIVE here

aus_retail %>%
  filter(State == "South Australia",
         Industry == 'Takeaway food services') %>%
  model(SNAIVE(Turnover ~ lag("year"))) %>%
  forecast(h = 15) %>%
  autoplot(aus_retail) +
    labs(title = "South Australia Takeout Turnover Forecast")

Excercise 5.2:

Use the Facebook stock price (data set gafa_stock) to do the following:

Produce a time plot of the series.
Produce forecasts using the drift method and plot them.
Show that the forecasts are identical to extending the line drawn between the first and last observations.
Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

data("gafa_stock")
unique(gafa_stock$Symbol)

## [1] "AAPL" "AMZN" "FB"   "GOOG"

gafa_stock %>%
  filter( Symbol == "FB") %>%
  autoplot(Close) +
  labs(y = "USD", title = "Closing FB Prices")

Produce forecasts using the drift method and plot them.

fb <- gafa_stock %>%
  filter( Symbol == "FB")

fb_tibble <- as_tsibble(fb, key = "Symbol", index = "Date", regular = TRUE) %>% fill_gaps()

fb_tibble %>%
  model(Drift = RW(Close ~ drift())) %>%
  forecast(h = 15) %>%
  autoplot(fb_tibble) +
    labs(title = "FB Closing Prices Forecast")

Show that the forecasts are identical to extending the line drawn between the first and last observations.

df <- data.frame(x1 = as.Date('2014-01-02'), x2 = as.Date('2018-12-31'), y1 = 54.71, y2 = 131.09)

fb_tibble %>%
model(Drift = RW(Close ~ drift())) %>%
  forecast(h = 120) %>%
  autoplot(fb) +
    labs(title = "FB Closing Price Forecast") +
  geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2, colour = "segment"), data = df)

We can conclude that the line between the first and last observation does match the forecasts.

Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

fb_tibble %>%
  model(
      Mean = MEAN(Close),
      Naive = NAIVE(Close),
      Drift = RW(Close ~ drift())
  ) %>%
  forecast(h = 120) %>%
  autoplot(fb_tibble) +
    labs(title = "South Australian Takeout Turnover Forecast")

We can say that naive benchmark has the smallest prediction interval, so the most suitable method here

Excercise 5.3:

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
recent_production

## # A tsibble: 74 x 7 [1Q]
##    Quarter  Beer Tobacco Bricks Cement Electricity   Gas
##      <qtr> <dbl>   <dbl>  <dbl>  <dbl>       <dbl> <dbl>
##  1 1992 Q1   443    5777    383   1289       38332   117
##  2 1992 Q2   410    5853    404   1501       39774   151
##  3 1992 Q3   420    6416    446   1539       42246   175
##  4 1992 Q4   532    5825    420   1568       38498   129
##  5 1993 Q1   433    5724    394   1450       39460   116
##  6 1993 Q2   421    6036    462   1668       41356   149
##  7 1993 Q3   410    6570    475   1648       42949   163
##  8 1993 Q4   512    5675    443   1863       40974   138
##  9 1994 Q1   449    5311    421   1468       40162   127
## 10 1994 Q2   381    5717    475   1755       41199   159
## # ℹ 64 more rows

fit <- recent_production |> model(SNAIVE(Beer))

Look at the residuals

fit |> gg_tsresiduals()

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

Look at some forecasts

fit |> forecast() |> autoplot(recent_production)

What do you conclude? Results are not white noise.Looks like the values are not independent and random. We see very gradual downwards trend in beer production. Because of this the snaive model is overestimating future beer production from the prior year’s.

Excercise 5.4:

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

data("global_economy")

aus_exports <- global_economy |>
  filter(Country == "Australia")

# Define and estimate a model
fit_aus_exports <- aus_exports |> model(NAIVE(Exports))
# Look at the residuals
fit_aus_exports |> gg_tsresiduals()

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

# forecasts
fit_aus_exports |> forecast() |> autoplot(aus_exports)

I think the native method was more viable as it is an economic time series. This data set does seem to contain white noise as the graph should a large count of residuals at 0. The lags are also close to the dashed line, we can conclude that the forecast is not bias.

fit <- aus_production |>
  filter(!is.na(Bricks)) |>
  model(SNAIVE(Bricks ~ lag("year")))
# Look at the residuals
fit |> gg_tsresiduals()

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 4 rows containing non-finite outside the scale range
## (`stat_bin()`).

There is a high autocorrelation between multiple lags, as well as a seasonal pattern. The histogram is also not uniform, meaning that the model is not suitable for the time series.

fit |> forecast() |> autoplot(aus_production)

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

Excercise 5.7:

For your retail time series (from Exercise 7 in Section 2.10): Create a training dataset consisting of observations before 2011 using

set.seed(10)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries_train <- myseries |>
  filter(year(Month) < 2011)

Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "green")

the data has been split correctly (split is from 2011 onwards)

Fit a seasonal naïve model using SNAIVE() applied to your training data

fit <- myseries_train |>
  model(SNAIVE(Turnover))

Check the residuals.Do the residuals appear to be uncorrelated and normally distributed?

fit |> gg_tsresiduals()

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_bin()`).

The residuals are autocorrelated and are not normally distributed, they are right-tailed.

Produce forecasts for the test data

fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))

## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`

fc |> autoplot(myseries)

Compare the accuracy of your forecasts against the actual values.

fit |> accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Norther… Footwea… SNAIV… Trai… 0.197 0.458 0.341  7.50  13.2     1     1 0.566

The accuracy measures from the training set is smaller in comparison.

How sensitive are the accuracy measures to the amount of training data used?

The errors on the training data are much smaller than the test data. The model performed poorly on the test data, which had fewer points. We can say that the test training dataset performed better than the test dataset.

Data 624 HW3

Deepa Sharma

2024-09-16

Excercise 5.1:

Excercise 5.2:

Excercise 5.3:

Excercise 5.4:

Excercise 5.7: