data624-assigment3

5.1 Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Australian Population (global_economy)

first of all, i pull the Australia data under country and plot it. As we can see, the population is a rising continuously.

From the plot, RW(y ~ drift()) is the best one to use because drift is a variation on the naïve method is to allow the forecasts to increase or decrease over time (from 5.2). it is set to be the average change seen in the historical data.

global_economy %>% 
  filter(Country == "Australia") %>% 
  autoplot(Population)

global_economy %>%
  filter(Country == "Australia") %>% 
  model(Drift = RW(Population ~ drift())) %>% 
  forecast(h = 10) %>% 
  autoplot(global_economy)

Bricks (aus_production)

From the plot we can see the data is highly seasonal comparing with the last question. In this case, snaive is more suitable to use. This method is useful for highly seasonal data. this allow us to set each forecast to be equal to the last observed value from the same season.

aus_production %>%
  autoplot(Bricks)

## Warning: Removed 20 row(s) containing missing values (geom_path).

aus_production %>% 
  drop_na(Bricks) %>% 
  model(snaive = SNAIVE(Bricks)) %>% 
  forecast(h = 20) %>% 
  autoplot(aus_production)

## Warning: Removed 20 row(s) containing missing values (geom_path).

NSW Lambs (aus_livestock)

for NSW, I want to try both SNAIVE(y) or RW(y ~ drift()) to see what the difference is.

Like the Bricks (aus_production), the data is seasonal, so SNAIVE(y) is a better option.

nswlambs <- 
  aus_livestock %>% 
  filter(Animal == 'Lambs' & 
         State == 'New South Wales')

nswlambs %>%
  model(drift = RW(Count ~ drift())) %>% 
  forecast(h = 20) %>% 
  autoplot(nswlambs)

nswlambs %>% 
  drop_na(Count) %>% 
  model(snaive = SNAIVE(Count)) %>% 
  forecast(h = 20) %>% 
  autoplot(nswlambs)

Household wealth (hh_budget).

RW(y ~ drift()) is more suitable since the overall trend is rising over the year, so we want to see the average change seen in the historical data.

hh_budget %>% 
  autoplot(Wealth)

hh_budget %>%
  model(drift = RW(Wealth ~ drift())) %>% 
  forecast(h = 20) %>% 
  autoplot(hh_budget)

Australian takeaway food turnover (aus_retail).

I have the same thought like the Household wealth, overall it is rising trend, so i used the same fuction.

austakeaway = aus_retail %>% filter( Industry == 'Takeaway food services', State == "Australian Capital Territory") %>% 
  select(State, Industry, Month, Turnover)

autoplot(austakeaway)

## Plot variable not specified, automatically selected `.vars = Turnover`

austakeaway %>%
  model(drift = RW(Turnover ~ drift())) %>% 
  forecast(h = 20) %>% 
  autoplot(austakeaway)

5.2

Use the Facebook stock price (data set gafa_stock) to do the following:

Produce a time plot of the series.

I want to have a look of the data set and see what the symbol is for facebook. it is for later use.

then we create a plot by using “close” column.

summary(gafa_stock)

##     Symbol               Date                 Open              High        
##  Length:5032        Min.   :2014-01-02   Min.   :  54.02   Min.   :  54.94  
##  Class :character   1st Qu.:2015-04-02   1st Qu.: 118.33   1st Qu.: 119.25  
##  Mode  :character   Median :2016-06-30   Median : 257.59   Median : 261.94  
##                     Mean   :2016-07-01   Mean   : 465.75   Mean   : 469.95  
##                     3rd Qu.:2017-09-29   3rd Qu.: 746.53   3rd Qu.: 750.96  
##                     Max.   :2018-12-31   Max.   :2038.11   Max.   :2050.50  
##       Low              Close           Adj_Close           Volume         
##  Min.   :  51.85   Min.   :  53.53   Min.   :  53.53   Min.   :     7900  
##  1st Qu.: 117.35   1st Qu.: 118.54   1st Qu.: 115.48   1st Qu.:  2519975  
##  Median : 256.89   Median : 259.51   Median : 258.61   Median : 10804400  
##  Mean   : 460.92   Mean   : 465.56   Mean   : 464.24   Mean   : 19493800  
##  3rd Qu.: 738.01   3rd Qu.: 744.79   3rd Qu.: 744.79   3rd Qu.: 29399250  
##  Max.   :2013.00   Max.   :2039.51   Max.   :2039.51   Max.   :266380800

unique(gafa_stock$Symbol)

## [1] "AAPL" "AMZN" "FB"   "GOOG"

fb_stock <- 
  gafa_stock %>% 
  filter(Symbol == 'FB')

fb_stock %>% autoplot(Close)

Produce forecasts using the drift method and plot them.

i select 2015 to 2017 data and create drift plot and forecast the future 20 days trend.

fb_forecast <-
  fb_stock %>% 
  fill_gaps() %>% 
  filter_index("2015-01-01" ~ "2017-01-01") %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE) %>%
  select(Date, Close)

autoplot(fb_forecast)

## Plot variable not specified, automatically selected `.vars = Close`

fb_forecast %>%
  model(drift = RW(Close ~ drift())) %>% 
  forecast(h = 20) %>% 
  autoplot(fb_forecast)

Show that the forecasts are identical to extending the line drawn between the first and last observations.

I added naive model.and we can see naive is more flat.

fb_forecast %>%
  model(
    Naïve = NAIVE(Close),
    Drift = RW(Close ~ drift())) %>% 
  forecast(h = 20) %>% 
  autoplot(fb_forecast)

Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

i added mean, however, we can see it is far off from the trend. and it is not really reason for flat line like navie.

so I would say drift is the best model by far.

# ref from 5.8


fb_forecast %>%
  model(
    Mean = MEAN(Close),
    `Naïve` = NAIVE(Close),
    Drift = RW(Close ~ drift())) %>% 
  forecast(h = 20) %>% 
  autoplot(fb_forecast)

5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

from the test result of augment (Q and Q∗), it is distinguishable from white noise since p value is very small amd from the acf plot, 4 has higher value. Also, we notice how the coefficient is high at lag 4

#Extract data of interest
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
#Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
#Look at the residuals
fit |> gg_tsresiduals()

## Warning: Removed 4 row(s) containing missing values (geom_path).

## Warning: Removed 4 rows containing missing values (geom_point).

## Warning: Removed 4 rows containing non-finite values (stat_bin).

#Look a some forecasts
fit |> forecast() |> autoplot(recent_production)

#ref.5.4

fit %>%
  augment() %>% 
  features(.innov, box_pierce, lag = 10)

## # A tibble: 1 × 3
##   .model       bp_stat bp_pvalue
##   <chr>          <dbl>     <dbl>
## 1 SNAIVE(Beer)    34.4  0.000160

fit %>%
  augment()%>% features(.innov, ljung_box, lag = 10)

## # A tibble: 1 × 3
##   .model       lb_stat lb_pvalue
##   <chr>          <dbl>     <dbl>
## 1 SNAIVE(Beer)    37.8 0.0000412

#What do you conclude?

#

5.4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

I think NAIVE model is more suitable since it is annual data because this method works remarkably well for many economic and financial time series. The plot shows lag 1 has autocorrelation.

ausecon <-
  global_economy %>% 
  filter(Country == "Australia")

ausecon %>%
  autoplot(Exports)

aus_export_model <-
  ausecon %>% 
  model(naive = NAIVE(Exports))

aus_export_model %>% gg_tsresiduals()

## Warning: Removed 1 row(s) containing missing values (geom_path).

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing non-finite values (stat_bin).

aus_export_model %>% 
  forecast(h = 10) %>% 
  autoplot(ausecon)

5.7

For your retail time series (from Exercise 8 in Section 2.10):

Create a training dataset consisting of observations before 2011 using

set.seed(1234)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries_train <- myseries |>
  filter(year(Month) < 2011)

Check that your data have been split appropriately by producing the following plot.

autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).

fit <- myseries_train |>
  model(SNAIVE(Turnover))

Check the residuals.

fit |> gg_tsresiduals()

## Warning: Removed 12 row(s) containing missing values (geom_path).

## Warning: Removed 12 rows containing missing values (geom_point).

## Warning: Removed 12 rows containing non-finite values (stat_bin).

Do the residuals appear to be uncorrelated and normally distributed?

The ACF plot shows it has autocorrelation in the data, The slow decrease in the ACF as the lags increase is due to the trend, while the “scalloped” shape is due to the seasonality. When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in value. So the ACF of a trended time series tends to have positive values that slowly decrease as the lags increase.

Produce forecasts for the test data

fc <- fit |>
  forecast(new_data = anti_join(myseries, myseries_train))

## Joining, by = c("State", "Industry", "Series ID", "Month", "Turnover")

fc |> autoplot(myseries)

Compare the accuracy of your forecasts against the actual values.

fit |> accuracy()

## # A tibble: 1 × 12
##   State    Industry .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tasmania Cafes, … SNAIV… Trai…  1.33  2.90  2.22  6.31  10.7     1     1 0.800

fc |> accuracy(myseries)

## # A tibble: 1 × 12
##   .model  State Industry   .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>   <chr> <chr>      <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE… Tasm… Cafes, re… Test   7.12  9.13  7.58  13.2  14.4  3.42  3.15 0.863

How sensitive are the accuracy measures to the amount of training data used?

The accuracy() function will automatically extract the relevant periods from the data (recent_production in this example) to match the forecasts when computing the various accuracy measures. I think it is sensitive on the data size, i think the data increase should increase the accuracy as well.

data624-assigment3

Jay Lee

2/19/2023