5.1

Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:

Australian Population (global_economy)

global_economy %>% 
  filter(Country == "Australia") %>% 
  autoplot(Population)

The data has a clear trend and no seasonality, so a drift model is appropriate.

global_economy %>% 
  filter(Country == "Australia") %>% 
  model(RW(Population ~ drift())) %>% 
  forecast(h = "10 years") %>% 
  autoplot(global_economy)

Bricks (aus_production)

aus_production %>%
  filter(!is.na(Bricks)) %>% 
  autoplot(Bricks) 

This data appears to have more seasonality than trend, so of the models available, seasonal naïve is most appropriate.

aus_production %>%
  filter(!is.na(Bricks)) %>% 
  model(SNAIVE(Bricks)) %>% 
  forecast(h = "5 years") %>% 
  autoplot(aus_production)
## Warning: Removed 20 row(s) containing missing values (geom_path).

NSW Lambs (aus_livestock)

aus_livestock %>%
  filter(State=="New South Wales" & Animal=="Lambs") %>%
  autoplot(Count)

Lamb1 <- aus_livestock %>%
  filter(State == "New South Wales", Animal == "Lambs")


# Fit the benchmark models
Lamb1fit <- Lamb1 %>%
  model(
    Mean = MEAN(Count),
    `Seasonal Naïve` = SNAIVE(Count),
    `Naïve` = NAIVE(Count),
    Drift = NAIVE(Count ~ drift())
  )

The series trends downward and then levels off, with no strong seasonality. The naïve model is appropriate here, with the mean model arguably a contender.

Lamb1fit %>% 
  forecast(h = 14) %>%
  autoplot(Lamb1)

Household wealth (hh_budget).

hh_budget %>%
  filter(Country=="Australia") %>%
  autoplot(Wealth)

Wealth trends upward with no seasonality, so the drift model is the most appropriate benchmark; the others are fitted below for comparison.

Australia1 <- hh_budget %>%
  filter(Country == "Australia")


# Fit the benchmark models. SNAIVE is omitted: the data are annual, so a
# seasonal naïve specification errors with "Non-seasonal model specification
# provided, use RW() or provide a different lag specification."
AusW_fit <- Australia1 %>%
  model(
    Mean = MEAN(Wealth),
    `Naïve` = NAIVE(Wealth),
    Drift = NAIVE(Wealth ~ drift())
  )
AusW_fit %>% 
  forecast(h = 14) %>%
  autoplot(Australia1)

Australian takeaway food turnover (aus_retail).

takeaway <- aus_retail %>%
  filter(Industry == "Takeaway food services") %>% 
  summarise(Turnover = sum(Turnover))

takeaway %>% autoplot(Turnover)

This data has strong seasonality and a strong trend, so we will use a seasonal naïve model with drift.

takeaway %>% 
  model(SNAIVE(Turnover ~ drift())) %>% 
  forecast(h = "5 years") %>% 
  autoplot(takeaway)

5.2

Use the Facebook stock price (data set gafa_stock) to do the following:

  1. Produce a time plot of the series.
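# gafa_stock has an irregular trading-day index (no weekends or holidays),
# so re-index by row number to get a regular series the models can use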
fb_stock <- gafa_stock %>% 
  filter(Symbol == "FB")%>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE) 
fb_stock %>% autoplot(Close)

  2. Produce forecasts using the drift method and plot them.
# Filter the year of interest
fb_2015 <- fb_stock %>% filter(year(Date) == 2015)
# Fit the models
fb_fit <- fb_2015 %>%
  model(
    `Naïve Drift` = NAIVE(Close~drift()),
    `RW Drift` = RW(Close ~ drift())
  )
# Produce forecasts for the trading days in January 2016
fb_jan_2016 <- fb_stock %>%
  filter(yearmonth(Date) == yearmonth("2016 Jan"))
fb_fc <- fb_fit %>%
  forecast(new_data = fb_jan_2016)
# Plot the forecasts
fb_fc %>%
  autoplot(fb_2015, level = NULL) +
  autolayer(fb_jan_2016, Close, colour = "black") +
  labs(y = "$US",
       title = "Facebook daily closing stock prices",
       subtitle = "(Jan 2015 - Jan 2016)") +
  guides(colour = guide_legend(title = "Forecast"))

  3. Show that the forecasts are identical to extending the line drawn between the first and last observations.

The drift forecasts satisfy ŷ(T+h) = y(T) + h(y(T) − y(1))/(T − 1), so every forecast point lies on the straight line through the first and last observations, extended beyond the last observation: the slope between the last observation and any forecast point equals the slope between the first and last observations.

Obtain the points and compare the slopes.
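A quick numerical check, reusing fb_2015 and fb_fc from above (a sketch; first_close, last_close and slope are just illustrative names):

# Slope of the line through the first and last observations of 2015
first_close <- first(fb_2015$Close)
last_close <- last(fb_2015$Close)
slope <- (last_close - first_close) / (nrow(fb_2015) - 1)

# Each successive drift forecast should step up by exactly this slope
fb_fc %>%
  filter(.model == "RW Drift") %>%
  as_tibble() %>%
  mutate(step = .mean - lag(.mean, default = last_close)) %>%
  pull(step)
# every element should equal `slope` (up to floating-point error)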

  4. Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
# Fit the models
fb_fit2 <- fb_2015 %>%
  model(
    Mean = MEAN(Close),
    `Naïve` = NAIVE(Close),
    Drift = NAIVE(Close ~ drift())
  )

# Produce forecasts for January 2016
fb_fc2 <- fb_fit2 %>%
  forecast(new_data = fb_jan_2016)

# Re-plot forecasts vs. actual
fb_fc2 %>%
  autoplot(fb_2015, level = NULL) +
  autolayer(fb_jan_2016, Close, colour = "black") +
  labs(y = "$US",
       title = "Facebook daily close stock prices",
       subtitle = "(Jan 2015 - Jan 2016)") +
  guides(colour = guide_legend(title = "Forecast"))

The plot above adds the mean and naïve forecasts to the drift forecast. The naïve model is the best of these benchmarks: the mean forecast sits well below recent prices, and the drift forecast extrapolates the 2015 upward trend rather than the seemingly random up-and-down variability seen throughout the series. Stock prices behave much like a random walk, so the naïve forecast is the most defensible.
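Rather than judging by eye alone, the test-set accuracy can be computed; a minimal sketch reusing fb_fc2 and the full fb_stock series:

# Compare the benchmarks against the January 2016 actuals
fb_fc2 %>%
  accuracy(fb_stock) %>%
  arrange(RMSE)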

5.3

Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.

# Extract data of interest
recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production %>% model(SNAIVE(Beer))
# Look at the residuals
fit %>% gg_tsresiduals()
## Warning: Removed 4 row(s) containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing non-finite values (stat_bin).

# Look at some forecasts
fit %>% forecast() %>% autoplot(recent_production)

What do you conclude?

The forecast plot is visually acceptable. Next, a review of the residual plots.

The residual plots largely suggest the forecasts account for the available information. The time plot shows roughly constant variability over time, and the histogram looks symmetric and approximately normal with a mean of 0.

However, the ACF plot shows a significant spike at lag 4, so further investigation is warranted and the model may need adjustment. The Box-Pierce test is significant, indicating the residuals are distinguishable from white noise.
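The test itself is not shown above; a sketch of how to run it on the beer residuals, using the suggested lag of 2m = 8 for quarterly data:

# Portmanteau tests on the innovation residuals of the SNAIVE fit
fit %>%
  augment() %>%
  features(.innov, box_pierce, lag = 8, dof = 0)
fit %>%
  augment() %>%
  features(.innov, ljung_box, lag = 8, dof = 0)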

5.4

Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.

# Extract data of interest (named aus_economy to avoid confusion with the
# aus_production dataset)
aus_economy <- global_economy %>%
  filter(Country == "Australia")

# Define and estimate a model
fit2 <- aus_economy %>% model(NAIVE(Exports))

# Look at the residuals
fit2 %>% gg_tsresiduals()
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).

# Extract the residuals
aug2 <- fit2 %>%
  augment()
# Look at some forecasts
fit2 %>% forecast() %>% autoplot(aus_economy)

# SNAIVE (annual data has no natural seasonal period, so a 4-year lag is imposed)
# Define and estimate a model
fit3 <- aus_economy %>% model(SNAIVE(Exports ~ lag(4)))
# Look at the residuals
fit3 %>% gg_tsresiduals()
## Warning: Removed 4 row(s) containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing non-finite values (stat_bin).

# Look at some forecasts
fit3 %>% forecast() %>% autoplot(aus_economy)

aug2 %>% features(.innov, box_pierce, lag = 8, dof = 0)
## # A tibble: 1 × 4
##   Country   .model         bp_stat bp_pvalue
##   <fct>     <chr>            <dbl>     <dbl>
## 1 Australia NAIVE(Exports)    13.8    0.0858
aug2 %>% features(.innov, ljung_box, lag = 8, dof = 0)
## # A tibble: 1 × 4
##   Country   .model         lb_stat lb_pvalue
##   <fct>     <chr>            <dbl>     <dbl>
## 1 Australia NAIVE(Exports)    15.5    0.0508

Based on the residual diagnostic plots, the NAIVE model is superior. Its ACF plot displays non-significant autocorrelations, and the residual histogram appears symmetric and approximately normal. Both the Box-Pierce and Ljung-Box tests are not significant at the 0.05 level, indicating the residuals are indistinguishable from white noise.

Meanwhile, the SNAIVE residual diagnostics show significant autocorrelations at lags 1 and 4, a left-skewed (non-normal) histogram, and more variability in the residuals over time than the NAIVE model.

The NAIVE model is more appropriate.
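The exercise also asks about the Bricks series from aus_production; it is quarterly with strong seasonality, so SNAIVE() is the natural choice. A sketch along the same lines (bricks and fit_bricks are just illustrative names):

# Bricks is quarterly and strongly seasonal, so use SNAIVE()
bricks <- aus_production %>%
  filter(!is.na(Bricks))
fit_bricks <- bricks %>% model(SNAIVE(Bricks))
# Check the residuals and plot some forecasts
fit_bricks %>% gg_tsresiduals()
fit_bricks %>% forecast() %>% autoplot(bricks)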

5.7

For your retail time series (from Exercise 8 in Section 2.10):

  1. Create a training dataset consisting of observations before 2011 using
set.seed(1111111)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))
autoplot(myseries, Turnover)

myseries_train <- myseries %>%
  filter(year(Month) < 2011)
  2. Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
  autolayer(myseries_train, Turnover, colour = "red")

  3. Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).
fit2 <- myseries_train %>%
  model(SNAIVE(Turnover))  # monthly data, so the default lag of one year applies
  4. Check the residuals.
fit2 %>% gg_tsresiduals()
## Warning: Removed 12 row(s) containing missing values (geom_path).
## Warning: Removed 12 rows containing missing values (geom_point).
## Warning: Removed 12 rows containing non-finite values (stat_bin).

fit2 %>% forecast() %>% autoplot(myseries_train)

Do the residuals appear to be uncorrelated and normally distributed? There are problems with all three residual diagnostic plots: the time plot of residuals shows increasing variability over time, the ACF plot displays several significant autocorrelations, and the histogram of residuals is left-skewed.

  5. Produce forecasts for the test data.
fc <- fit2 %>%
  forecast(new_data = anti_join(myseries, myseries_train))
## Joining, by = c("State", "Industry", "Series ID", "Month", "Turnover")
fc %>% autoplot(myseries)

  6. Compare the accuracy of your forecasts against the actual values.
fit2 %>% accuracy()
## # A tibble: 1 × 12
##   State  Industry  .model .type    ME  RMSE   MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>     <chr>  <chr> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Weste… Newspape… SNAIV… Trai… 0.244  5.56  3.52 -0.569  14.2 0.859 0.981 0.449
fc %>% accuracy(myseries)
## # A tibble: 1 × 12
##   .model   State Industry  .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>    <chr> <chr>     <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(… West… Newspape… Test   4.51  9.52  7.77  9.18  21.3  1.90  1.68 0.528

  7. How sensitive are the accuracy measures to the amount of training data used?

The accuracy measures are quite sensitive to the amount of training data used. Here the training set is much larger than the test set, yet every error measure is substantially worse on the test set than on the training set (e.g. MASE 1.90 vs 0.859). Using too much training data risks feeding old, irrelevant patterns into the forecasts, while using too little undertrains the model and misses the long-term patterns needed for accurate forecasts.
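One way to see this directly is to refit on progressively shorter training windows and compare the test-set accuracy. A sketch, reusing myseries from above (the cutoff years and the names myseries_test and train_start are arbitrary illustrations):

# Refit SNAIVE on shorter training windows and compare test accuracy
myseries_test <- myseries %>% filter(year(Month) >= 2011)
purrr::map_dfr(c(1990, 2000, 2005), function(start) {
  myseries %>%
    filter(year(Month) >= start, year(Month) < 2011) %>%
    model(SNAIVE(Turnover)) %>%
    forecast(new_data = myseries_test) %>%
    accuracy(myseries) %>%
    mutate(train_start = start, .before = 1)
})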