|> filter(Country == 'Australia') |> autoplot(Population) + ggtitle('Australian Population') global_economy
HW3DATA624
Assignment
We are required to complete questions 5.1, 5.2, 5.3, 5.4, and 5.7 from chapter 5 of “Forecasting: Principles and Practice” Third Edition by Rob Hyndman and George Athanasopoulos.
5.1
We are tasked with using NAIVE(y)
, SNAIVE(y)
or RW(y ~ drift())
, which ever is most appropriate, for a variety of time series
Australian Population
First, I plot the data
The time series doesn’t have visible seasonality and has a positive upwards trend. The Drift method, which is equivalent to drawing a line between the first and last observation and extending it forward, may be most appropriate.
<- global_economy |> filter(Country == 'Australia')
aus_pop
<- aus_pop |>
aus_pop_fit model(Drift = NAIVE(Population ~ drift()))
<- aus_pop_fit |>
aus_pop_forecast forecast(h = '5 years')
|> autoplot(aus_pop) + ggtitle('Australian Population plus five year forecast with confidence interval') aus_pop_forecast
Bricks
First, I plot the data
|> select(Bricks) |> autoplot() aus_production
Plot variable not specified, automatically selected `.vars = Bricks`
Warning: Removed 20 rows containing missing values or values outside the scale range
(`geom_line()`).
The time series has seasonality and both trend and cyclicity. Given the seasonality, the Seasonal Naive method seems most appropriate, though capturing a potential change in trend would be difficult.
<- aus_production |> select(Bricks) |> drop_na()
bricksData
<- bricksData |>
bricks_fit model(Seaonal_Naive = SNAIVE(Bricks))
<- bricks_fit |>
bricks_forecast forecast(h = '2 years')
|> autoplot(bricksData) + ggtitle('Australian Brick Production plus two year forecast with confidence interval') bricks_forecast
NSW Lambs
First, I plot the data
|> filter(Animal == 'Lambs', State == 'New South Wales') |> autoplot() aus_livestock
Plot variable not specified, automatically selected `.vars = Count`
There is both trend-cyclicity and seasonality in the data, so I will try the seasonal naive approach
<- aus_livestock |> filter(Animal == 'Lambs', State == 'New South Wales')
lambData
<- lambData |>
lamb_fit model(Seaonal_Naive = SNAIVE(Count))
<- lamb_fit |>
lamb_forecast forecast(h = '2 years')
|> autoplot(lambData) + ggtitle('New South Wales Lamb Count plus two year forecast with confidence interval') lamb_forecast
Household wealth
First, I plot the data. I will choose to focus on USA household wealth
|> filter(Country == 'USA') |> autoplot(Wealth) hh_budget
There is an upward trend with cyclicity and no evident seasonality since the data is annual. The random walk drift method seems most appropriate.
<- hh_budget |> filter(Country == 'USA')
us_wealth
<- us_wealth |>
us_wealth_fit model(Drift = NAIVE(Wealth ~ drift()))
<- us_wealth_fit |>
us_wealth_forecast forecast(h = '5 years')
|> autoplot(us_wealth) + ggtitle('US household wealth plus five year forecast with confidence interval') us_wealth_forecast
Australian takeaway food turnover
First, I plot the data. I choose to focus on Western Australia
|> filter(State == 'Western Australia', Industry == 'Takeaway food services') |> autoplot(Turnover) aus_retail
I check to see whether a log transformation makes the variance somewhat more consistent
|> filter(State == 'Western Australia', Industry == 'Takeaway food services') |> autoplot(log(Turnover)) aus_retail
The variance here seems more consistent, so I will go with the log data. I will go with a naive seasonal model.
<- aus_retail |> filter(State == 'Western Australia', Industry == 'Takeaway food services')
west_aus_takeaway
$logTakeaway <- log(west_aus_takeaway$Turnover)
west_aus_takeaway
<- west_aus_takeaway |>
west_aus_takeaway_fit model(Seaonal_Naive = SNAIVE(logTakeaway))
<- west_aus_takeaway_fit |>
west_aus_takeaway_forecast forecast(h = '2 years')
|> autoplot(west_aus_takeaway) + ggtitle('Western Australia Takeaway Food Services turnover plus two year forecast with confidence interval') west_aus_takeaway_forecast
5.2
We are tasked on analyzing data on Facebook stock price data from the gafa_stock data set
A
Below I produce a plot of the time series. I will plot the Adj_Close price
|> filter(Symbol=='FB') |> autoplot(Adj_Close) gafa_stock
As discussed in the book, the Index for the time series needs to be adjusted as the data is not daily since there is no data for days when the market is closed.
<- gafa_stock |> filter(Symbol=='FB') |> mutate(day = row_number()) |>
fb_data update_tsibble(index = day, regular = TRUE)
|> autoplot(Adj_Close) fb_data
B
Below I produce a forecast for the next 20 days using the drift method
<- fb_data |>
fb_fit model(Drift = RW(Adj_Close~drift()))
<- fb_fit |> forecast(h = 20)
fb_forecast
|> autoplot(fb_data) + ggtitle('FB Adj_Close price plus twenty forecast via drift method') fb_forecast
C
We are tasked with showing that the forecast above is equivalent to drawing a line between the first and last points in the time series.
<- fb_data[c(1,nrow(fb_data)),]
subset_FB_Data
|> autoplot(fb_data) +
fb_forecast geom_polygon(data = subset_FB_Data, aes(x= day, y = Adj_Close), color = 'red') +
ggtitle('FB Adj_Close price plus twenty forecast via drift method')
As we can see in the above plot, the red line is the line between the first and last values of Adj_Close for FB and it connects to the blue line, which is the forecast of Adj_Close.
D
I will try the naive method to forecast the Adj_Close for FB
<- fb_data |>
fb_fit_NAIVE model(naive = NAIVE(Adj_Close ))
<- fb_fit_NAIVE |> forecast(h = 20)
fb_forecast_NAIVE
|> autoplot(fb_data, , level = F) + ggtitle('FB Adj_Close price plus twenty forecast via NAIVE method') fb_forecast_NAIVE
Warning: Plot argument `level` should be a numeric vector of levels to display.
Setting `level = NULL` will remove the intervals from the plot.
The Naive method takes the last value in the time series and assumes that future values will be the same as the last observed value. The Drift method is equal to the last value plus the average change in value that has been seen in the time series. While it is possible that over a smaller window, such as one day, that a stock price remains unchanged, it is unlikely that a stock price will remain unchanged over multiple days into the future. Thus, the Drift method would be more appropriate for a longer term forecast because it incorporates some change in the price of the stock.
5.3
We are asked to use a seasonal naive method on quarterly Australian beer production data and check whether the residuals look like white noise.
# Extract data of interest
<- aus_production |>
recent_production filter(year(Quarter) >= 1992)
# Define and estimate a model
<- recent_production |> model(SNAIVE(Beer))
fit # Look at the residuals
|> gg_tsresiduals() fit
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_bin()`).
# Look a some forecasts
|> forecast() |> autoplot(recent_production) fit
Based on the above residual diagnostic plot, we can see there is statistically significant negative correlation between the residuals at lag 4. Furthermore, the variance of the residuals is not constant. Based on this, the residuals do not seem to be white noise.
5.4
We are tasked with analyzing the residuals for two different time series
Australian exports
First, I plot the data to determine whether to use the NAIVE or SNAIVE model
|> filter(Country=='Australia') |> autoplot(Exports) global_economy
Based on the above, I will go with the NAIVE model, as the data is annual and thus there is no seasonality.
# Extract data of interest
<- global_economy |> filter(Country=='Australia')
aus_exports # Define and estimate a model
<- aus_exports |> model(NAIVE(Exports))
aus_exports_fit
# Look a some forecasts
|> forecast() |> autoplot(aus_exports) aus_exports_fit
# Look at the residuals
|> gg_tsresiduals() aus_exports_fit
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_bin()`).
Based on the residual diagnostics plot we can see that there is statically significant negative correlation at lag 1 and that the variance of the residuals is not constant. Thus, it seems like the residuals are not white noise.
Australian Bricks production
First, I plot the data to determine whether to use the NAIVE or SNAIVE model
|> select(Bricks) |> drop_na() |> autoplot(Bricks) aus_production
There does appear to be seasonality in the time series so I will go with the SNAIVE model
# Extract data of interest
<- aus_production |> select(Bricks) |> drop_na()
aus_bricks # Define and estimate a model
<- aus_bricks |> model(SNAIVE(Bricks))
aus_bricks_fit
# Look a some forecasts
|> forecast() |> autoplot(aus_bricks) aus_bricks_fit
# Look at the residuals
|> gg_tsresiduals() aus_bricks_fit
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_bin()`).
Based on the residual diagnostic plot we see that there are several statistically significant correlations at various lags and that variance among the residuals is not constant. Thus, it seems like the residuals are not white noise.
5.7
We are tasked with creating a forecast model on one of the time series of retail data from aus_retail. I will select ‘Liquor retailing’ for ‘New South Wales’
<- aus_retail |> filter(Industry=='Liquor retailing', State=='New South Wales')
aus_liquor_NSW
|> autoplot(Turnover) aus_liquor_NSW
A - training data set
<- aus_liquor_NSW |>
aus_liquor_NSW_train filter(year(Month) < 2011)
B - checking that data was split
autoplot(aus_liquor_NSW_train, Turnover) +
autolayer(aus_liquor_NSW_train, Turnover, colour = "red")
The data does indeed end before 2011.
tail(aus_liquor_NSW_train)
# A tsibble: 6 x 5 [1M]
# Key: State, Industry [1]
State Industry `Series ID` Month Turnover
<chr> <chr> <chr> <mth> <dbl>
1 New South Wales Liquor retailing A3349627V 2010 Jul 207.
2 New South Wales Liquor retailing A3349627V 2010 Aug 213.
3 New South Wales Liquor retailing A3349627V 2010 Sep 224.
4 New South Wales Liquor retailing A3349627V 2010 Oct 239.
5 New South Wales Liquor retailing A3349627V 2010 Nov 251.
6 New South Wales Liquor retailing A3349627V 2010 Dec 380
C - fitting an SNAIVE model
<- aus_liquor_NSW_train |>
aus_liquor_NSW_fit model(SNAIVE(Turnover))
aus_liquor_NSW_fit
# A mable: 1 x 3
# Key: State, Industry [1]
State Industry `SNAIVE(Turnover)`
<chr> <chr> <model>
1 New South Wales Liquor retailing <SNAIVE>
D - checking the residuals
|> gg_tsresiduals() aus_liquor_NSW_fit
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 12 rows containing non-finite outside the scale range
(`stat_bin()`).
There is statistically significant correlation at several lags and the variance of the residuals is not constant - the residuals are not white noise.
E - producing forecasts
<- aus_liquor_NSW_fit |>
aus_liquor_NSW_forecast forecast(new_data = anti_join(aus_liquor_NSW, aus_liquor_NSW_train))
Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
|> autoplot(aus_liquor_NSW) aus_liquor_NSW_forecast
The Forecasts are lower than the actual values - the forecast does not capture the increasing trend in the time series
F - comparing accuracy of forecasts vs actual values
|> accuracy() aus_liquor_NSW_fit
# A tibble: 1 × 12
State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 New Sou… Liquor … SNAIV… Trai… 6.87 12.3 8.89 5.48 7.78 1 1 0.662
Using the MAPE measure, which converts the errors to percentage terms, we see there is a 7.8% error for the forecast vs actual values in the training set. This does not seem like a high error rate, though it is difficult to say in a vacuum as the tolerance for errors is case dependent.
|> accuracy(aus_liquor_NSW) aus_liquor_NSW_forecast
# A tibble: 1 × 12
.model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SNAIVE(T… New … Liquor … Test 47.6 51.3 47.6 16.8 16.8 5.35 4.18 0.741
The MAPE is 16.8% when predicting values in the test set and comparing them to actual values. This is higher than in the training set but still seems somewhat tolerable, though again the tolerance for error is case dependent and difficult to say in a vacuum.
G - How sensitive are the accuracy measures to the amount of training data used?
To test this, I will repeat the above exercise but with a a smaller training set
<- aus_liquor_NSW |>
aus_liquor_NSW_train_small filter(year(Month) < 2000)
<- aus_liquor_NSW_train_small |>
aus_liquor_NSW_fit_small model(SNAIVE(Turnover))
<- aus_liquor_NSW_fit_small |>
aus_liquor_NSW_forecast_small forecast(new_data = anti_join(aus_liquor_NSW, aus_liquor_NSW_train_small))
Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
|> autoplot(aus_liquor_NSW) aus_liquor_NSW_forecast_small
|> accuracy() aus_liquor_NSW_fit_small
# A tibble: 1 × 12
State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 New Sou… Liquor … SNAIV… Trai… 3.42 8.75 6.13 4.36 7.80 1 1 0.742
|> accuracy(aus_liquor_NSW) aus_liquor_NSW_forecast_small
# A tibble: 1 × 12
.model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SNAIVE(T… New … Liquor … Test 116. 135. 116. 48.4 48.4 19.0 15.4 0.898
The MAPE for the errors in the test training data increases to 48.3% when using a smaller training set versus a MAPE of 16.8% in the test set when using a larger training set. Furthermore, the plot showing the actual values versus the forecast shows the larger divergence between forecast and actual values - the model is not capturing the upward trend and the less training data, the more significant this lack of trend becomes. Model accuracy is higher with a larger training set; however, we also need to be cautious to avoid over-fitting the model as that can also lead to poor out of sample performance.