5.1, 5.2, 5.3, 5.4 and 5.7
Produce forecasts for the following series using whichever of NAIVE(y), SNAIVE(y) or RW(y ~ drift()) is more appropriate in each case:
Naïve Bayes Classification is a type of probability classifier that applies Bayes’ theorem, which assumes independence between predictors. Naive Bayes classification remains a popular method for the problem of determining a document into one of several categories (e.g. spam, sports, politics) by being used in text classification. In Naise Bayes classification, Bayes’ theorem is used and each data is assumed to be an independent event.
The snaive() function assumes the last value of the same season in the previous year for future forecast values.
Random walk forecast may be appropriate when the data is in a random walk pattern, is relatively stable, and long-term trends are observed.
Australian Population (global_economy)
Utilize Random Walk forecasting due to the necessity of observing long-term trends.
australia_data <- global_economy %>%
filter(Country == "Australia")
missing <- any(is.na(australia_data))
missing
## [1] TRUE
australia_data %>%
model(RW(Population ~ drift())) %>%
forecast(h = 10) %>%
autoplot(australia_data) +
labs(title = "Australia Population Growth using Random Walk Forcast")
Bricks (aus_production)
The snaive() function is used because the data exhibits seasonality.
#aus_production
bricks_data <- aus_production %>%
select(Bricks)
missing <- any(is.na(bricks_data))
missing
## [1] TRUE
bricks_data %>%
filter(!is.na(Bricks)) %>%
model(SNAIVE(Bricks)) %>%
forecast(h = 10) %>%
autoplot(bricks_data) +
labs(title = "Bricks Production Forecast using SNAIVE Model")
NSW Lambs (aus_livestock)
Using the NAIVE method seems appropriate since each data point is independent.
aus_livestock
## # A tsibble: 29,364 x 4 [1M]
## # Key: Animal, State [54]
## Month Animal State Count
## <mth> <fct> <fct> <dbl>
## 1 1976 Jul Bulls, bullocks and steers Australian Capital Territory 2300
## 2 1976 Aug Bulls, bullocks and steers Australian Capital Territory 2100
## 3 1976 Sep Bulls, bullocks and steers Australian Capital Territory 2100
## 4 1976 Oct Bulls, bullocks and steers Australian Capital Territory 1900
## 5 1976 Nov Bulls, bullocks and steers Australian Capital Territory 2100
## 6 1976 Dec Bulls, bullocks and steers Australian Capital Territory 1800
## 7 1977 Jan Bulls, bullocks and steers Australian Capital Territory 1800
## 8 1977 Feb Bulls, bullocks and steers Australian Capital Territory 1900
## 9 1977 Mar Bulls, bullocks and steers Australian Capital Territory 2700
## 10 1977 Apr Bulls, bullocks and steers Australian Capital Territory 2300
## # ℹ 29,354 more rows
Lambs_data <- aus_livestock %>%
filter(State == "New South Wales", Animal == "Lambs")
Lambs_data
## # A tsibble: 558 x 4 [1M]
## # Key: Animal, State [1]
## Month Animal State Count
## <mth> <fct> <fct> <dbl>
## 1 1972 Jul Lambs New South Wales 587600
## 2 1972 Aug Lambs New South Wales 553700
## 3 1972 Sep Lambs New South Wales 494900
## 4 1972 Oct Lambs New South Wales 533500
## 5 1972 Nov Lambs New South Wales 574300
## 6 1972 Dec Lambs New South Wales 517500
## 7 1973 Jan Lambs New South Wales 562600
## 8 1973 Feb Lambs New South Wales 426900
## 9 1973 Mar Lambs New South Wales 496300
## 10 1973 Apr Lambs New South Wales 496000
## # ℹ 548 more rows
Lambs_data %>%
model(NAIVE(Count)) %>%
forecast(h = 10) %>%
autoplot(Lambs_data) +
labs(title = "Lambs Count inNew South Wales")
Household wealth (hh_budget).
Random Walk forecasting is used to observe long-term trends.
hh_budget
## # A tsibble: 88 x 8 [1Y]
## # Key: Country [4]
## Country Year Debt DI Expenditure Savings Wealth Unemployment
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Australia 1995 95.7 3.72 3.40 5.24 315. 8.47
## 2 Australia 1996 99.5 3.98 2.97 6.47 315. 8.51
## 3 Australia 1997 108. 2.52 4.95 3.74 323. 8.36
## 4 Australia 1998 115. 4.02 5.73 1.29 339. 7.68
## 5 Australia 1999 121. 3.84 4.26 0.638 354. 6.87
## 6 Australia 2000 126. 3.77 3.18 1.99 350. 6.29
## 7 Australia 2001 132. 4.36 3.10 3.24 348. 6.74
## 8 Australia 2002 149. 0.0218 4.03 -1.15 349. 6.37
## 9 Australia 2003 159. 6.06 5.04 -0.413 360. 5.93
## 10 Australia 2004 170. 5.53 4.54 0.657 379. 5.40
## # ℹ 78 more rows
wealth_data <- hh_budget %>%
select(Wealth)
missing <- any(is.na(wealth_data))
missing
## [1] FALSE
wealth_data %>%
model(RW(Wealth ~ drift())) %>%
forecast(h = 10) %>%
autoplot(wealth_data) +
labs(title = "Household Wealth using Random Walk Forcast")
Australian takeaway food turnover (aus_retail).
Random Walk forecasting is used to observe long-term trends.
#aus_retail
turnover_data <- aus_retail %>%
filter(Industry == "Cafes, restaurants and takeaway food services")
turnover_data
## # A tsibble: 3,456 x 5 [1M]
## # Key: State, Industry [8]
## State Industry `Series ID` Month Turnover
## <chr> <chr> <chr> <mth> <dbl>
## 1 Australian Capital Territory Cafes, restaurant… A3349606J 1982 Apr 7.6
## 2 Australian Capital Territory Cafes, restaurant… A3349606J 1982 May 6.7
## 3 Australian Capital Territory Cafes, restaurant… A3349606J 1982 Jun 7.1
## 4 Australian Capital Territory Cafes, restaurant… A3349606J 1982 Jul 7.5
## 5 Australian Capital Territory Cafes, restaurant… A3349606J 1982 Aug 7.3
## 6 Australian Capital Territory Cafes, restaurant… A3349606J 1982 Sep 8.1
## 7 Australian Capital Territory Cafes, restaurant… A3349606J 1982 Oct 8.9
## 8 Australian Capital Territory Cafes, restaurant… A3349606J 1982 Nov 9.6
## 9 Australian Capital Territory Cafes, restaurant… A3349606J 1982 Dec 11.2
## 10 Australian Capital Territory Cafes, restaurant… A3349606J 1983 Jan 7.7
## # ℹ 3,446 more rows
forecasted_data <- turnover_data %>%
model(RW(Turnover ~ drift())) %>%
forecast(h = 20)
forecasted_data %>%
autoplot(turnover_data) +
labs(title = "Australian Takeaway Food Turnover Forecast") +
facet_wrap(~State, scales = "free")
Use the Facebook stock price (data set gafa_stock) to do the following:
Produce a time plot of the series.
gafa_stock
## # A tsibble: 5,032 x 8 [!]
## # Key: Symbol [4]
## Symbol Date Open High Low Close Adj_Close Volume
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 2014-01-02 79.4 79.6 78.9 79.0 67.0 58671200
## 2 AAPL 2014-01-03 79.0 79.1 77.2 77.3 65.5 98116900
## 3 AAPL 2014-01-06 76.8 78.1 76.2 77.7 65.9 103152700
## 4 AAPL 2014-01-07 77.8 78.0 76.8 77.1 65.4 79302300
## 5 AAPL 2014-01-08 77.0 77.9 77.0 77.6 65.8 64632400
## 6 AAPL 2014-01-09 78.1 78.1 76.5 76.6 65.0 69787200
## 7 AAPL 2014-01-10 77.1 77.3 75.9 76.1 64.5 76244000
## 8 AAPL 2014-01-13 75.7 77.5 75.7 76.5 64.9 94623200
## 9 AAPL 2014-01-14 76.9 78.1 76.8 78.1 66.1 83140400
## 10 AAPL 2014-01-15 79.1 80.0 78.8 79.6 67.5 97909700
## # ℹ 5,022 more rows
facebook_stock <- gafa_stock %>%
filter(Symbol == "FB")
autoplot(facebook_stock, Close) +
labs(title = "Facebook Close Price")
Produce forecasts using the drift method and plot them.
tsibble is a useful package for handling time series data. The following code demonstrates the use of tsibble to manipulate Facebook stock price data, showcasing its ease of use in managing and analyzing time series data, particularly stock prices over time. tsibble provides a structured approach to working with time series data, enhancing data management and analysis capabilities.
The key point here is that when converting the facebook_stock dataframe to a tsibble using the as_tsibble() function, an error occurs due to missing dates in the original data. This is because tsibbles require regular intervals. Setting regular = TRUE will ensure that the tsibble has regular intervals and prevents errors. However, missing values are still not filled in. Therefore, using the fill_gaps() function fills in the missing values, resulting in a complete time series dataset without errors.
facebook_stock2 <- as_tsibble(facebook_stock, key = "Symbol", index = "Date", regular = TRUE) %>% fill_gaps()
drift_model <- facebook_stock2 %>%
model(Drift = RW(Close ~ drift())) %>%
forecast(h = 90)%>%
autoplot(facebook_stock) +
labs(title = "Facebook Close Price Forcast")
print(drift_model)
Show that the forecasts are identical to extending the line drawn between the first and last observations.
first_close <- facebook_stock$Close[1]
last_close <- tail(facebook_stock$Close, 1)
first_close
## [1] 54.71
last_close
## [1] 131.09
line_df <- data.frame(x1 = as.Date('2014-01-02'), x2 = as.Date('2018-12-31'), y1 = 54.71, y2 = 131.09)
drift_model <- facebook_stock2 %>%
model(Drift = RW(Close ~ drift())) %>%
forecast(h = 90)%>%
autoplot(facebook_stock) +
labs(title = "Facebook Close Price Forcast") +
geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2, colour = "segment"), data = line_df)
print(drift_model)
Try using some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?
auto.arima function & forecast from forecast package.
When comparing drift methods and ARIMA models, ARIMA models are considered superior because they can capture more complex time series structures. The drift method, on the other hand, is used to model a trend that increases or decreases at a constant rate over time. While the drift method can be useful for time series data with relatively simple trends, it may not be suitable for data with more complex patterns that ARIMA models can handle.
auto.arima(facebook_stock2$Close)
## Series: facebook_stock2$Close
## ARIMA(0,1,0)
##
## sigma^2 = 5.112: log likelihood = -2961.13
## AIC=5924.26 AICc=5924.27 BIC=5929.77
facebook_stock2.arima <- arima(facebook_stock2$Close, order=c(1,1,1))
The selected model, ARIMA(1,1,1), will be used to forecast future numerical values.
facebook_stock2.forecast <- forecast(facebook_stock2.arima, h = 90)
Using the selected ARIMA model, future numerical values are predicted using the forecast function from the forecast package. The predictions are presented as a range of values. Refer to the graph for a visual representation.
plot(facebook_stock2.forecast)
The dark blue line represents the regression line, which is the point estimate. The widest light blue area (80%) and gray area (95%) correspond to the interval prediction values, indicating confidence intervals for the predictions.
Apply a seasonal naïve method to the quarterly Australian beer production data from 1992. Check if the residuals look like white noise, and plot the forecasts. The following code will help.
The seasonal naive method sets the future forecast value to the last observed value from the same season in the previous year, similar to the naive method.
# Extract data of interest
recent_production <- aus_production |>
filter(year(Quarter) >= 1992)
# Define and estimate a model
fit <- recent_production |> model(SNAIVE(Beer))
# Look at the residuals
fit |> gg_tsresiduals()
# Look a some forecasts
fit |> forecast() |> autoplot(recent_production)
auto_arima_Beer <- auto.arima(recent_production$Beer, trace=TRUE, test="adf", ic="aicc")
##
## ARIMA(2,0,2) with non-zero mean : Inf
## ARIMA(0,0,0) with non-zero mean : 770.5697
## ARIMA(1,0,0) with non-zero mean : 771.9605
## ARIMA(0,0,1) with non-zero mean : 760.2114
## ARIMA(0,0,0) with zero mean : 1111.425
## ARIMA(1,0,1) with non-zero mean : 761.0401
## ARIMA(0,0,2) with non-zero mean : 754.1458
## ARIMA(1,0,2) with non-zero mean : 756.0566
## ARIMA(0,0,3) with non-zero mean : Inf
## ARIMA(1,0,3) with non-zero mean : 739.417
## ARIMA(2,0,3) with non-zero mean : Inf
## ARIMA(1,0,4) with non-zero mean : 720.5997
## ARIMA(0,0,4) with non-zero mean : 718.1571
## ARIMA(0,0,5) with non-zero mean : Inf
## ARIMA(1,0,5) with non-zero mean : Inf
## ARIMA(0,0,4) with zero mean : Inf
##
## Best model: ARIMA(0,0,4) with non-zero mean
summary(auto_arima_Beer)
## Series: recent_production$Beer
## ARIMA(0,0,4) with non-zero mean
##
## Coefficients:
## ma1 ma2 ma3 ma4 mean
## -0.2380 -0.3937 -0.2247 0.6434 434.3864
## s.e. 0.1064 0.1218 0.1599 0.1104 2.5155
##
## sigma^2 = 833.3: log likelihood = -352.45
## AIC=716.9 AICc=718.16 BIC=730.73
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.8259476 27.87514 22.51016 -0.7114097 5.112844 0.4188737
## ACF1
## Training set -0.01507376
checkresiduals(auto_arima_Beer)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,4) with non-zero mean
## Q* = 70.178, df = 6, p-value = 3.758e-13
##
## Model df: 4. Total lags used: 10
What do you conclude?
Repeat the previous exercise using the Australian Exports series from global_economy and the Bricks series from aus_production. Use whichever of NAIVE() or SNAIVE() is more appropriate in each case.
In the ACF plot, the blue dotted line indicates the significance level for the correlation coefficient, indicating whether it is significantly different from zero. Normality is considered to be achieved when the correlation coefficient crosses this line. Looking at the ACF plot of the Exports series in the global_economy_aus dataset, we can infer that the time series data is close to a steady state and is predictable.
global_economy
## # A tsibble: 15,150 x 9 [1Y]
## # Key: Country [263]
## Country Code Year GDP Growth CPI Imports Exports Population
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan AFG 1960 537777811. NA NA 7.02 4.13 8996351
## 2 Afghanistan AFG 1961 548888896. NA NA 8.10 4.45 9166764
## 3 Afghanistan AFG 1962 546666678. NA NA 9.35 4.88 9345868
## 4 Afghanistan AFG 1963 751111191. NA NA 16.9 9.17 9533954
## 5 Afghanistan AFG 1964 800000044. NA NA 18.1 8.89 9731361
## 6 Afghanistan AFG 1965 1006666638. NA NA 21.4 11.3 9938414
## 7 Afghanistan AFG 1966 1399999967. NA NA 18.6 8.57 10152331
## 8 Afghanistan AFG 1967 1673333418. NA NA 14.2 6.77 10372630
## 9 Afghanistan AFG 1968 1373333367. NA NA 15.2 8.90 10604346
## 10 Afghanistan AFG 1969 1408888922. NA NA 15.0 10.1 10854428
## # ℹ 15,140 more rows
# Extract data of interest
global_economy_aus <- global_economy %>%
filter(Country == "Australia")
# Define and estimate a model
fit <- global_economy_aus %>% model(NAIVE(Exports))
# Look at the residuals
fit %>% gg_tsresiduals() +
ggtitle("Residual Plots for Australian Exports")
# Look at some forecasts
fit %>%
forecast() %>%
autoplot(global_economy_aus) +
ggtitle("Australian Exports Annually")
AR 모형
auto_arima_global_economy_aus <- auto.arima(global_economy_aus$Exports, trace=TRUE, test="adf", ic="aicc")
##
## ARIMA(2,1,2) with drift : Inf
## ARIMA(0,1,0) with drift : 189.2663
## ARIMA(1,1,0) with drift : 185.7032
## ARIMA(0,1,1) with drift : 182.1347
## ARIMA(0,1,0) : 187.9099
## ARIMA(1,1,1) with drift : 181.6751
## ARIMA(2,1,1) with drift : Inf
## ARIMA(1,1,2) with drift : Inf
## ARIMA(0,1,2) with drift : 182.2434
## ARIMA(2,1,0) with drift : 186.0136
## ARIMA(1,1,1) : 184.32
##
## Best model: ARIMA(1,1,1) with drift
summary(auto_arima_global_economy_aus)
## Series: global_economy_aus$Exports
## ARIMA(1,1,1) with drift
##
## Coefficients:
## ar1 ma1 drift
## 0.3931 -0.8459 0.1523
## s.e. 0.2449 0.1790 0.0457
##
## sigma^2 = 1.27: log likelihood = -86.45
## AIC=180.91 AICc=181.68 BIC=189.08
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.02521336 1.087279 0.8844992 -0.6367124 5.362766 0.8983888
## ACF1
## Training set -0.01851169
checkresiduals(auto_arima_global_economy_aus)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,1) with drift
## Q* = 6.1328, df = 8, p-value = 0.6324
##
## Model df: 2. Total lags used: 10
aus_production
## # A tsibble: 218 x 7 [1Q]
## Quarter Beer Tobacco Bricks Cement Electricity Gas
## <qtr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1956 Q1 284 5225 189 465 3923 5
## 2 1956 Q2 213 5178 204 532 4436 6
## 3 1956 Q3 227 5297 208 561 4806 7
## 4 1956 Q4 308 5681 197 570 4418 6
## 5 1957 Q1 262 5577 187 529 4339 5
## 6 1957 Q2 228 5651 214 604 4811 7
## 7 1957 Q3 236 5317 227 603 5259 7
## 8 1957 Q4 320 6152 222 582 4735 6
## 9 1958 Q1 272 5758 199 554 4608 5
## 10 1958 Q2 233 5641 229 620 5196 7
## # ℹ 208 more rows
# Extract data of interest
bricks_data <- aus_production %>%
filter(!is.na(Bricks)) %>%
select(Bricks)
# Define and estimate a model
fit <- bricks_data %>% model(SNAIVE(Bricks))
# Look at the residuals
fit %>% gg_tsresiduals() +
ggtitle("Residual Plots for Bricks Production")
# Look at some forecasts
fit %>%
forecast() %>%
autoplot(bricks_data) +
ggtitle("Bricks Production")
auto_arima_bricks_data <- auto.arima(bricks_data$Bricks, trace=TRUE, test="adf", ic="aicc")
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2) with non-zero mean : Inf
## ARIMA(0,0,0) with non-zero mean : 2373.113
## ARIMA(1,0,0) with non-zero mean : 2018.901
## ARIMA(0,0,1) with non-zero mean : 2216.674
## ARIMA(0,0,0) with zero mean : 2952.723
## ARIMA(2,0,0) with non-zero mean : 2021.714
## ARIMA(1,0,1) with non-zero mean : 2016.569
## ARIMA(2,0,1) with non-zero mean : Inf
## ARIMA(1,0,2) with non-zero mean : 2044.152
## ARIMA(0,0,2) with non-zero mean : 2092.951
## ARIMA(1,0,1) with zero mean : Inf
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(1,0,1) with non-zero mean : 2023.215
##
## Best model: ARIMA(1,0,1) with non-zero mean
summary(auto_arima_global_economy_aus)
## Series: global_economy_aus$Exports
## ARIMA(1,1,1) with drift
##
## Coefficients:
## ar1 ma1 drift
## 0.3931 -0.8459 0.1523
## s.e. 0.2449 0.1790 0.0457
##
## sigma^2 = 1.27: log likelihood = -86.45
## AIC=180.91 AICc=181.68 BIC=189.08
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.02521336 1.087279 0.8844992 -0.6367124 5.362766 0.8983888
## ACF1
## Training set -0.01851169
checkresiduals(auto_arima_bricks_data )
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,1) with non-zero mean
## Q* = 301.56, df = 8, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 10
For your retail time series (from Exercise 7 in Section 2.10):
set.seed(12345678)
myseries <- aus_retail %>%
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
Create a training dataset consisting of observations before 2011 using
myseries_train <- myseries %>%
filter(year(Month) < 2011)
Check that your data have been split appropriately by producing the following plot.
autoplot(myseries, Turnover) +
autolayer(myseries_train, Turnover, colour = "red")
Fit a seasonal naïve model using SNAIVE() applied to your training data (myseries_train).
fit <- myseries_train %>%
model(SNAIVE(Turnover))
fit %>% gg_tsresiduals()
Do the residuals appear to be uncorrelated and normally distributed?
This indicates that there is initially strong autocorrelation among the residuals, but over time, the autocorrelation weakens and eventually disappears. This suggests that the assumption of a stationary time series is not met. It could imply that the model initially did not explain the data well, but its predictions became more accurate over time. Therefore, in this scenario, the residuals can be considered uncorrelated and to follow a normal distribution..
Produce forecasts for the test data
fc <- fit %>%
forecast(new_data = anti_join(myseries, myseries_train))
## Joining with `by = join_by(State, Industry, `Series ID`, Month, Turnover)`
fc %>% autoplot(myseries)
Compare the accuracy of your forecasts against the actual values.
fit %>% accuracy()
## # A tibble: 1 × 12
## State Industry .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Norther… Clothin… SNAIV… Trai… 0.439 1.21 0.915 5.23 12.4 1 1 0.768
fc %>% accuracy(myseries)
## # A tibble: 1 × 12
## .model State Industry .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(T… Nort… Clothin… Test 0.836 1.55 1.24 5.94 9.06 1.36 1.28 0.601
How sensitive are the accuracy measures to the amount of training data used?
Accuracy metrics can be sensitive to the amount of training data used. Generally, using more training data allows for more stable and reliable accuracy measures.
The predictive power of these models is assessed using metrics such as RMSE (smaller is better), MAE (smaller is better), MAPE (smaller is better), MASE (smaller is better), RMSSE (smaller than 1 is better), etc. These metrics are important indicators for evaluation.
By comparing the two tables, it is apparent that the model performs relatively better on the training data but has poorer prediction performance on the test data. This suggests that the model may be overfitting the training data, meaning it is overly tuned to the training data and performs poorly on new data.