#Create a training settrain <- my_series %>%filter(Month <yearmonth("2019 Jan"))#plot the time series with appropriate labelsplot = train %>%autoplot() +ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(1982 - 2020)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()
Plot variable not specified, automatically selected `.vars = y`
show code
#Original plotplot
Very strong seasonality and trend component. handling this with seasonal differencing is required due to seasonality.First order differencing is required as well due to the linear trend. However, transformation is necessary to stabilize the variance, in order to create a simpler and consistent pattern for more accurate forecast.
I.II Initial Data Transformation Selection
show code
#Logarithmic TransformationLog = train %>%autoplot(log(y)) +xlab("Month") +ylab("Logarithmic Turnover") +ggtitle("Logarithmic Transformation") +theme_minimal()#Squared-Root TransformationSQR = train %>%autoplot(sqrt(y)) +xlab("Month") +ylab("Squared-Root of Turnover ") +ggtitle("Squared-Root Transformation") +theme_minimal()#Inference TransformationINF = train %>%autoplot(-1/ y) +xlab("Month") +ylab("Inference of Turnover ") +ggtitle("Inference Transformation") +theme_minimal()#calculate the lambda value through Guerrero methodlambda <- train %>%features(y, features = guerrero) %>%pull(lambda_guerrero)print(paste("Guerrero's value is", lambda))
[1] "Guerrero's value is 0.0228463936805025"
show code
#Plot the box cox transformation using guerrero as the lambda valuebox_cox = train %>%autoplot(box_cox(y, lambda)) +xlab("Month") +ylab("Box Cox of Turnover ") +ggtitle("Box Cox Transformation") +theme_minimal()#see the plot(box_cox + SQR ) / (Log + INF)
The lambda value from the Guerrero method suggest 0.0228 value, indicates 0 value would be the best in a box cox transformation. 0 lambda value would compute a natural logarithm instead. Furthermore, squared root still has sharp variances and the inference is inconsistent. Therefore, Log transformation would be better.
I.III Initial Data Differencing
show code
#transformationLog
show code
#does it need differencing?train %>%features(log(y), feat_stl)
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_line()`).
The seasoned differencing did well by handling the seasonality component from the data. It looks fairly stationary, but indicates a few trending behaviour on the line graph. That is, there are still visible trend patterns or movement in the data.Therefore, following with first order difference would be sufficient.
show code
#first order differencing + logarithmic transformationtrain %>%autoplot(difference(log(y))) +theme_minimal()
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).
The first order differencing sufficiently removes visible trending movement in the dataset. Looks somewhat stationary but the spikes from the plot indicates a strong seasonality component. Therefore, combining first order and seasoned differencing would be ideal to handle both time series components.
show code
#does it still need differencing?train %>%mutate(y =difference(difference(log(y),12),1)) %>%features(y, feat_stl)
#0.0160 seasonal strength, below 0.64, no more. #check again.train %>%mutate(y =difference(difference(log(y),12),1)) %>%features(y, unitroot_nsdiffs)
# A tibble: 1 × 1
nsdiffs
<int>
1 0
show code
#seasoned difference not required (n=0)train %>%mutate(y =difference(difference(log(y),12),1)) %>%features(y, unitroot_ndiffs)
# A tibble: 1 × 1
ndiffs
<int>
1 0
show code
#first order difference not required (n=0)#that's enough.#plot the difference combinationtrain %>%autoplot(difference(difference(log(y),12),1)) +theme_minimal()
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_line()`).
Most stationary, first order and seasoned differencing has compliment both limitations well. Thus, successfully reducing the trending and seasonality component. No predictable patterns detected. However, there are few long spikes in the beginning but they appear aperiodic. Differencing stabilizes the mean and transformation stabilizes the variances. Select this.
II.I Auto Correlation Function (ACF) and Partial Auto Correlation Function (PACF)
show code
#first order differencing + transformation ACF and PACFtrain %>%mutate(y =difference(difference(log(y),12),1)) %>%gg_tsdisplay(plot_type ="partial", lag =24) +labs(title ="ACF plot with log transformation and first order + seasoned differencing", y="")
Plot variable not specified, automatically selected `y = y`
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 13 rows containing missing values or values outside the scale range
(`geom_point()`).
show code
#lag = 24, instead of lag = 36. 36 lags might add more parameters
Both plots has significant spikes at lag 12 and PACF lags on 12, 24, and 36 are decaying in significance indicating an MA(1) on seasonal ARIMA part. Before lag 12, ACF has significant lags on 1, 3, 7 while PACF are on lag 1, 2 and 7. AR(2) would be more parsimonious for the model than MA(3) while putting AR(7) or MA(7) is too complex. Therefore, ARIMA(2,1,0)(0,1,1) is sufficient.
#residuals is not white noise, sufficient evidence to reject at 95% confidence level#Generate the Box Pierce Valueaugment(fit)%>%features(.innov, box_pierce,lag =24, dof =4)
Residuals are approximately Gaussian, centered around 0. Innovation residuals seems homoskedastic, with constant variances that are centered around 0 as well. ACF shows significant spikes, thus, autocorrelation is present, not white noise. Both hypothesis test reject the null at 95% confidence with 4 degrees of freedom at lag 24.
III. Alternative ARIMA combinations
III.I Manual Selection
Consider four alternative ARIMA models. Use information criteria to choose the best model considered so far.
show code
#use up all possible alternativesfit_4= train %>%model(arima210_011 =ARIMA(log(y) ~pdq(2, 1, 0) +PDQ(0, 1, 1)),arima013_011 =ARIMA(log(y) ~pdq(0, 1, 3) +PDQ(0, 1, 1)),arima210_210 =ARIMA(log(y) ~pdq(2, 1, 0) +PDQ(2, 1, 0)),arima013_210 =ARIMA(log(y) ~pdq(0, 1, 3) +PDQ(2, 1, 0)))#View the Information Criterionglance(fit_4) %>%arrange(AICc) %>%select(.model:BIC)%>%gt() %>%# Add a title and subtitletab_header(title =md("**Information Criterion of ARIMA**"),subtitle =md("Using three different models") ) %>%# Format number columns with commasfmt_number(decimals =4 ) %>%# Add borders and styletab_style(style =cell_borders(sides ="all",color ="grey",weight =px(1) ),locations =cells_body() ) %>%# Apply bold style to headerstab_style(style =cell_text(weight ="bold"),locations =cells_column_labels(everything()) )
Error in tab_style(., style = cell_text(weight = "bold"), locations = cells_column_labels(everything())): could not find function "tab_style"
The Arima that was chosen at question 2 has the lowest AICc value, the ARIMA(2,1,0)(0,1,1)[12]. The rest have more parameters, which could suggest lower parameters might be more parsimonious and provides a better fit to the data relative to the complexity of the model.
III.II Automatic ARIMA() Selection
show code
#let R work harder to find all possible options.fit_r = train %>%model(arima_r =ARIMA(log(y), stepwise =FALSE, approx =FALSE, order_constraint = p + q + P + Q <=9 ))
Warning: Model specification induces a quadratic or higher order polynomial trend.
This is generally discouraged, consider removing the constant or reducing the number of differences.
show code
#make it extra hard for the ARIMA() function to explore #check the model selectionfit_r$arima_r
<lst_mdl[1]>
[1] <ARIMA(4,1,2)(1,1,2)[12]>
show code
#ARIMA(4,1,2)(1,1,2)[12]#combined to see it bettercombined_fit =cross_join(fit_r, fit_4)#select Q4 chosen model and ARIMA() modelcombined_fit[c("arima210_011","arima_r")] %>%glance() %>%arrange(AICc) %>%select(.model:BIC)
#residuals is not white noise, sufficient evidence to reject at 95% confidence level#Generate the Box Pierce Valueaugment(fit_r)%>%features(.innov, box_pierce,lag =24, dof =6)
R-select model is ordered ARIMA(4,1,2)(1,1,2)[12]. R-select is different by combining both seasonal and non seasonal AR and MA. Both models uses the same amount of differencing. Choose the R-select model due to lower AICc value.
Residuals are approximately Gaussian, centered around 0. Innovation residuals seems homoskedastic, with constant variances that are centered around 0 as well. ACF plot has no significant spikes at all, no autocorrelation present. The portmanteu test accepts the null hypothesis at 95% confidence level. It is white noise. This indicate a good model that sufficiently captured all variations in the data and its not biased.
III.III ARIMA Champion Model Selection
show code
#Comparison of their accuracy measurescombined_fit %>%forecast(h =24)%>%accuracy(my_series) %>%arrange(RMSE)
# A tibble: 5 × 10
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 arima013_210 Test 17.6 38.0 24.3 3.64 5.66 1.51 1.68 0.709
2 arima210_210 Test 18.5 38.8 24.9 3.88 5.79 1.54 1.72 0.714
3 arima013_011 Test 20.9 40.9 27.5 4.41 6.38 1.70 1.82 0.718
4 arima210_011 Test 21.9 41.7 27.9 4.65 6.47 1.73 1.85 0.720
5 arima_r Test 26.3 45.4 31.3 5.71 7.23 1.94 2.02 0.733
show code
#Comparison of AICC AIC and BICcombined_fit %>%glance() %>%arrange(AICc) %>%select(.model:AICc) %>%gt() %>%tab_header(title ="ARIMA model comparison",subtitle ="Sorted by Akaike Infomation Criterion (AIC)") %>% gt::tab_style(style =cell_text(weight ="bold"), locations =cells_column_labels(everything()))
ARIMA model comparison
Sorted by Akaike Infomation Criterion (AIC)
.model
sigma2
log_lik
AIC
AICc
arima_r
0.002789766
647.7865
-1275.573
-1275.045
arima210_011
0.002849818
640.4067
-1272.813
-1272.719
arima013_011
0.002860393
640.1332
-1270.266
-1270.124
arima210_210
0.003461398
604.7224
-1199.445
-1199.303
arima013_210
0.003480290
604.1067
-1196.213
-1196.014
show code
#All arima model have same amount of differencing
Interestingly, ARIMA(4,1,2)(1,1,2) performed the worse in the test set but has the lowest AICc. ARIMA(0,1,3)(2,1,0) has the lowest error values in all error measure but has the highest AICc. Nevertheless, ARIMA(4,1,2)(1,1,2) is chosen due to lower AICc as it reflects this model fit the best to the entire dataset.
IV. Forecast the next two years
show code
#whole datasetfit_r %>%forecast(h=24) %>%autoplot(my_series) +ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(2016 - 2022) (Forecast to 2022)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()
Zoom in
show code
#zoom in fit_r %>%forecast(h=24) %>%autoplot(my_series[ 420:465,]) +ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(2016 - 2022) (Forecast to 2022)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()
It shows reasonable mean over the next two years because the movement of the prediction intervals suggest it capture trend and seasonal. Prediction intervals are quite wide around the point forecast and suggest high degree of uncertainty especially for periods further in the future.
V. Whole data forecast
V.I Full Dataset Plot
show code
library(readxl)#read the data and perform data wranglingexcel =read_xlsx("Datasets/8501011.xlsx", sheet =2, skip =9)#create a tsibble datasetdf = excel %>%select("Series ID", A3349637X) %>%rename(Month =`Series ID`,y = A3349637X) %>%mutate(Month =yearmonth(Month)) %>%as_tsibble(index = Month)#plot the datadf %>%autoplot()+ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(1982 Apr - 2023 March)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()
Plot variable not specified, automatically selected `.vars = y`
This dataset is a continuation of the previous monthly turnover dataset. The plot suggests that it might benefit during covid-19 as there is a visible jump on the spikes from 2020 and a slight increase overall onwards.
V.II Forecast using three different methods
show code
#Train train_df = df %>%slice(1:(n()-27)) #best benchmark, Best ETS and ARIMA)best_model = train_df %>%model(ETS =ETS(log(y) ~error("M") +trend("Ad") +season("M")),ARIMA =ARIMA(log(y) ~0+pdq(4,1,2) +PDQ(1,1,2)),SNAIVE =SNAIVE(log(y))) #plot on the whole data setbest_model %>%forecast(h =27) %>%autoplot(df) +ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(1982 Apr - 2023 Mar) (with forecast)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()
show code
#zoom inbest_model %>%forecast(h =27) %>%autoplot(df[430:492,], alpha =0.6)+ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(2018 Jan - 2023 Mar) (with forecast)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()
show code
#seperatebest_model %>%forecast(h =27) %>%autoplot(df[430:492,]) +facet_wrap(~.model)+ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(2018 Jan - 2023 Mar) (with forecast)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()
Out of all three models, it seems that the model for ETS is the closest from the actual data to the point forecast overall. However, The ETS Model has the widest confidence interval overall. Thus, ETS method has higher level of uncertainty in predicting the future values of the time series compared to the other methods. ETS and ARIMA models has capture the seasonality component and trend component. SNAIVE models use the seasonal pattern from historical data to directly forecast future values.
As SNAIVE is more sensitive to the past data, the forecast seem to do better due to Covid-19 increasing the turnover, allowing the actual data to meet a few of the forecast. For ETS model, it seems that the weight given to the past observations when forecasting future values align perfectly with the positive effects of Covid-19. The ARIMA method seems to perform the worse as the point forecast seem to slowly get farther away from the actual data over time.
V.III Accuracy Evaluation
show code
acc_table = best_model%>%forecast(h =27) %>%accuracy(df) %>%arrange(RMSE)#evaluate each model's accuracyacc_table %>%gt() %>%# Add a title and subtitletab_header(title =md("**Accuracy Metrics of Forecast**"),subtitle =md("Using three different models") ) %>%# Format number columns with commasfmt_number(decimals =4 ) %>%# Add borders and style gt::tab_style(style =cell_borders(sides ="all",color ="grey",weight =px(1) ),locations =cells_body() ) %>%# Apply bold style to headers gt::tab_style(style =cell_text(weight ="bold"),locations =cells_column_labels(everything()) )
Accuracy Metrics of Forecast
Using three different models
.model
.type
ME
RMSE
MAE
MPE
MAPE
MASE
RMSSE
ACF1
ETS
Test
−13.9926
27.8895
21.1148
−3.2708
4.8783
1.2123
1.1129
0.2490
SNAIVE
Test
0.5025
31.3555
26.0714
0.3471
6.3103
1.4968
1.2513
0.5060
ARIMA
Test
−35.3641
44.1171
35.3641
−8.3509
8.3509
2.0304
1.7605
0.3795
ETS performed the best, it has the lowest RMSE, MAE, MPE and MAPE from all 3 models. Suggesting that the accuracy of the point forecast for ETS is the most accurate forecast as compare to Seasonal Naive method and ARIMA method for this dataset.
V.IV Two years ahead forecast evaluation
show code
#Create your final fit final_fit =df %>%model(ETS =ETS(log(y) ~error("M") +trend("Ad") +season("M")),ARIMA =ARIMA(log(y) ~0+pdq(4,1,2) +PDQ(1,1,2)),SNAIVE =SNAIVE(log(y))) #plot the forecast on the whole data setfinal_fit %>%forecast(h =24) %>%autoplot(df) +ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(1982 Apr - 2023 Mar) (forecast to 2025 Mar)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()
show code
#zoom infinal_fit %>%forecast(h =24) %>%autoplot(df[454:492,], alpha =0.6)+ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(2020 Jan - 2023 Mar) (forecast to 2025 Mar)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()
show code
#seperatefinal_fit %>%forecast(h =24) %>%autoplot(df[454:492,]) +facet_wrap(~.model)+ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(2020 Jan - 2023 Mar) (forecast to 2025 Mar)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()
All three model attempted to forecast on the recovery period of Covid-19.
These three model Forecast shows reasonable mean over the next two years because the movement of the prediction intervals suggest it capture trend and seasonal variation except SNAIVE model. The three model’s prediction intervals are quite wide around the point forecast and suggest high degree of uncertainty especially for periods further in the future. Implying that the actual values are expected to vary widely around the mean forecasts over time.
SNAIVE model rely on the seasonal patterns observed in historical data to directly forecast future values. Therefore, it does not consider the trend component. In this case, it has given a wider prediction interval compared to the rest. It gives a similair movement as to post 2020 or during Covid-19 times.
The ETS model breaks down the time series into error, damped trend, and seasonality components to capture various patterns and dynamics in the forecast.ARIMA models, on the other hand, account for autoregressive, moving average, and differencing components to model linear dependencies and make predictions. The movement caused by Covid-19 seems to affect the movement of both forecasts as the models detect the cyclical pattern of Covid-19 combined with the previous cyclical pattern from 2005 onwards.
Source Code
---title: "Time Series Forecasting"format: html: code-fold: true code-tools: true code-summary: "show code" toc : true theme: cosmo code-overflow: wrap self-containted: trueauthor: "Karim Abdul Aziz Chatab"editor: visual---```{r setup, include=FALSE}knitr::opts_chunk$set(echo = TRUE, eval=TRUE, error=TRUE, cache=TRUE)library(fpp3);library(gt)library(readr);library(patchwork)# Read in and tidy up your data# First three rows contain metadata, read them in separatelymeta <- read_csv("Datasets/IA_Data.csv", col_names = TRUE, n_max = 3)# meta# The data follows after the third row, we skip the metadata and read the data. # Note: Skipping the first row skips the column names, we add them back from the# metadata.dat <- read_csv("Datasets/IA_Data.csv", # use column names from the metadata col_names = colnames(meta), # skip 4 rows as we also skip column names, specified above skip = 4)# Make sure you select the column with your student IDmy_series <- dat %>% # feel free to rename your series appropriately rename(Month = "Student ID", y ="32228643") %>% select(Month, y) %>% mutate(Month=yearmonth(Month)) %>% as_tsibble(index = Month)```# I. Time series analysis of monthly turnover## I.I Training set time-series plot```{r}#Create a training settrain <- my_series %>%filter(Month <yearmonth("2019 Jan"))#plot the time series with appropriate labelsplot = train %>%autoplot() +ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(1982 - 2020)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()#Original plotplot```Very strong seasonality and trend component. handling this with seasonal differencing is required due to seasonality.First order differencing is required as well due to the linear trend. However, transformation is necessary to stabilize the variance, in order to create a simpler and consistent pattern for more accurate forecast.## I.II Initial Data Transformation Selection```{r}#Logarithmic TransformationLog = train %>%autoplot(log(y)) +xlab("Month") +ylab("Logarithmic Turnover") +ggtitle("Logarithmic Transformation") +theme_minimal()#Squared-Root TransformationSQR = train %>%autoplot(sqrt(y)) +xlab("Month") +ylab("Squared-Root of Turnover ") +ggtitle("Squared-Root Transformation") +theme_minimal()#Inference TransformationINF = train %>%autoplot(-1/ y) +xlab("Month") +ylab("Inference of Turnover ") +ggtitle("Inference Transformation") +theme_minimal()#calculate the lambda value through Guerrero methodlambda <- train %>%features(y, features = guerrero) %>%pull(lambda_guerrero)print(paste("Guerrero's value is", lambda)) #Plot the box cox transformation using guerrero as the lambda valuebox_cox = train %>%autoplot(box_cox(y, lambda)) +xlab("Month") +ylab("Box Cox of Turnover ") +ggtitle("Box Cox Transformation") +theme_minimal()#see the plot(box_cox + SQR ) / (Log + INF)```The lambda value from the Guerrero method suggest 0.0228 value, indicates 0 value would be the best in a box cox transformation. 0 lambda value would compute a natural logarithm instead. Furthermore, squared root still has sharp variances and the inference is inconsistent. Therefore, Log transformation would be better.## I.III Initial Data Differencing```{r}#transformationLog#does it need differencing?train %>%features(log(y), feat_stl)#0.917 seasonal strength, yes it does. #How much differencing? train %>%features(log(y), unitroot_nsdiffs)#Seasoned differencing required (n = 1)#First order as well?train %>%features(log(y), unitroot_ndiffs)#First order required (n = 1)#seasoned differencing + logarithmic transformationtrain %>%autoplot(difference(log(y),12)) +theme_minimal()```The seasoned differencing did well by handling the seasonality component from the data. It looks fairly stationary, but indicates a few trending behaviour on the line graph. That is, there are still visible trend patterns or movement in the data.Therefore, following with first order difference would be sufficient.```{r}#first order differencing + logarithmic transformationtrain %>%autoplot(difference(log(y))) +theme_minimal()```The first order differencing sufficiently removes visible trending movement in the dataset. Looks somewhat stationary but the spikes from the plot indicates a strong seasonality component. Therefore, combining first order and seasoned differencing would be ideal to handle both time series components.```{r}#does it still need differencing?train %>%mutate(y =difference(difference(log(y),12),1)) %>%features(y, feat_stl)#0.0160 seasonal strength, below 0.64, no more. #check again.train %>%mutate(y =difference(difference(log(y),12),1)) %>%features(y, unitroot_nsdiffs)#seasoned difference not required (n=0)train %>%mutate(y =difference(difference(log(y),12),1)) %>%features(y, unitroot_ndiffs)#first order difference not required (n=0)#that's enough.#plot the difference combinationtrain %>%autoplot(difference(difference(log(y),12),1)) +theme_minimal()```Most stationary, first order and seasoned differencing has compliment both limitations well. Thus, successfully reducing the trending and seasonality component. No predictable patterns detected. However, there are few long spikes in the beginning but they appear aperiodic. Differencing stabilizes the mean and transformation stabilizes the variances. Select this.```{r}tibble = train %>%mutate(y =difference(difference(log(y),12),1)) %>%features(y, feat_stl)```# II. Autocorrelation and Residual Analysis## II.I Auto Correlation Function (ACF) and Partial Auto Correlation Function (PACF)```{r}#first order differencing + transformation ACF and PACFtrain %>%mutate(y =difference(difference(log(y),12),1)) %>%gg_tsdisplay(plot_type ="partial", lag =24) +labs(title ="ACF plot with log transformation and first order + seasoned differencing", y="")#lag = 24, instead of lag = 36. 36 lags might add more parameters```Both plots has significant spikes at lag 12 and PACF lags on 12, 24, and 36 are decaying in significance indicating an MA(1) on seasonal ARIMA part. Before lag 12, ACF has significant lags on 1, 3, 7 while PACF are on lag 1, 2 and 7. AR(2) would be more parsimonious for the model than MA(3) while putting AR(7) or MA(7) is too complex. Therefore, ARIMA(2,1,0)(0,1,1) is sufficient.```{r}#fit the modelfit = train %>%model(arima =ARIMA(log(y) ~pdq(2, 1, 0) +PDQ(0, 1, 1)))report(fit) ```## II.II Residual Diagnostic of the fitted ARIMA- ***Check the whiteness of the residuals from the fitted ARIMA model. Based on these evaluate and if necessary review the ARIMA model** .*```{r}fit %>%gg_tsresiduals()#Generate the Ljung Box's Valueaugment(fit)%>%features(.innov, ljung_box, lag =24, dof =4)#residuals is not white noise, sufficient evidence to reject at 95% confidence level#Generate the Box Pierce Valueaugment(fit)%>%features(.innov, box_pierce,lag =24, dof =4) ```Residuals are approximately Gaussian, centered around 0. Innovation residuals seems homoskedastic, with constant variances that are centered around 0 as well. ACF shows significant spikes, thus, autocorrelation is present, not white noise. Both hypothesis test reject the null at 95% confidence with 4 degrees of freedom at lag 24.# III. Alternative ARIMA combinations## III.I Manual Selection- *C**onsider four alternative ARIMA models. Use information criteria to choose the best model considered so far.***```{r}#use up all possible alternativesfit_4= train %>%model(arima210_011 =ARIMA(log(y) ~pdq(2, 1, 0) +PDQ(0, 1, 1)),arima013_011 =ARIMA(log(y) ~pdq(0, 1, 3) +PDQ(0, 1, 1)),arima210_210 =ARIMA(log(y) ~pdq(2, 1, 0) +PDQ(2, 1, 0)),arima013_210 =ARIMA(log(y) ~pdq(0, 1, 3) +PDQ(2, 1, 0)))#View the Information Criterionglance(fit_4) %>%arrange(AICc) %>%select(.model:BIC)%>%gt() %>%# Add a title and subtitletab_header(title =md("**Information Criterion of ARIMA**"),subtitle =md("Using three different models") ) %>%# Format number columns with commasfmt_number(decimals =4 ) %>%# Add borders and styletab_style(style =cell_borders(sides ="all",color ="grey",weight =px(1) ),locations =cells_body() ) %>%# Apply bold style to headerstab_style(style =cell_text(weight ="bold"),locations =cells_column_labels(everything()) ) ```The Arima that was chosen at question 2 has the lowest AICc value, the ARIMA(2,1,0)(0,1,1)\[12\]. The rest have more parameters, which could suggest lower parameters might be more parsimonious and provides a better fit to the data relative to the complexity of the model.## III.II Automatic `ARIMA()` Selection```{r}#let R work harder to find all possible options.fit_r = train %>%model(arima_r =ARIMA(log(y), stepwise =FALSE, approx =FALSE, order_constraint = p + q + P + Q <=9 ))#make it extra hard for the ARIMA() function to explore #check the model selectionfit_r$arima_r#ARIMA(4,1,2)(1,1,2)[12]#combined to see it bettercombined_fit =cross_join(fit_r, fit_4)#select Q4 chosen model and ARIMA() modelcombined_fit[c("arima210_011","arima_r")] %>%glance() %>%arrange(AICc) %>%select(.model:BIC)#view the Information Criterion for both models#automatic ARIMA() function is better due to lower AICcfit_r %>%gg_tsresiduals()##Generate the Ljung Box's Valueaugment(fit_r)%>%features(.innov, ljung_box, lag =24, dof =6)#residuals is not white noise, sufficient evidence to reject at 95% confidence level#Generate the Box Pierce Valueaugment(fit_r)%>%features(.innov, box_pierce,lag =24, dof =6) ```R-select model is ordered ARIMA(4,1,2)(1,1,2)\[12\]. R-select is different by combining both seasonal and non seasonal AR and MA. Both models uses the same amount of differencing. Choose the R-select model due to lower AICc value.Residuals are approximately Gaussian, centered around 0. Innovation residuals seems homoskedastic, with constant variances that are centered around 0 as well. ACF plot has no significant spikes at all, no autocorrelation present. The portmanteu test accepts the null hypothesis at 95% confidence level. It is white noise. This indicate a good model that sufficiently captured all variations in the data and its not biased.## III.III ARIMA Champion Model Selection```{r}#Comparison of their accuracy measurescombined_fit %>%forecast(h =24)%>%accuracy(my_series) %>%arrange(RMSE)#Comparison of AICC AIC and BICcombined_fit %>%glance() %>%arrange(AICc) %>%select(.model:AICc) %>%gt() %>%tab_header(title ="ARIMA model comparison",subtitle ="Sorted by Akaike Infomation Criterion (AIC)") %>% gt::tab_style(style =cell_text(weight ="bold"), locations =cells_column_labels(everything())) #All arima model have same amount of differencing ```Interestingly, ARIMA(4,1,2)(1,1,2) performed the worse in the test set but has the lowest AICc. ARIMA(0,1,3)(2,1,0) has the lowest error values in all error measure but has the highest AICc. Nevertheless, ARIMA(4,1,2)(1,1,2) is chosen due to lower AICc as it reflects this model fit the best to the entire dataset.# IV. Forecast the next two years```{r}#whole datasetfit_r %>%forecast(h=24) %>%autoplot(my_series) +ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(2016 - 2022) (Forecast to 2022)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()```***Zoom in***```{r}#zoom in fit_r %>%forecast(h=24) %>%autoplot(my_series[ 420:465,]) +ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(2016 - 2022) (Forecast to 2022)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()```It shows reasonable mean over the next two years because the movement of the prediction intervals suggest it capture trend and seasonal. Prediction intervals are quite wide around the point forecast and suggest high degree of uncertainty especially for periods further in the future.# V. Whole data forecast## V.I Full Dataset Plot```{r}library(readxl)#read the data and perform data wranglingexcel =read_xlsx("Datasets/8501011.xlsx", sheet =2, skip =9)#create a tsibble datasetdf = excel %>%select("Series ID", A3349637X) %>%rename(Month =`Series ID`,y = A3349637X) %>%mutate(Month =yearmonth(Month)) %>%as_tsibble(index = Month)#plot the datadf %>%autoplot()+ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(1982 Apr - 2023 March)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()```This dataset is a continuation of the previous monthly turnover dataset. The plot suggests that it might benefit during covid-19 as there is a visible jump on the spikes from 2020 and a slight increase overall onwards.## V.II Forecast using three different methods```{r}#Train train_df = df %>%slice(1:(n()-27)) #best benchmark, Best ETS and ARIMA)best_model = train_df %>%model(ETS =ETS(log(y) ~error("M") +trend("Ad") +season("M")),ARIMA =ARIMA(log(y) ~0+pdq(4,1,2) +PDQ(1,1,2)),SNAIVE =SNAIVE(log(y))) #plot on the whole data setbest_model %>%forecast(h =27) %>%autoplot(df) +ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(1982 Apr - 2023 Mar) (with forecast)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()#zoom inbest_model %>%forecast(h =27) %>%autoplot(df[430:492,], alpha =0.6)+ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(2018 Jan - 2023 Mar) (with forecast)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()#seperatebest_model %>%forecast(h =27) %>%autoplot(df[430:492,]) +facet_wrap(~.model)+ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(2018 Jan - 2023 Mar) (with forecast)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()```Out of all three models, it seems that the model for ETS is the closest from the actual data to the point forecast overall. However, The ETS Model has the widest confidence interval overall. Thus, ETS method has higher level of uncertainty in predicting the future values of the time series compared to the other methods. ETS and ARIMA models has capture the seasonality component and trend component. SNAIVE models use the seasonal pattern from historical data to directly forecast future values.As SNAIVE is more sensitive to the past data, the forecast seem to do better due to Covid-19 increasing the turnover, allowing the actual data to meet a few of the forecast. For ETS model, it seems that the weight given to the past observations when forecasting future values align perfectly with the positive effects of Covid-19. The ARIMA method seems to perform the worse as the point forecast seem to slowly get farther away from the actual data over time.## V.III Accuracy Evaluation```{r}acc_table = best_model%>%forecast(h =27) %>%accuracy(df) %>%arrange(RMSE)#evaluate each model's accuracyacc_table %>%gt() %>%# Add a title and subtitletab_header(title =md("**Accuracy Metrics of Forecast**"),subtitle =md("Using three different models") ) %>%# Format number columns with commasfmt_number(decimals =4 ) %>%# Add borders and style gt::tab_style(style =cell_borders(sides ="all",color ="grey",weight =px(1) ),locations =cells_body() ) %>%# Apply bold style to headers gt::tab_style(style =cell_text(weight ="bold"),locations =cells_column_labels(everything()) ) ```ETS performed the best, it has the lowest RMSE, MAE, MPE and MAPE from all 3 models. Suggesting that the accuracy of the point forecast for ETS is the most accurate forecast as compare to Seasonal Naive method and ARIMA method for this dataset.## V.IV Two years ahead forecast evaluation```{r}#Create your final fit final_fit =df %>%model(ETS =ETS(log(y) ~error("M") +trend("Ad") +season("M")),ARIMA =ARIMA(log(y) ~0+pdq(4,1,2) +PDQ(1,1,2)),SNAIVE =SNAIVE(log(y))) #plot the forecast on the whole data setfinal_fit %>%forecast(h =24) %>%autoplot(df) +ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(1982 Apr - 2023 Mar) (forecast to 2025 Mar)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()#zoom infinal_fit %>%forecast(h =24) %>%autoplot(df[454:492,], alpha =0.6)+ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(2020 Jan - 2023 Mar) (forecast to 2025 Mar)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()#seperatefinal_fit %>%forecast(h =24) %>%autoplot(df[454:492,]) +facet_wrap(~.model)+ggtitle("Monthly turnover of electrical and electronic goods retailing in Queensland", subtitle ="(2020 Jan - 2023 Mar) (forecast to 2025 Mar)") +ylab("Turnover ($ Millions)") +xlab("Month") +theme_minimal()```All three model attempted to forecast on the recovery period of Covid-19.These three model Forecast shows reasonable mean over the next two years because the movement of the prediction intervals suggest it capture trend and seasonal variation except SNAIVE model. The three model's prediction intervals are quite wide around the point forecast and suggest high degree of uncertainty especially for periods further in the future. Implying that the actual values are expected to vary widely around the mean forecasts over time.SNAIVE model rely on the seasonal patterns observed in historical data to directly forecast future values. Therefore, it does not consider the trend component. In this case, it has given a wider prediction interval compared to the rest. It gives a similair movement as to post 2020 or during Covid-19 times.The ETS model breaks down the time series into error, damped trend, and seasonality components to capture various patterns and dynamics in the forecast.ARIMA models, on the other hand, account for autoregressive, moving average, and differencing components to model linear dependencies and make predictions. The movement caused by Covid-19 seems to affect the movement of both forecasts as the models detect the cyclical pattern of Covid-19 combined with the previous cyclical pattern from 2005 onwards.