BANA 7050: Time Series ARIMA Modeling

Author

Saurabh Verma

Section 1

Read library

Code
suppressWarnings({
  suppressMessages({
    library(dplyr)
    library(ggplot2)
    library(lubridate)
    library(gridExtra)
    library(forecast)
    library(modeltime)
    library(fabletools)
    library(feasts)
    library(tsibble)
    library(tibble)
    library(fable)

    options(warn = -1)
  })
})

Read data and convert to time series format

Code
milk_df <- read.csv('/Users/saurabhverma/Desktop/UC BANA/Spring 2024 Flex I/Forecasting/milk_price.csv')

milk_price_tbl_ts <- milk_df %>%
  select(date, milk_price) %>%
  mutate(value = milk_price) %>%
  mutate(date = yearmonth(date)) %>%
  as_tsibble(index = date)

Assess the behavior/data-generating process of your time-series using the techniques discussed in class. From your understanding of the process, would you expect the time-series to be mean stationary? Variance stationary? Why or why not?

Code
milk_df_plot <- ggplot(data = milk_price_tbl_ts, aes(x = date, y = value)) +
  geom_line() +
  xlab("Time") +
  ylab("Milk Price") +
  ggtitle("Milk Price over Time")
milk_df_plot

The time series plot suggests that the series is neither mean nor variance stationary. As a first check, we examine the rolling standard deviation of the differenced series.
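To probe the data-generating process further, an STL decomposition separates the trend and any seasonal component. A minimal sketch using feasts, with the trend() and season() terms left at their defaults:

Code
milk_price_tbl_ts %>%
  model(STL(value ~ trend() + season())) %>%
  components() %>%
  autoplot()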

If the data appear to be variance non-stationary (using a rolling SD or other method), transform your time-series using a natural log or Box-Cox transformation.

Code
milk_price_diff <- milk_price_tbl_ts %>%
  mutate(value_diff = value - lag(value)) %>%
  as_tsibble(index = date)

milk_price_diff %>%
  mutate(
    diff_sd = zoo::rollapply(
      value_diff,   # rolling SD of the differenced series, matching the plot title
      FUN = sd,
      width = 12,
      fill = NA)) %>%
  na.omit() %>%
  ggplot() +
  geom_line(aes(date, diff_sd)) +
  geom_smooth(aes(date, diff_sd), method = 'lm', se = FALSE) +
  ggtitle("Standard Deviation of Differenced Milk Price over Time") +
  ylab("SD of Differenced Milk Price") +
  xlab("Date") +
  theme_bw()
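If the rolling SD trends over time, indicating variance non-stationarity, a Box-Cox transformation can stabilize the variance. A minimal sketch (not applied in the analysis below), using the Guerrero method from feasts to choose lambda and box_cox() from fabletools:

Code
# Estimate a Box-Cox lambda via the Guerrero method, then plot the transformed series
lambda <- milk_price_tbl_ts %>%
  features(value, features = guerrero) %>%
  pull(lambda_guerrero)

milk_price_tbl_ts %>%
  mutate(value_bc = box_cox(value, lambda)) %>%
  autoplot(value_bc) +
  labs(title = "Box-Cox Transformed Milk Price", y = "Transformed value")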

If seasonality is present in your data, use seasonal differencing to remove the seasonal effect (for example, for monthly data, estimate the 12-month seasonal difference).

Code
milk_price_diff <- milk_price_diff %>%
  mutate(value_seasonal = difference(value, 12))

milk_price_diff %>%
  gg_tsdisplay(value_seasonal, plot_type = 'partial', lag = 36) +
  labs(title = "Seasonally differenced", y = "")

The plots of the seasonally differenced series show no pronounced spikes at the seasonal lags, suggesting that seasonality is not a strong feature of the milk price data.
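This can be corroborated with a formal check: unitroot_nsdiffs() reports the number of seasonal differences suggested for the series, where a value of 0 would support the conclusion above. A sketch:

Code
# Number of seasonal differences suggested for the raw series
milk_price_tbl_ts %>%
  features(value, unitroot_nsdiffs)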

Conduct and interpret a KPSS test for stationarity after performing the above operations (if necessary). Do the results suggest that the process is mean stationary? If the process is mean non-stationary, difference the data until it is mean stationary (after log/Box-Cox and seasonal differencing) and visualize it.

Code
diff_value_kpss <- milk_price_diff %>%
  features(value_diff, unitroot_kpss)   # KPSS test on the first-differenced series
diff_value_kpss
# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1    0.0455         0.1

The KPSS p-value of 0.1 is above 0.05, so we fail to reject the null hypothesis of stationarity: the first-differenced series appears to be mean stationary.
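As a cross-check, unitroot_ndiffs() applies the KPSS test repeatedly to report how many regular differences are required; a value of 1 would agree with the single difference taken above. A sketch:

Code
# Number of first differences suggested by repeated KPSS tests
milk_price_tbl_ts %>%
  features(value, unitroot_ndiffs)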

Section 2

Produce and interpret ACF/PACF plots of the transformed time series after the above operations. Based on the ACF/PACF alone, does the time-series appear to be an autoregressive process, moving average process, combined, or neither? What might you suspect the order of the time-series to be (e.g. ARIMA(0,1,2)) using the ACF/PACF plots and stationarity tests?

Code
par(mfrow = c(1, 2))
acf(milk_price_diff$value_diff, na.action = na.pass, main = "ACF of differenced series")
pacf(milk_price_diff$value_diff, na.action = na.pass, main = "PACF of differenced series")

The ACF shows significant spikes through lag 2 before cutting off, suggesting a moving average (MA) order of 2; the PACF likewise cuts off after lag 2, suggesting an autoregressive (AR) order of 2.

Combined with the single regular difference required for stationarity, the lag plots point to an ARIMA(2,1,2) specification.
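The same diagnostics can be drawn in a single panel with feasts, mirroring the seasonal plot above; a sketch on the first-differenced series:

Code
milk_price_diff %>%
  gg_tsdisplay(value_diff, plot_type = 'partial', lag = 36) +
  labs(title = "First-differenced milk price", y = "")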

Section 3

Fit several ARIMA models to your time-series variable based on the “best guesses” above. Which model is the “best” according to the AIC or BIC? After comparing several models, implement the automated ARIMA function from fable to find the “best” model. Does the automated ARIMA function select the same model as you did? If not, why do you think this is the case?

Code
models_bic = milk_price_diff %>%
  model(
    mod1 = ARIMA(milk_price ~ pdq(2,1,2)),
    mod2 = ARIMA(milk_price ~ pdq(0,0,1)),
    mod3 = ARIMA(milk_price ~ pdq(1,0,1)),
    mod4 = ARIMA(milk_price ~ pdq(0,0,0)),
    mod5 = ARIMA(milk_price ~ pdq(2,0,1)),
    mod6 = ARIMA(milk_price ~ pdq(2,0,0))
  )

models_bic %>%
  glance() %>%
  arrange(BIC)
# A tibble: 6 × 8
  .model  sigma2 log_lik   AIC  AICc   BIC ar_roots   ma_roots  
  <chr>    <dbl>   <dbl> <dbl> <dbl> <dbl> <list>     <list>    
1 mod1   0.00442   441.  -872. -872. -853. <cpl [2]>  <cpl [2]> 
2 mod6   0.00450   437.  -866. -866. -850. <cpl [2]>  <cpl [0]> 
3 mod3   0.00453   436.  -864. -864. -849. <cpl [1]>  <cpl [1]> 
4 mod5   0.00451   437.  -864. -864. -845. <cpl [2]>  <cpl [1]> 
5 mod2   0.0288    117.  -225. -225. -206. <cpl [12]> <cpl [13]>
6 mod4   0.0894    -76.4  161.  161.  176. <cpl [12]> <cpl [12]>

Based on the BIC criterion, the best model is ARIMA(2,1,2), which matches the order we suspected from the ACF and PACF plots.
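The estimated coefficients of the winning model can be inspected with report(); a short sketch:

Code
models_bic %>%
  select(mod1) %>%
  report()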

After selecting the best model, derive the fitted values for the time series and plot them against the observed values of the series. Do the in-sample predicted values tend to follow the trends in the data?

Code
best_model <- models_bic %>% select(mod1)
fitted <- best_model %>% augment() %>% .$.fitted

ggplot() +
  geom_line(aes(milk_price_diff$date, milk_price_diff$value)) +
  geom_line(aes(milk_price_diff$date, fitted), color = "blue", alpha = 0.5) +
  theme_bw() +
  xlab("Date") +
  ylab("Milk Price") +
  ggtitle("Observed vs Fitted Values")

The fitted values track the observed series closely, indicating a good in-sample fit.
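The in-sample fit can also be summarized numerically; a sketch using fabletools' accuracy(), which reports training-set error measures such as RMSE and MAE:

Code
best_model %>%
  accuracy()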

Section 4

Compute and analyze the residuals from the selected model.

Code
best_model %>%
  gg_tsresiduals(lag=52)

Conduct a Box-Ljung test for residual autocorrelation and examine the ACF/PACF plots in the residuals. Do the residuals appear to be white noise? If the residuals do not appear to be white noise, either interpret your results (or lack thereof) or suggest a possible solution to the problem.

Code
best_model %>%
  augment() %>%
  features(.innov, ljung_box, lag = 10, dof = 2)
# A tibble: 1 × 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 mod1      2.87     0.942

Given the high p-value (≈0.82) with dof set to the number of estimated ARMA parameters, we fail to reject the null hypothesis of no residual autocorrelation. The residuals appear to be white noise, which is what we want from a well-specified time series model.
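As a cross-check outside fable, base R's Box.test() runs the same Ljung-Box test on the extracted residuals, with fitdf playing the role of dof. A sketch:

Code
# Ljung-Box test on the innovation residuals; fitdf = p + q = 4
resids <- best_model %>% augment() %>% pull(.innov)
Box.test(resids, lag = 10, type = "Ljung-Box", fitdf = 4)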

Auto ARIMA

Code
best_model_1 <- milk_price_diff %>%
  model(
    ARIMA(value, approximation = FALSE)
  ) %>%
  report()
Series: value 
Model: ARIMA(2,1,2) 

Coefficients:
         ar1      ar2     ma1     ma2
      0.0485  -0.4414  0.2674  0.5222
s.e.  0.2421   0.1281  0.2327  0.1237

sigma^2 estimated as 0.004423:  log likelihood=441.04
AIC=-872.08   AICc=-871.9   BIC=-852.94

The automated ARIMA procedure also identifies ARIMA(2,1,2) as the best model.

Auto ARIMA therefore selects the same order as our manual search via BIC.
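By default, fable's ARIMA() uses a stepwise search; disabling it widens the search space at the cost of speed. A sketch of a more exhaustive search (the selected model may or may not change):

Code
milk_price_diff %>%
  model(ARIMA(value, stepwise = FALSE, approximation = FALSE)) %>%
  report()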

Finally, create a 6 time-period forecast for your model from the end of the data. Do the forecasted values seem reasonable? Why or why not?

Code
best_model_1 %>%
  forecast(h=6) %>%
  autoplot(milk_price_diff)+
  ylab('Milk Price')+
  ggtitle("6 months Forecast")+
  theme_bw()

The forecast appears reasonable: the point forecasts continue the recent level and direction of the series, and the prediction intervals widen with the horizon, as expected.
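The numeric forecasts and their prediction intervals can also be tabulated with hilo(); a sketch at the 95% level:

Code
best_model_1 %>%
  forecast(h = 6) %>%
  hilo(level = 95)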