Time Series ARIMA Modeling

Author

Lasya Pinnamaneni

Section 1

Load libraries

Code
suppressWarnings({
  suppressMessages({
    library(dplyr)
    library(ggplot2)
    library(lubridate)
    library(gridExtra)
    library(forecast)
    library(modeltime)
    library(fabletools)
    library(feasts)
    library(tsibble)
    library(tibble)
    library(fable)

    options(warn = -1)  # silence warnings for the rendered report
  })
})

Read data and convert to time series format

Code
vehicle_df <- read.csv('/Users/lasyapinnamaneni/Desktop/MS BANA/Time series/total_vehicle_sales.csv')

vehicle_sales_tbl_ts <- vehicle_df %>%
  select(date, vehicle_sales) %>%
  mutate(value = vehicle_sales,       # duplicate the series under a generic name
         date = yearmonth(date)) %>%  # monthly index for the tsibble
  as_tsibble(index = date)
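
Before going further, it is worth confirming that the monthly index has no implicit gaps; a minimal sketch using tsibble's count_gaps():

Code
# Any rows returned here indicate missing months in the index.
vehicle_sales_tbl_ts %>%
  count_gaps()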

Assess the behavior/data-generating process of your time-series using the techniques discussed in class. From your understanding of the process, would you expect the time-series to be mean stationary? Variance stationary? Why or why not?

Code
vehicle_df_plot <- ggplot(data = vehicle_sales_tbl_ts, aes(x = date, y = value)) +
  geom_line() +
  xlab("Time") +
  ylab("Vehicle sales") +
  ggtitle("Vehicle Sales over Time")
vehicle_df_plot

The time series plot shows that the series is neither mean stationary nor obviously variance stationary: sales rise and fall over long stretches rather than fluctuating around a fixed level. As a first check, we examine the rolling standard deviation of the differenced series.
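
A seasonal plot is another quick way to assess the data-generating process; a minimal sketch using feasts' gg_season():

Code
# Overlay each year's monthly trajectory; a repeating shape suggests
# stable seasonality in the data-generating process.
vehicle_sales_tbl_ts %>%
  gg_season(value) +
  ggtitle("Seasonal plot of vehicle sales")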

If the data appear to be variance non-stationary (using a rolling SD or other method), transform your time-series using a natural log or Box-Cox transformation.

Code
# Two-month difference of the series
vehicle_sales_diff <- vehicle_sales_tbl_ts %>%
  mutate(value_diff = value - lag(value, 2)) %>%
  as_tsibble(index = date)

# Rolling 12-month standard deviation of the differenced series
vehicle_sales_diff %>%
  mutate(
    diff_sd = zoo::rollapply(
      value_diff,
      FUN = sd,
      width = 12,
      fill = NA)) %>%
  na.omit() %>%
  ggplot() +
  geom_line(aes(date, diff_sd)) +
  geom_smooth(aes(date, diff_sd), method = 'lm', se = F) +
  ggtitle("Standard Deviation of Differenced Vehicle Sales over Time") +
  ylab("SD of Differenced Vehicle Sales") +
  xlab("Date") +
  theme_bw()
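No transformation is applied in what follows; had the rolling SD shown a clear trend, a Box-Cox transformation would be the next step. A minimal sketch using the Guerrero method from feasts/fabletools:

Code
# Sketch only: estimate a Box-Cox lambda via the Guerrero method and
# plot the transformed series to see whether the variance stabilizes.
lambda <- vehicle_sales_tbl_ts %>%
  features(value, features = guerrero) %>%
  pull(lambda_guerrero)

vehicle_sales_tbl_ts %>%
  autoplot(box_cox(value, lambda)) +
  ggtitle("Box-Cox Transformed Vehicle Sales")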

If seasonality is present in your data, use seasonal differencing to remove the seasonal effect (for example, for monthly data estimate the 12-month difference).

Code
vehicle_sales_diff <- vehicle_sales_diff %>% 
  mutate(value_seasonal = difference(value, 12))

vehicle_sales_diff %>%
  gg_tsdisplay(value_seasonal, plot_type = 'partial', lag = 40) +
  labs(title = "Seasonally differenced", y = "")

After seasonal differencing, the ACF and PACF show no remaining seasonal pattern, indicating that the 12-month difference has removed the seasonal effect from the vehicle data.
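
The number of differences required can also be checked programmatically; a minimal sketch using the unit-root features from feasts:

Code
# unitroot_nsdiffs estimates how many seasonal differences are needed;
# unitroot_ndiffs does the same for ordinary (non-seasonal) differences.
vehicle_sales_tbl_ts %>%
  features(value, list(unitroot_nsdiffs, unitroot_ndiffs))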

Conduct and interpret a KPSS test for stationarity after performing the above operations (if necessary). Do the results suggest that the process is mean stationary? If the process is mean non-stationary, difference the data until it is mean stationary (after log/Box-Cox and seasonal differencing) and visualize it.

Code
value_diff_kpss <- vehicle_sales_diff %>%
  features(value_diff, unitroot_kpss)
value_diff_kpss
# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1    0.0190         0.1

The p-value is greater than 0.05, so we fail to reject the KPSS null hypothesis of stationarity: the differenced series appears to be mean stationary.
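
Had the KPSS test rejected stationarity, the remedy would be to difference again and re-test; a minimal sketch:

Code
# Sketch only: one more round of differencing, then re-run the KPSS test.
vehicle_sales_diff %>%
  mutate(value_diff2 = difference(value_diff)) %>%
  features(value_diff2, unitroot_kpss)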

Section 2

Produce and interpret ACF/PACF plots of the transformed time series after the above operations. Based on the ACF/PACF alone, does the time-series appear to be an autoregressive process, moving average process, combined, or neither? What might you suspect the order of the time-series to be (e.g. ARIMA(0,1,2)) using the ACF/PACF plots and stationarity tests?

Code
par(mfrow = c(1, 2))
acf(vehicle_sales_diff$value_diff, na.action = na.pass)
pacf(vehicle_sales_diff$value_diff, na.action = na.pass)

The ACF shows significant spikes through lag 2, consistent with an MA(2) component, while the PACF also cuts off after lag 2, suggesting an AR order of 2.

Based on the ACF/PACF plots and the stationarity tests above, a reasonable initial guess is ARIMA(2,0,2) for the differenced series.

Section 3

Fit several ARIMA models to your time-series variable based on the “best guesses” above. Which model is the “best” according to the AIC or BIC? After comparing several models, implement the automated ARIMA function from fable to find the “best” model. Does the automated ARIMA function select the same model as you did? If not, why do you think this is the case?

Code
models_bic = vehicle_sales_diff %>%
  model(
    mod1 = ARIMA(vehicle_sales ~ pdq(0,1,0) + PDQ(1,1,0)),
    mod2 = ARIMA(vehicle_sales ~ pdq(1,1,1) + PDQ(1,1,0)),
    mod3 = ARIMA(vehicle_sales ~ pdq(1,0,1) + PDQ(1,1,0)),
    mod4 = ARIMA(vehicle_sales ~ pdq(2,0,2) + PDQ(1,1,0)),
    mod5 = ARIMA(vehicle_sales ~ pdq(2,0,1) + PDQ(1,1,0)),
    mod6 = ARIMA(vehicle_sales ~ pdq(2,0,0) + PDQ(1,1,0))
  )

models_bic %>%
  glance() %>%
  arrange(BIC)
# A tibble: 6 × 8
  .model sigma2 log_lik   AIC  AICc   BIC ar_roots   ma_roots 
  <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <list>     <list>   
1 mod4   12412.  -3452. 6915. 6915. 6941. <cpl [14]> <cpl [2]>
2 mod3   12774.  -3460. 6928. 6928. 6946. <cpl [13]> <cpl [1]>
3 mod2   13130.  -3462. 6931. 6931. 6949. <cpl [13]> <cpl [1]>
4 mod5   12763.  -3459. 6929. 6929. 6950. <cpl [14]> <cpl [1]>
5 mod6   13218.  -3470. 6947. 6947. 6965. <cpl [14]> <cpl [0]>
6 mod1   16676.  -3529. 7063. 7063. 7071. <cpl [12]> <cpl [0]>

Based on the BIC criterion, the best model is mod4, ARIMA(2,0,2)(1,1,0)[12]; its non-seasonal order (2,0,2) matches the guess from the ACF and PACF plots.
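
To inspect the winning model's coefficients, its report can be printed; a minimal sketch:

Code
# Print the coefficient estimates and information criteria for mod4.
models_bic %>%
  select(mod4) %>%
  report()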

After selecting the best model, derive the fitted values for the time series and plot them against the observed values of the series. Do the in-sample predicted values tend to follow the trends in the data?

Code
best_model <- models_bic %>% select(mod4)

best_model %>%
  augment() %>%
  ggplot() +
  geom_line(aes(date, vehicle_sales)) +
  geom_line(aes(date, .fitted), color = "blue", alpha = 0.5) +
  theme_bw() +
  xlab("Date") +
  ylab("Sales") +
  ggtitle("Observed vs Fitted Values")

The fitted values track the observed series closely, following both the overall trend and the seasonal swings in the data.
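
In-sample accuracy metrics give a numeric complement to the plot; a minimal sketch using fabletools' accuracy():

Code
# Training-set accuracy (RMSE, MAE, MAPE, etc.) for the selected model.
best_model %>%
  accuracy()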

Section 4

Compute and analyze the residuals from the selected model.

Code
best_model %>%
  gg_tsresiduals(lag=52)

Conduct a Box-Ljung test for residual autocorrelation and examine the ACF/PACF plots in the residuals. Do the residuals appear to be white noise? If the residuals do not appear to be white noise, either interpret your results (or lack thereof) or suggest a possible solution to the problem.

Code
best_model %>%
  augment() %>%
  features(.innov, ljung_box, lag = 10, dof = 2)
# A tibble: 1 × 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 mod4      10.7     0.218


With a p-value of 0.218, we do not have enough evidence to reject the null hypothesis of no residual autocorrelation. The residuals therefore appear to behave like white noise, which is the desired outcome for a well-specified time series model.
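
As a robustness check, the same test can be repeated at a longer lag; a minimal sketch (the conclusion should hold if the residuals are truly white noise):

Code
# Repeat the Ljung-Box test over two years of lags as a robustness check.
best_model %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24, dof = 2)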

Auto ARIMA

Code
best_model_1 <- vehicle_sales_diff %>%
  model(
    ARIMA(value, approximation = FALSE)
  ) %>%
  report()
Series: value 
Model: ARIMA(2,0,2)(0,1,2)[12] 

Coefficients:
         ar1     ar2     ma1      ma2     sma1     sma2
      0.1906  0.7288  0.2886  -0.4704  -0.6757  -0.1606
s.e.  0.1587  0.1459  0.1541   0.0662   0.0438   0.0418

sigma^2 estimated as 9883:  log likelihood=-3392
AIC=6797.99   AICc=6798.19   BIC=6828.32

The automated ARIMA procedure selects ARIMA(2,0,2)(0,1,2)[12]. The non-seasonal order matches our manual choice of (2,0,2), but the automated search prefers a seasonal MA(2) over our seasonal AR(1), likely because it explores a wider space of seasonal specifications and compares candidates on information criteria.
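
A more exhaustive search can be run by also disabling the stepwise heuristic; a minimal sketch (slower, and it may or may not change the selected model):

Code
# Exhaustive search: disable both stepwise selection and the approximation.
vehicle_sales_diff %>%
  model(ARIMA(value, stepwise = FALSE, approximation = FALSE)) %>%
  report()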

Finally, create a 6 time-period forecast for your model from the end of the data. Do the forecasted values seem reasonable? Why or why not?

Code
best_model_1 %>%
  forecast(h=6) %>%
  autoplot(vehicle_sales_diff)+
  ylab('Vehicle sales')+
  ggtitle("6 months Forecast")+
  theme_bw()

The forecast appears reasonable: it continues the recent level and seasonal pattern of the series, and the predicted values stay within the historical range of vehicle sales.
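
The point forecasts and prediction intervals can also be inspected as a table; a minimal sketch using fabletools' hilo():

Code
# Extract point forecasts with 80% and 95% prediction intervals.
best_model_1 %>%
  forecast(h = 6) %>%
  hilo(level = c(80, 95))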