Forecasting Methods Assignment 3

Author

Sai Lohith Roy Valleri

Code

library(ggplot2)
library(slider)

Warning: package 'slider' was built under R version 4.3.2

Code

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Code

library(tidyr)

Warning: package 'tidyr' was built under R version 4.3.2

Code

library(readxl)

Warning: package 'readxl' was built under R version 4.3.2

Code

library(zoo)


Attaching package: 'zoo'

The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

Code

#install.packages("forecast")
library(forecast)

Warning: package 'forecast' was built under R version 4.3.2

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

Code

library(lubridate)

Warning: package 'lubridate' was built under R version 4.3.2


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

Code

library(tsfeatures)

Warning: package 'tsfeatures' was built under R version 4.3.2

Code

library(tseries)

Warning: package 'tseries' was built under R version 4.3.2

Code

data <- read.csv("total_vehicle_sales.csv", header = TRUE)
head(data)

        date vehicle_sales
1 1976-01-01         885.2
2 1976-02-01         994.7
3 1976-03-01        1243.6
4 1976-04-01        1191.2
5 1976-05-01        1203.2
6 1976-06-01        1254.7

Code

date <- as.Date(data$date, format = "%Y-%m-%d")
str(date)

 Date[1:575], format: "1976-01-01" "1976-02-01" "1976-03-01" "1976-04-01" "1976-05-01" ...

-------Section 1-------

The data set given is the vehicle sales over years recorded month on month. This data set would be seasonal as the vehicle sales depends on time period of the year, economic conditions, etc. As we can from the below plot the data set seem to be variance stationary with the vehicle sales and the mean is also stationary over the period of time based on the KPSS test performed which gave the KPSS stat value of 1.52 and the p-value of 0.1 which is less than that of the significance level. This means that we reject the null hypothesis (Null Hypothesis- The vehicle sales series has a constant mean and variance over time).

Code

ggplot() +
  geom_line(aes(x = date, y = data$vehicle_sales)) +
  labs(x = "Month", y = "Vehicle Sales", title = "Monthly Vehicle Sales Over Time")

Code

data %>%
mutate(rolling_sd = slide_dbl(vehicle_sales, ~sd(.),.before = 6,.after = 6)) %>%
ggplot(aes(date, rolling_sd)) +
  geom_line(aes(group = 1)) +
  theme_bw() +
  xlab("Date") +
  ylab("Rolling SD")

Code

kpss.test(data$vehicle_sales)

Warning in kpss.test(data$vehicle_sales): p-value smaller than printed p-value


    KPSS Test for Level Stationarity

data:  data$vehicle_sales
KPSS Level = 1.5216, Truncation lag parameter = 6, p-value = 0.01

------Section 2---------

Based on the interpretation from the below ACF and PACF plots the dampening suggests that the model could be AR process, maybe of order 2. Also, it could be MA process of order 3 based on the significance levels in PACF plot.

The significance levels at 12 and 24 suggests that the data could also be a seasonal arima process.

Code

tsdisplay(diff((data$vehicle_sales),12),
            type='partial', lag=24)

Warning in plot.xy(xy.coords(x, y), type = type, ...): plot type 'partial' will
be truncated to first character

Warning in plot.xy(xy.coords(x, y), type = type, ...): plot type 'partial' will
be truncated to first character

Warning in plot.xy(xy, type, ...): plot type 'partial' will be truncated to
first character

Warning in plot.xy(xy, type, ...): plot type 'partial' will be truncated to
first character

-------Section 3-------

My best guess model was ARIMA(2,1,3). Based on the dampening I assumed that the model might be a AR2 but it seems that the best fit model does not have the AR process at all. The best fit according to the BIC (the least BIC Value) is Arima(3, 1, 3) with the BIC value of 7342.77.

I believe that the reason could be because, I did not fit the seasonal effects in my manual model with PDQ being 0 in all cases.

Code

mod1 = Arima(data$vehicle_sales, order = c(2, 1, 1), seasonal = c(0, 0, 0)) 
mod2 = Arima(data$vehicle_sales, order = c(2, 1, 3), seasonal = c(0, 0, 0)) ##Predicted model without seasonality
mod3 = Arima(data$vehicle_sales, order = c(1,1,3), seasonal = c(0,0,0))
modauto <- auto.arima(data$vehicle_sales, approximation=F, seasonal = T)


summary(mod1)

Series: data$vehicle_sales 
ARIMA(2,1,1) 

Coefficients:
          ar1      ar2     ma1
      -1.0981  -0.4598  0.6621
s.e.   0.0716   0.0386  0.0736

sigma^2 = 22928:  log likelihood = -3694.67
AIC=7397.34   AICc=7397.41   BIC=7414.75

Training set error measures:
                    ME     RMSE      MAE        MPE     MAPE      MASE
Training set 0.9660185 150.8907 119.5875 -0.9175722 9.751778 0.8897796
                    ACF1
Training set -0.02574537

Code

summary(mod2)

Series: data$vehicle_sales 
ARIMA(2,1,3) 

Coefficients:
          ar1     ar2      ma1      ma2     ma3
      -0.0946  0.4633  -0.4233  -0.6986  0.2554
s.e.   0.1141  0.0766   0.1115   0.0671  0.0627

sigma^2 = 21955:  log likelihood = -3681.4
AIC=7374.81   AICc=7374.96   BIC=7400.92

Training set error measures:
                   ME     RMSE      MAE     MPE     MAPE      MASE        ACF1
Training set 2.353971 147.3964 115.3181 -1.0964 9.442797 0.8580137 -0.03539981

Code

summary(mod3)

Series: data$vehicle_sales 
ARIMA(1,1,3) 

Coefficients:
         ar1      ma1     ma2      ma3
      0.5616  -1.1651  0.3672  -0.1115
s.e.  0.0919   0.0953  0.0956   0.0495

sigma^2 = 22033:  log likelihood = -3682.92
AIC=7375.85   AICc=7375.95   BIC=7397.61

Training set error measures:
                   ME     RMSE      MAE       MPE     MAPE      MASE       ACF1
Training set 2.404843 147.7884 115.7806 -1.092335 9.448824 0.8614551 0.01561606

Code

summary(modauto)

Series: data$vehicle_sales 
ARIMA(3,1,3) 

Coefficients:
          ar1      ar2     ar3     ma1     ma2      ma3
      -0.5134  -0.5280  0.4515  0.0190  0.1831  -0.8445
s.e.   0.0503   0.0497  0.0491  0.0301  0.0298   0.0260

sigma^2 = 19574:  log likelihood = -3649.15
AIC=7312.3   AICc=7312.5   BIC=7342.77

Training set error measures:
                   ME     RMSE      MAE        MPE     MAPE      MASE
Training set 2.325575 139.0539 110.2803 -0.9573613 9.040582 0.8205302
                    ACF1
Training set -0.02804847

The fitted model has failed in predicting the trends with a difference which is significat even when the i specifically mentioned ‘seasonl = TRUE’. In the previously fitted model seasonality was not considered hence the difference in fitted vs actual. Below is the model with the seasonality effect which is based on trail and error method using ‘Arima’. Here the PDQ values are (1,1,1) with BIC value of 6808.66.

Code

fitted <- fitted(modauto)
plot(data$vehicle_sales, type='l', col='blue', xlab='Time', ylab='Vehicle Sales', main='Observed vs Fitted Values')
lines(fitted, col='red')
legend("topleft", legend=c("Obs", "Fit"), col=c("blue", "red"), lty=1)

------Section 4-----

The Box-Ljung test gave the p-value of 0.0117 for auto arima plot which is less than the significance value meaning that the null hypothesis must be rejected. As the auto arima model does not contain the seasonality effect I have created a new model with the seasonality effect which has the value of 0.3362 which is clearly higher than 0.05. Hence, we are using the model with seasonality effect.

The observed vs fitted values of the seasonal arima almost captures the trends and seasonality of the data. Hence the correction was made and the updated model is the best fit model.

Code

##Residual and ACF plot of auto arima model without seasonality
autoplot(residuals(modauto))

Code

resauto <- residuals(modauto)
Acf(resauto, main = "ACF of Residuals")

Code

hist(resauto)

Code

# Conduct Box-Ljung test for auto arima model without seasonality
Box.test(resauto, lag = 10, type = "Ljung-Box")


    Box-Ljung test

data:  resauto
X-squared = 93.575, df = 10, p-value = 9.992e-16

Code

##Model with seasonality
sarima1 <- Arima(data$vehicle_sales, order = c(3, 1, 3), seasonal = list(order = c(1, 1, 1), period = 12))

summary(modauto)

Series: data$vehicle_sales 
ARIMA(3,1,3) 

Coefficients:
          ar1      ar2     ar3     ma1     ma2      ma3
      -0.5134  -0.5280  0.4515  0.0190  0.1831  -0.8445
s.e.   0.0503   0.0497  0.0491  0.0301  0.0298   0.0260

sigma^2 = 19574:  log likelihood = -3649.15
AIC=7312.3   AICc=7312.5   BIC=7342.77

Training set error measures:
                   ME     RMSE      MAE        MPE     MAPE      MASE
Training set 2.325575 139.0539 110.2803 -0.9573613 9.040582 0.8205302
                    ACF1
Training set -0.02804847

Code

summary(sarima1)

Series: data$vehicle_sales 
ARIMA(3,1,3)(1,1,1)[12] 

Coefficients:
          ar1      ar2     ar3     ma1     ma2      ma3    sar1     sma1
      -0.8840  -0.6987  0.2616  0.4187  0.1707  -0.7088  0.1314  -0.8503
s.e.   0.0809   0.0924  0.0802  0.0626  0.0665   0.0570  0.0528   0.0286

sigma^2 = 9527:  log likelihood = -3375.84
AIC=6769.68   AICc=6770   BIC=6808.66

Training set error measures:
                    ME     RMSE      MAE        MPE     MAPE      MASE
Training set -1.002446 95.80959 68.06646 -0.4734036 5.607386 0.5064422
                     ACF1
Training set -0.002777798

Code

## Fitted vs actual based on arima with seasonal effect(manual)
fitted1 <- fitted(sarima1)
plot(data$vehicle_sales, type='l', col='blue', xlab='Time', ylab='Vehicle Sales', main='Observed vs Fitted Values')
lines(fitted1, col='red')
legend("topleft", legend=c("Obs", "Fit"), col=c("blue", "red"), lty=1)

Code

##Residual and ACF plot of arima with seasonality (manual)
autoplot(residuals(sarima1))

Code

sarimaplot <- residuals(sarima1)
Acf(sarimaplot, main = "ACF of Residuals")

Code

hist(sarimaplot)

As we can infer from the ACF plot, the residuals seem to be white noise hence we can proceed with forecasting.

Code

# Box-Ljung test for arima model with seasonality
Box.test(sarimaplot, lag = 10, type = "Ljung-Box")


    Box-Ljung test

data:  sarimaplot
X-squared = 11.28, df = 10, p-value = 0.3362

Code

forecast_values <- forecast(sarima1, h = 6)
forecast_values

    Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
576       1504.150 1379.0592 1629.241 1312.8400 1695.461
577       1098.662  956.8123 1240.512  881.7215 1315.603
578       1247.028 1095.8428 1398.212 1015.8104 1478.245
579       1504.851 1345.6356 1664.067 1261.3519 1748.351
580       1334.173 1168.1691 1500.178 1080.2918 1588.055
581       1465.955 1293.3077 1638.602 1201.9139 1729.995

Code

# Plot the forecast
autoplot(forecast_values, conf.int = TRUE) +
  ylab('Retail Sales') +
  ggtitle('6 Month Forecast of Retail Sales')

Above, forecast values seem to be reasonable. The reason being that it has captured the seasonal effect in the data set. This consideration of all the trends that affects that sales seem to address the forecast with accuracy. As you can see from the above graph the projected vehicle sales in december(1504, As the vehicle sales data ends till november in the data set provided) seem to be higher than the future months which is in line with the sales trends over time.