The data set given is the vehicle sales over years recorded month on month. This data set would be seasonal as the vehicle sales depends on time period of the year, economic conditions, etc. As we can from the below plot the data set seem to be variance stationary with the vehicle sales and the mean is also stationary over the period of time based on the KPSS test performed which gave the KPSS stat value of 1.52 and the p-value of 0.1 which is less than that of the significance level. This means that we reject the null hypothesis (Null Hypothesis- The vehicle sales series has a constant mean and variance over time).
Code
ggplot() +geom_line(aes(x = date, y = data$vehicle_sales)) +labs(x ="Month", y ="Vehicle Sales", title ="Monthly Vehicle Sales Over Time")
Warning in kpss.test(data$vehicle_sales): p-value smaller than printed p-value
KPSS Test for Level Stationarity
data: data$vehicle_sales
KPSS Level = 1.5216, Truncation lag parameter = 6, p-value = 0.01
------Section 2---------
Based on the interpretation from the below ACF and PACF plots the dampening suggests that the model could be AR process, maybe of order 2. Also, it could be MA process of order 3 based on the significance levels in PACF plot.
The significance levels at 12 and 24 suggests that the data could also be a seasonal arima process.
Warning in plot.xy(xy.coords(x, y), type = type, ...): plot type 'partial' will
be truncated to first character
Warning in plot.xy(xy.coords(x, y), type = type, ...): plot type 'partial' will
be truncated to first character
Warning in plot.xy(xy, type, ...): plot type 'partial' will be truncated to
first character
Warning in plot.xy(xy, type, ...): plot type 'partial' will be truncated to
first character
-------Section 3-------
My best guess model was ARIMA(2,1,3). Based on the dampening I assumed that the model might be a AR2 but it seems that the best fit model does not have the AR process at all. The best fit according to the BIC (the least BIC Value) is Arima(3, 1, 3) with the BIC value of 7342.77.
I believe that the reason could be because, I did not fit the seasonal effects in my manual model with PDQ being 0 in all cases.
Code
mod1 =Arima(data$vehicle_sales, order =c(2, 1, 1), seasonal =c(0, 0, 0)) mod2 =Arima(data$vehicle_sales, order =c(2, 1, 3), seasonal =c(0, 0, 0)) ##Predicted model without seasonalitymod3 =Arima(data$vehicle_sales, order =c(1,1,3), seasonal =c(0,0,0))modauto <-auto.arima(data$vehicle_sales, approximation=F, seasonal = T)summary(mod1)
Series: data$vehicle_sales
ARIMA(2,1,1)
Coefficients:
ar1 ar2 ma1
-1.0981 -0.4598 0.6621
s.e. 0.0716 0.0386 0.0736
sigma^2 = 22928: log likelihood = -3694.67
AIC=7397.34 AICc=7397.41 BIC=7414.75
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 0.9660185 150.8907 119.5875 -0.9175722 9.751778 0.8897796
ACF1
Training set -0.02574537
Code
summary(mod2)
Series: data$vehicle_sales
ARIMA(2,1,3)
Coefficients:
ar1 ar2 ma1 ma2 ma3
-0.0946 0.4633 -0.4233 -0.6986 0.2554
s.e. 0.1141 0.0766 0.1115 0.0671 0.0627
sigma^2 = 21955: log likelihood = -3681.4
AIC=7374.81 AICc=7374.96 BIC=7400.92
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 2.353971 147.3964 115.3181 -1.0964 9.442797 0.8580137 -0.03539981
Code
summary(mod3)
Series: data$vehicle_sales
ARIMA(1,1,3)
Coefficients:
ar1 ma1 ma2 ma3
0.5616 -1.1651 0.3672 -0.1115
s.e. 0.0919 0.0953 0.0956 0.0495
sigma^2 = 22033: log likelihood = -3682.92
AIC=7375.85 AICc=7375.95 BIC=7397.61
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 2.404843 147.7884 115.7806 -1.092335 9.448824 0.8614551 0.01561606
Code
summary(modauto)
Series: data$vehicle_sales
ARIMA(3,1,3)
Coefficients:
ar1 ar2 ar3 ma1 ma2 ma3
-0.5134 -0.5280 0.4515 0.0190 0.1831 -0.8445
s.e. 0.0503 0.0497 0.0491 0.0301 0.0298 0.0260
sigma^2 = 19574: log likelihood = -3649.15
AIC=7312.3 AICc=7312.5 BIC=7342.77
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 2.325575 139.0539 110.2803 -0.9573613 9.040582 0.8205302
ACF1
Training set -0.02804847
The fitted model has failed in predicting the trends with a difference which is significat even when the i specifically mentioned ‘seasonl = TRUE’. In the previously fitted model seasonality was not considered hence the difference in fitted vs actual. Below is the model with the seasonality effect which is based on trail and error method using ‘Arima’. Here the PDQ values are (1,1,1) with BIC value of 6808.66.
The Box-Ljung test gave the p-value of 0.0117 for auto arima plot which is less than the significance value meaning that the null hypothesis must be rejected. As the auto arima model does not contain the seasonality effect I have created a new model with the seasonality effect which has the value of 0.3362 which is clearly higher than 0.05. Hence, we are using the model with seasonality effect.
The observed vs fitted values of the seasonal arima almost captures the trends and seasonality of the data. Hence the correction was made and the updated model is the best fit model.
Code
##Residual and ACF plot of auto arima model without seasonalityautoplot(residuals(modauto))
Code
resauto <-residuals(modauto)Acf(resauto, main ="ACF of Residuals")
Code
hist(resauto)
Code
# Conduct Box-Ljung test for auto arima model without seasonalityBox.test(resauto, lag =10, type ="Ljung-Box")
##Model with seasonalitysarima1 <-Arima(data$vehicle_sales, order =c(3, 1, 3), seasonal =list(order =c(1, 1, 1), period =12))summary(modauto)
Series: data$vehicle_sales
ARIMA(3,1,3)
Coefficients:
ar1 ar2 ar3 ma1 ma2 ma3
-0.5134 -0.5280 0.4515 0.0190 0.1831 -0.8445
s.e. 0.0503 0.0497 0.0491 0.0301 0.0298 0.0260
sigma^2 = 19574: log likelihood = -3649.15
AIC=7312.3 AICc=7312.5 BIC=7342.77
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 2.325575 139.0539 110.2803 -0.9573613 9.040582 0.8205302
ACF1
Training set -0.02804847
Code
summary(sarima1)
Series: data$vehicle_sales
ARIMA(3,1,3)(1,1,1)[12]
Coefficients:
ar1 ar2 ar3 ma1 ma2 ma3 sar1 sma1
-0.8840 -0.6987 0.2616 0.4187 0.1707 -0.7088 0.1314 -0.8503
s.e. 0.0809 0.0924 0.0802 0.0626 0.0665 0.0570 0.0528 0.0286
sigma^2 = 9527: log likelihood = -3375.84
AIC=6769.68 AICc=6770 BIC=6808.66
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -1.002446 95.80959 68.06646 -0.4734036 5.607386 0.5064422
ACF1
Training set -0.002777798
Code
## Fitted vs actual based on arima with seasonal effect(manual)fitted1 <-fitted(sarima1)plot(data$vehicle_sales, type='l', col='blue', xlab='Time', ylab='Vehicle Sales', main='Observed vs Fitted Values')lines(fitted1, col='red')legend("topleft", legend=c("Obs", "Fit"), col=c("blue", "red"), lty=1)
Code
##Residual and ACF plot of arima with seasonality (manual)autoplot(residuals(sarima1))
Code
sarimaplot <-residuals(sarima1)Acf(sarimaplot, main ="ACF of Residuals")
Code
hist(sarimaplot)
As we can infer from the ACF plot, the residuals seem to be white noise hence we can proceed with forecasting.
Code
# Box-Ljung test for arima model with seasonalityBox.test(sarimaplot, lag =10, type ="Ljung-Box")
forecast_values <-forecast(sarima1, h =6)forecast_values
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
576 1504.150 1379.0592 1629.241 1312.8400 1695.461
577 1098.662 956.8123 1240.512 881.7215 1315.603
578 1247.028 1095.8428 1398.212 1015.8104 1478.245
579 1504.851 1345.6356 1664.067 1261.3519 1748.351
580 1334.173 1168.1691 1500.178 1080.2918 1588.055
581 1465.955 1293.3077 1638.602 1201.9139 1729.995
Code
# Plot the forecastautoplot(forecast_values, conf.int =TRUE) +ylab('Retail Sales') +ggtitle('6 Month Forecast of Retail Sales')
Above, forecast values seem to be reasonable. The reason being that it has captured the seasonal effect in the data set. This consideration of all the trends that affects that sales seem to address the forecast with accuracy. As you can see from the above graph the projected vehicle sales in december(1504, As the vehicle sales data ends till november in the data set provided) seem to be higher than the future months which is in line with the sales trends over time.