Code
x <- ggplot(data = sales_tbl_ts, aes(x = date, y = value)) +
geom_line() +
xlab("Time") +
ylab("Values") +
ggtitle("Time vs Values Plot")
xLet’s plot the sales price median data over the last 5 years.
x <- ggplot(data = sales_tbl_ts, aes(x = date, y = value)) +
geom_line() +
xlab("Time") +
ylab("Values") +
ggtitle("Time vs Values Plot")
xBased on the time series graph plotted above, we see that the series does not have stationary mean and variance , so first we’ll use rolling SD of differenced series and check the data.
As we see from the above plot we can clearly see mean and SD variance. We will try to check the variance after taking the rolling difference.
sales_diff <- sales_tbl_ts %>%
mutate(value_diff = value - lag(value)) %>%
as_tsibble(index=date)
#| code-fold: true
sales_diff %>%
mutate(
diff_sd = zoo::rollapply(
value,
FUN = sd,
width = 12,
fill = NA)) %>%
ggplot()+
geom_line(aes(date,diff_sd))+
geom_smooth(aes(date,diff_sd),method='lm',se=F)+
theme_bw()+
ggtitle("Standard Deviation of Differenced Sales Price, over Time") +
ylab("SD of Differenced Sales Price") +
xlab("Date")+
theme_bw()`geom_smooth()` using formula = 'y ~ x'
After doing the rolling SD of differences series we can see negligible variance, based on the output we don’t need to do log transformation or BoxCox transformation.
As its a property sales price data and we have observed seasonality, in the next step we will check the seasonal difference. We are using a value difference of 1 year.
sales_diff <- sales_diff %>%
mutate(value_seasonal = difference(value,52)
)
sales_diff %>%
gg_tsdisplay(value_seasonal,plot_type='partial', lag=36) +
labs(title="Seasonally differenced", y="")Checking the stationary using KPSS test.
seasonal_value_kpss = sales_diff %>%
features(value_seasonal, unitroot_kpss)
seasonal_value_kpss# A tibble: 1 × 2
kpss_stat kpss_pvalue
<dbl> <dbl>
1 0.417 0.0697
The p-value is > 0.05, so we can say the seasonal value that we have derived is is stationary now.
ACF and PACF plots using the seasonal data.
par(mfrow = c(1, 2))
acf(sales_diff$value_seasonal, na.action = na.pass)
pacf(sales_diff$value_seasonal, na.action = na.pass)We can see the dampening effect in ACF plot so AR order can be determined using the PACF significant lag which in this case is 2.
Based on the data from the lag plots and stationary we can assume ARIMA(2,0,0)
Build and Compare Some Models by BIC, based on the below result set. Model 4 with ARIMA (2,0,0) is the best model.
models_bic = sales_diff %>%
model(
mod1 = ARIMA(value~pdq(0,0,0)+PDQ(0,1,0)),
mod2 = ARIMA(value~pdq(2,0,1)+PDQ(0,1,0)),
mod3 = ARIMA(value~pdq(1,0,0)+PDQ(0,1,0)),
mod4 = ARIMA(value~pdq(2,0,0)+PDQ(0,1,0)),
mod5 = ARIMA(value~pdq(1,0,1)+PDQ(0,1,0)),
mod6 = ARIMA(value~pdq(0,0,2)+PDQ(0,1,0)),
)
models_bic %>%
glance() %>%
arrange(BIC)# A tibble: 6 × 8
.model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
1 mod4 44027366. -2116. 4239. 4239. 4249. <cpl [2]> <cpl [0]>
2 mod5 45741846. -2121. 4248. 4248. 4258. <cpl [1]> <cpl [1]>
3 mod2 45847100. -2121. 4249. 4249. 4263. <cpl [2]> <cpl [1]>
4 mod3 87136953. -2187. 4379. 4379. 4389. <cpl [1]> <cpl [0]>
5 mod6 331885343. -2326. 4659. 4659. 4672. <cpl [0]> <cpl [2]>
6 mod1 3665291262. -2573. 5149. 5149. 5156. <cpl [0]> <cpl [0]>
According to BIC model 4 ARIMA(2,0,0) is the best model,below we are deriving the fitted values and plotting them against the observed values
best_model <- models_bic['mod4']
fitted <- best_model %>% augment() %>% .$.fitted
ggplot() +
geom_line(aes(sales_diff$date, sales_diff$value)) +
geom_line(aes(sales_diff$date, fitted), color = "blue", alpha=0.5) +
theme_bw() +
xlab("Date") +
ylab("Sale Price") +
ggtitle("Observed vs Fitted values")The in-sample data matches very closely to the observed values.
Analyzing the residuals from the selected model
best_model %>%
gg_tsresiduals(lag=52)Box-Ljung test for residual autocorrelation
best_model %>%
augment() %>%
features(.innov, ljung_box, lag = 10, dof = 2)# A tibble: 1 × 3
.model lb_stat lb_pvalue
<chr> <dbl> <dbl>
1 mod4 63.9 8.00e-11
p-value for auto Box -Ljung test is less than 0.05, that means correlation exists and we don’t have optimum model.We can try auto arima to find out the best model 1st and then plot the forecast.
“Best” model using automated ARIMA function from fable
best_model_1 <- sales_diff %>%
model(
ARIMA(value,approximation=F)
) %>%
report()Series: value
Model: ARIMA(3,0,1)(1,1,0)[52]
Coefficients:
ar1 ar2 ar3 ma1 sar1
0.9589 0.4572 -0.4421 0.9842 -0.5297
s.e. 0.0646 0.0935 0.0652 0.0187 0.0579
sigma^2 estimated as 26447684: log likelihood=-2071.73
AIC=4155.46 AICc=4155.88 BIC=4175.46
6 time-period Seasonal ARIMA forecast
We will plot the next 6 weeks forecast using the auto arima model ARIMA(3,0,1)(1,1,0)
best_model_1 %>%
forecast(h=6) %>%
autoplot(sales_diff)+
ylab('Sales Price')+
ggtitle("6 Weeks Forecast")+
theme_bw()