San Francisco Zillow Weekly Sales Forecasting: Week 2

Section 1

Time Series Data

Let’s plot the weekly median sales price data over the last 5 years.

Code
# Plot the weekly median sales price over time
x <- ggplot(data = sales_tbl_ts, aes(x = date, y = value)) +
  geom_line() +
  xlab("Time") +
  ylab("Median Sales Price") +
  ggtitle("Median Sales Price over Time")
x

Based on the time series plot above, the series does not appear to have a stationary mean or variance, so we will first compute the rolling standard deviation of the differenced series and examine it.

Rolling SD of Differenced Series

The plot above clearly shows that both the mean and the variability change over time. We will check whether the variance stabilizes after taking the first difference, using a rolling standard deviation.

Code
# First-difference the weekly median sales price
sales_diff <- sales_tbl_ts %>%
  mutate(value_diff = value - lag(value)) %>%
  as_tsibble(index = date)

# Rolling (12-week) standard deviation of the differenced series
sales_diff %>%
  mutate(
    diff_sd = zoo::rollapply(
      value_diff,
      FUN = sd,
      width = 12,
      fill = NA)) %>%
  ggplot() +
  geom_line(aes(date, diff_sd)) +
  geom_smooth(aes(date, diff_sd), method = 'lm', se = FALSE) +
  theme_bw() +
  ggtitle("Standard Deviation of Differenced Sales Price over Time") +
  ylab("SD of Differenced Sales Price") +
  xlab("Date")

The rolling standard deviation of the differenced series shows only a negligible trend in variance, so based on this output a log or Box-Cox transformation is not needed.
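
The same conclusion can be cross-checked numerically. Below is a minimal sketch, assuming the feasts package is available alongside fable: the Guerrero method estimates a Box-Cox lambda, where a value close to 1 supports leaving the series untransformed and a value near 0 would point to a log transform.

Code
# Sketch: estimate the Box-Cox lambda with the Guerrero method (feasts).
# A lambda near 1 means no transformation is needed.
sales_tbl_ts %>%
  features(value, features = guerrero)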

Seasonal Differencing

Since this is weekly property sales price data and we have observed seasonality, the next step is to take a seasonal difference. With weekly data we use a lag of 52, i.e. one year.

Code
# Seasonal difference at lag 52 (one year of weekly observations)
sales_diff <- sales_diff %>%
  mutate(value_seasonal = difference(value, 52))

sales_diff %>%
  gg_tsdisplay(value_seasonal, plot_type = 'partial', lag = 36) +
  labs(title = "Seasonally differenced", y = "")

Checking stationarity of the seasonally differenced series with the KPSS test.

Code
seasonal_value_kpss = sales_diff %>% 
     features(value_seasonal, unitroot_kpss)
seasonal_value_kpss
# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1     0.417      0.0697

The p-value is greater than 0.05, so we fail to reject the KPSS null hypothesis of stationarity; the seasonally differenced series can be treated as stationary.
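
As a further cross-check on the chosen differencing, the unit-root features can suggest the number of seasonal and ordinary differences directly. A minimal sketch, assuming feasts is loaded:

Code
# Sketch: estimate how many seasonal (nsdiffs) and ordinary (ndiffs)
# differences the series needs to become stationary.
sales_tbl_ts %>%
  features(value, list(unitroot_nsdiffs, unitroot_ndiffs))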

Section 2

ACF and PACF plots of the seasonally differenced series.

Code
# Base R ACF/PACF of the seasonally differenced series (NAs passed through)
par(mfrow = c(1, 2))
acf(sales_diff$value_seasonal, na.action = na.pass)
pacf(sales_diff$value_seasonal, na.action = na.pass)

The ACF decays gradually (a damped pattern), which points to an AR process; the AR order can then be read from the last significant PACF lag, which in this case is 2.

Based on the ACF/PACF plots and the stationarity check, a reasonable candidate is ARIMA(2,0,0) combined with the seasonal difference.
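
For reference, the same diagnostics can be produced with the tidy ACF/PACF helpers from feasts, which respect the tsibble index and drop the missing values created by differencing. A minimal sketch:

Code
# Sketch: tidy ACF/PACF of the seasonally differenced series (feasts).
sales_diff %>%
  ACF(value_seasonal, lag_max = 36) %>%
  autoplot() +
  ggtitle("ACF of seasonally differenced series")

sales_diff %>%
  PACF(value_seasonal, lag_max = 36) %>%
  autoplot() +
  ggtitle("PACF of seasonally differenced series")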

Section 3

We build several candidate models and compare them by BIC. Based on the results below, model 4, ARIMA(2,0,0), is the best model.

Code
models_bic = sales_diff %>%
  model(
    mod1 = ARIMA(value~pdq(0,0,0)+PDQ(0,1,0)),
    mod2 = ARIMA(value~pdq(2,0,1)+PDQ(0,1,0)),
    mod3 = ARIMA(value~pdq(1,0,0)+PDQ(0,1,0)),
    mod4 = ARIMA(value~pdq(2,0,0)+PDQ(0,1,0)),
    mod5 = ARIMA(value~pdq(1,0,1)+PDQ(0,1,0)),
    mod6 = ARIMA(value~pdq(0,0,2)+PDQ(0,1,0))
  )

models_bic %>%
  glance() %>%
  arrange(BIC)
# A tibble: 6 × 8
  .model      sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots 
  <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>   
1 mod4     44027366.  -2116. 4239. 4239. 4249. <cpl [2]> <cpl [0]>
2 mod5     45741846.  -2121. 4248. 4248. 4258. <cpl [1]> <cpl [1]>
3 mod2     45847100.  -2121. 4249. 4249. 4263. <cpl [2]> <cpl [1]>
4 mod3     87136953.  -2187. 4379. 4379. 4389. <cpl [1]> <cpl [0]>
5 mod6    331885343.  -2326. 4659. 4659. 4672. <cpl [0]> <cpl [2]>
6 mod1   3665291262.  -2573. 5149. 5149. 5156. <cpl [0]> <cpl [0]>

According to BIC, model 4, ARIMA(2,0,0), is the best model. Below we derive the fitted values and plot them against the observed values.

Code
# Keep the best model (by BIC) and extract its in-sample fitted values
best_model <- models_bic %>% select(mod4)
fitted <- best_model %>% augment() %>% pull(.fitted)

ggplot() +
  geom_line(aes(sales_diff$date, sales_diff$value)) +
  geom_line(aes(sales_diff$date, fitted), color = "blue", alpha=0.5) +
  theme_bw() +
  xlab("Date") +
  ylab("Sale Price") +
  ggtitle("Observed vs Fitted values")

The in-sample fitted values track the observed values very closely.
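
The visual impression can be backed up with standard in-sample error metrics. A minimal sketch using fabletools' accuracy():

Code
# Sketch: training-set accuracy (RMSE, MAE, MAPE, etc.) for the chosen model.
best_model %>%
  accuracy()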

Section 4

Residuals and Forecasting

Analyzing the residuals from the selected model

Code
best_model %>%
  gg_tsresiduals(lag=52)

Ljung-Box test for residual autocorrelation

Code
best_model %>%
  augment() %>%
  features(.innov, ljung_box, lag = 10, dof = 2)
# A tibble: 1 × 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 mod4      63.9  8.00e-11

The p-value of the Ljung-Box test is less than 0.05, which means significant autocorrelation remains in the residuals and the model is not adequate. We can use the automated ARIMA search to find a better model and then plot the forecast.

“Best” model using the automated ARIMA function from fable

Code
best_model_1 <- sales_diff %>%
model(
  ARIMA(value,approximation=F)
) %>%
report()
Series: value 
Model: ARIMA(3,0,1)(1,1,0)[52] 

Coefficients:
         ar1     ar2      ar3     ma1     sar1
      0.9589  0.4572  -0.4421  0.9842  -0.5297
s.e.  0.0646  0.0935   0.0652  0.0187   0.0579

sigma^2 estimated as 26447684:  log likelihood=-2071.73
AIC=4155.46   AICc=4155.88   BIC=4175.46
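
Before trusting the automatically selected model, it is worth repeating the residual diagnostics on it. A minimal sketch, with dof set to 5 to match the five estimated coefficients (ar1, ar2, ar3, ma1, sar1):

Code
# Sketch: residual diagnostics for the auto-selected ARIMA(3,0,1)(1,1,0)[52].
best_model_1 %>%
  gg_tsresiduals(lag = 52)

best_model_1 %>%
  augment() %>%
  features(.innov, ljung_box, lag = 10, dof = 5)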

6-Week Seasonal ARIMA Forecast

We plot the forecast for the next 6 weeks using the automatically selected model ARIMA(3,0,1)(1,1,0)[52].

Code
best_model_1 %>%
  forecast(h=6) %>%
  autoplot(sales_diff)+
  ylab('Sales Price')+
  ggtitle("6 Weeks Forecast")+
  theme_bw()
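
Alongside the fan chart, the forecast means and prediction intervals can be pulled out as a table. A minimal sketch using hilo():

Code
# Sketch: 6-week forecasts with 80% and 95% prediction intervals as a table.
best_model_1 %>%
  forecast(h = 6) %>%
  hilo(level = c(80, 95))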