# Load necessary librarieslibrary(tidyverse)library(tsibble)library(fable)library(feasts)# Load datadf <-read_csv("../dataset/zillow_sales.csv")# Convert to a tsibblezillow_ts <- df %>%select(date, zillow_sales) %>%mutate(date =yearmonth(date)) %>%as_tsibble(index = date)# Plot the original time serieszillow_ts %>%autoplot(zillow_sales) +labs(title ="Zillow Monthly Home Sales in Cincinnati",x ="Date (1 Month)",y ="Home Sales Count" )
From the plot we can see that there is a upward trend between 2011 - 2020 and then a downward trend due to the post Covid period. The suggests that variance and mean stationarity could be violated. There also seem to be seasonality in the data set. Next, I will be using some methods to exam non-stationarity.
Code
# Add rolling standard deviation (12-month window)zillow_roll <- zillow_ts %>%mutate(roll_sd = slider::slide_dbl(zillow_sales, sd, .before =6, .after =6, .complete =TRUE))zillow_rollsd <- zillow_roll %>%ggplot() +geom_line(aes(date, roll_sd)) +geom_smooth(aes(date,roll_sd),method='lm',se=F)+theme_bw() +ggtitle("Zillow Sales Standard Deviation over Time (12 month rolling window)") +ylab("Rolling SD") +xlab("Date")zillow_rollsd
Data appears to be variance non-stationary
Code
# KPSS test for level stationarityzillow_roll %>%features(zillow_sales, unitroot_kpss)
The KPSS test before the transformation shows the p value of 0.01, which means we reject the null hypothesis - it is mean non-stationary.
Next, I will perform the variance transformation using the box-cox method to make variance stationary; since there is seasonality in the data set, I will also perform seasonal differencing; and finally, examine seasonally difference data for mean stationarity.
The new KPSS test on the final transformed difference data has a p value of 0.1, now the data is stationary. I will continue the model with this data.
Section 2
Plot the ACF and PACF
Code
# Plot ACF and PACF of final stationary serieszillow_final %>%ACF(final_diff) %>%autoplot() +labs(title ="ACF of Stationary Series", y ="ACF")
Code
zillow_final %>%PACF(final_diff) %>%autoplot() +labs(title ="PACF of Stationary Series", y ="PACF")
There is no clear damping in the ACF plot, which indicates it might be a MA process. From the PACF plot, we see that lag 2 has significant impact, and lag 1 also seem significant, so I am assuming it may be a MA2. It doesn’t indicate it’s an AR process, so I am predicting it’s AR0.
Lag 12 shows strong seasonality, which is expected. It suggests that the PDQ value may be (1,1,1)
Next, I will be fitting several ARIMA models with different parameters on the pdq value to see which model perform the best. Then I will conduct a auto fit to see if my prediction match the “best” result the auto fitting provides.
Based on the BIC score (lower the better), we can see that ARIMA (011, 011) performs the best. It matches the initial guess about AR process, as it doesn’t show strong damping, so it’s a AR0. Although, the lag 2 showed significance in PACF, the model indicates a MA1 instead of MA2.
Code
auto_mod = zillow_ts %>%model(mod1 =ARIMA(box_cox(zillow_sales,lambda_bc),approximation=F,stepwise=F) # Takes a while for aproximation=F,stepwise=F to run)auto_mod %>%report()
The auto ARIMA fit result is very different compared to my initial guess. While the AIC is higher, which indicates this is a better model, the BIC is also higher. It could be that BIC is penalizing complexity more harshly. For this case, I am going to continue with the manual model.
Based on the graphic, the fitted value (in blue) and observed value (in black) are not too far off. It still follows the trend and shows some seasonality.
Residual appears to be white noise. From ACF there is still some significance in there, which indicates this model may not be picking up all the autocorrelations. The histogram looks normally distributed with few outliers.
Code
best_model %>%augment() %>%features(.innov, ljung_box, lag =24, dof =2)
# A tibble: 1 × 3
.model lb_stat lb_pvalue
<chr> <dbl> <dbl>
1 best 59.7 0.0000244
Ljung-Box confirms that the residual appears to be white noise. Some of the significance may be fixed using a different method of transformation, ex. log transformation.
Code
forecast_6mo <-forecast(best_model, h ="6 months")# Plot forecast vs historical dataforecast_6mo %>%autoplot(zillow_ts) +labs(title ="6-Month Forecast of Home Sales",y ="Sales Count", x ="Month")
The forecast looks pretty reasonable. It follows the existing pattern on the trend and seasonality, and stays within the expected ales range. The prediction intervals are appropriately sized, indicating a balanced tradoff between accuracy and uncertainty.