BANA 7050 Assignment 3

Author

Ruowei Fischer

Section 1

Code
# Load necessary libraries
library(tidyverse)
library(tsibble)
library(fable)
library(feasts)

# Load data
df <- read_csv("../dataset/zillow_sales.csv")

# Convert to a tsibble
zillow_ts <- df %>%
  select(date, zillow_sales) %>%
  mutate(date = yearmonth(date)) %>%
  as_tsibble(index = date)

# Plot the original time series
zillow_ts %>%
  autoplot(zillow_sales) +
  labs(
    title = "Zillow Monthly Home Sales in Cincinnati",
    x = "Date (1 Month)",
    y = "Home Sales Count"
  )

From the plot we can see that there is a upward trend between 2011 - 2020 and then a downward trend due to the post Covid period. The suggests that variance and mean stationarity could be violated. There also seem to be seasonality in the data set. Next, I will be using some methods to exam non-stationarity.

Code
# Add rolling standard deviation (12-month window)
zillow_roll <- zillow_ts %>%
  mutate(roll_sd = slider::slide_dbl(zillow_sales, sd, .before = 6, .after = 6, .complete = TRUE))

zillow_rollsd <- zillow_roll %>%
  ggplot() +
  geom_line(aes(date, roll_sd)) +
  geom_smooth(aes(date,roll_sd),method='lm',se=F)+
  theme_bw() +
  ggtitle("Zillow Sales Standard Deviation over Time (12 month rolling window)") +
  ylab("Rolling SD") +
  xlab("Date")

zillow_rollsd

Data appears to be variance non-stationary

Code
# KPSS test for level stationarity
zillow_roll %>%
  features(zillow_sales, unitroot_kpss)
# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1      1.74        0.01

The KPSS test before the transformation shows the p value of 0.01, which means we reject the null hypothesis - it is mean non-stationary.

Next, I will perform the variance transformation using the box-cox method to make variance stationary; since there is seasonality in the data set, I will also perform seasonal differencing; and finally, examine seasonally difference data for mean stationarity.

Code
# Apply Box-Cox transformation with estimated lambda
lambda_bc <- zillow_ts %>%
  features(zillow_sales, features = guerrero) %>%
  pull(lambda_guerrero)

zillow_boxcox <- zillow_ts %>%
  mutate(sales_boxcox = box_cox(zillow_sales, lambda_bc))

# Plot transformed data
zillow_boxcox %>%
  autoplot(sales_boxcox) +
  labs(
    title = "Box-Cox Transformed Zillow Sales",
    y = "Transformed Sales"
  )

Code
zillow_seasdiff <- zillow_boxcox %>%
  mutate(seasonal_diff = difference(sales_boxcox, lag = 12))

# Plot seasonally differenced data
zillow_seasdiff %>%
  autoplot(seasonal_diff) +
  labs(
    title = "Seasonally Differenced (lag=12) Zillow Sales",
    y = "Box-Cox Transformed & Seasonally Differenced"
  )

Code
zillow_final <- zillow_seasdiff %>%
  mutate(final_diff = difference(seasonal_diff, lag = 1))

# Plot fully differenced series
zillow_final %>%
  autoplot(final_diff) +
  labs(
    title = "Final Differenced Zillow Sales (Box-Cox + Seasonal + First-Order)",
    y = "Stationary Transformed Series"
  )

Code
# KPSS on final differenced series
zillow_final %>%
  features(final_diff, unitroot_kpss)
# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1    0.0306         0.1

The new KPSS test on the final transformed difference data has a p value of 0.1, now the data is stationary. I will continue the model with this data.

Section 2

Plot the ACF and PACF

Code
# Plot ACF and PACF of final stationary series
zillow_final %>%
  ACF(final_diff) %>%
  autoplot() +
  labs(title = "ACF of Stationary Series", y = "ACF")

Code
zillow_final %>%
  PACF(final_diff) %>%
  autoplot() +
  labs(title = "PACF of Stationary Series", y = "PACF")

There is no clear damping in the ACF plot, which indicates it might be a MA process. From the PACF plot, we see that lag 2 has significant impact, and lag 1 also seem significant, so I am assuming it may be a MA2. It doesn’t indicate it’s an AR process, so I am predicting it’s AR0.

Lag 12 shows strong seasonality, which is expected. It suggests that the PDQ value may be (1,1,1)

Next, I will be fitting several ARIMA models with different parameters on the pdq value to see which model perform the best. Then I will conduct a auto fit to see if my prediction match the “best” result the auto fitting provides.

Section 3

Code
library(fabletools)

# Fit multiple ARIMA models
fit_models <- zillow_ts %>%
  model(
    ARIMA210111 = ARIMA(box_cox(zillow_sales,lambda_bc) ~ pdq(2,1,0) + PDQ(1,1,1)),
    ARIMA111111 = ARIMA(box_cox(zillow_sales,lambda_bc) ~ pdq(1,1,1) + PDQ(1,1,1)),
    ARIMA211111 = ARIMA(box_cox(zillow_sales,lambda_bc) ~ pdq(2,1,1) + PDQ(1,1,1)),
    ARIMA011011 = ARIMA(box_cox(zillow_sales,lambda_bc) ~ pdq(0,1,1) + PDQ(0,1,1))
  )
# Compare models
glance(fit_models) %>% arrange(BIC)
# A tibble: 4 × 8
  .model      sigma2 log_lik   AIC  AICc   BIC ar_roots   ma_roots  
  <chr>        <dbl>   <dbl> <dbl> <dbl> <dbl> <list>     <list>    
1 ARIMA011011  0.112   -77.5  161.  161.  171. <cpl [0]>  <cpl [13]>
2 ARIMA111111  0.111   -74.0  158.  158.  174. <cpl [13]> <cpl [13]>
3 ARIMA210111  0.113   -76.3  163.  163.  179. <cpl [14]> <cpl [12]>
4 ARIMA211111  0.111   -74.0  160.  161.  180. <cpl [14]> <cpl [13]>

Based on the BIC score (lower the better), we can see that ARIMA (011, 011) performs the best. It matches the initial guess about AR process, as it doesn’t show strong damping, so it’s a AR0. Although, the lag 2 showed significance in PACF, the model indicates a MA1 instead of MA2.

Code
auto_mod = zillow_ts %>%
model(
  mod1 = ARIMA(box_cox(zillow_sales,lambda_bc),approximation=F,stepwise=F) # Takes a while for aproximation=F,stepwise=F to run
)

auto_mod %>%
report()
Series: zillow_sales 
Model: ARIMA(0,1,4)(2,1,0)[12] 
Transformation: box_cox(zillow_sales, lambda_bc) 

Coefficients:
          ma1      ma2      ma3      ma4     sar1     sar2
      -0.2579  -0.0849  -0.0163  -0.2410  -0.5307  -0.2966
s.e.   0.0729   0.0805   0.0718   0.0763   0.0784   0.0823

sigma^2 estimated as 0.1529:  log likelihood=-90.12
AIC=194.24   AICc=194.86   BIC=216.94

The auto ARIMA fit result is very different compared to my initial guess. While the AIC is higher, which indicates this is a better model, the BIC is also higher. It could be that BIC is penalizing complexity more harshly. For this case, I am going to continue with the manual model.

Code
best_model <- zillow_ts %>%
  model(best = ARIMA(box_cox(zillow_sales, lambda_bc) ~ pdq(0,1,1) + PDQ(0,1,1)))
# Get fitted values
fitted = best_model %>%
  augment() %>%
  .$.fitted

ggplot() +
  geom_line(aes(zillow_ts$date, zillow_ts$zillow_sales)) +
  geom_line(aes(zillow_ts$date, fitted), color = "blue", alpha = 0.4) +
  theme_bw() +
  xlab("Date") +
  ylab("Sales")

Based on the graphic, the fitted value (in blue) and observed value (in black) are not too far off. It still follows the trend and shows some seasonality.

Section 4

Code
# Extract residuals
best_model %>%
  gg_tsresiduals()

Residual appears to be white noise. From ACF there is still some significance in there, which indicates this model may not be picking up all the autocorrelations. The histogram looks normally distributed with few outliers.

Code
best_model %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24, dof = 2) 
# A tibble: 1 × 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 best      59.7 0.0000244

Ljung-Box confirms that the residual appears to be white noise. Some of the significance may be fixed using a different method of transformation, ex. log transformation.

Code
forecast_6mo <- forecast(best_model, h = "6 months")

# Plot forecast vs historical data
forecast_6mo %>%
  autoplot(zillow_ts) +
  labs(title = "6-Month Forecast of Home Sales",
       y = "Sales Count", x = "Month")

The forecast looks pretty reasonable. It follows the existing pattern on the trend and seasonality, and stays within the expected ales range. The prediction intervals are appropriately sized, indicating a balanced tradoff between accuracy and uncertainty.