1 Preparations
- 1.1 Load libraries
- 1.2 Load data
2 Exploratory Data Analysis
3 Forecasting using an ARIMA model
4 Forecasting using a PROPHET model
5 Conclusion

This dataset from Zillow seemed interesting to me as I can see very interesting features. Also, the data has good depth. I think it shows us how the numbers differ before and after 2008 state-wise. I downloaded int from the Zillow researh data.

I tried to do some illustrative data visualizations with short analysis. I have also performed Linear regression on the New york state, ZHVI_Allhome prices.

Zillow publishes several different measures of home values monthly including median list prices, median sale prices, and the Zillow Home Value Index (ZHVI), but for almost all use cases we believe ZHVI to be the most representative measure of changing home values over time.

There’s a lot that goes into making the ZHVI. But the one-sentence explanation is that Zillow takes all estimated home values for a given region and month (Zestimates), takes a median of these values, applies some adjustments to account for seasonality or errors in individual home estimates, and then does the same across all months over the past 20 years and for many different geography levels (ZIP, neighborhood, city, county, metro, state, and country).

I decided to use ZHVI to track home values over time for the very simple reason that ZHVI represents the whole housing stock and not just the homes that list or sell in a given month. Imagine a month where no homes outside of California sold.

1 Preparations

1.1 Load libraries

library(dplyr)
library(ggplot2)
library(reshape)

## Warning: package 'reshape' was built under R version 4.1.2

library(wesanderson)

## Warning: package 'wesanderson' was built under R version 4.1.2

library(stringr)

## Warning: package 'stringr' was built under R version 4.1.2

library(doBy)

## Warning: package 'doBy' was built under R version 4.1.2

library(plotly)

## Warning: package 'plotly' was built under R version 4.1.2

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.1.2

library(wesanderson)
library(RColorBrewer)
library(gridExtra)

## Warning: package 'gridExtra' was built under R version 4.1.2

library(prophet)

## Warning: package 'prophet' was built under R version 4.1.2

library(plyr)

## Warning: package 'plyr' was built under R version 4.1.2

1.2 Load data

data=read.csv("C:/Users/neeya/Downloads/State_time_series.csv")
state<- read.csv("C:/Users/neeya/Downloads/State_time_series.csv")
state_new_york <- state %>%
  filter(RegionName=='NewYork') 

sprintf("The data set has %d rows and %d columns", nrow(state_new_york), ncol(state_new_york) )

## [1] "The data set has 261 rows and 82 columns"

2 Exploratory Data Analysis

The data set has 13212 rows and 82 columns. We will get an overview of the data set using the str, and summary.

We find :

As we can see from the empty entries in the table above, relatively a number of values are missing.

2.1 Structure

2.2 Summary

2.3 Missing Values

Now, we will investigate missing values in detail.

2.4 By feature

missing_value <- lapply(state_new_york, function(x) { round((sum(is.na(x)) / length(x)) * 100, 1) })
melt(data.frame(missing_value))%>%
    ggplot(aes(x= reorder(variable, -value), y= value))+
    geom_col(width=1, fill= "darkred", alpha=0.7 )+
    labs(y="Missing pct(%)", x=NULL, title = "Missing Values by feature")+
    theme(axis.ticks.x=element_line(colour="gray90"),
          axis.text.y=element_text(size=5.5))+
    coord_flip()

2.5 Individual feature visualisations

Now we are ready to start visualizing the data. Let’s start with some overview plots for individual features, to explore the data set.

library(lubridate)
data$Year<-year(ymd(data$Date))

Now we are ready to start visualizing the data. Let’s start with some overview plots for individual features, to explore the data set.

2.5.1 How is this dependent variable(ZHVI All Homes) distributed?

state_new_york %>%
  ggplot(aes(x=ZHVI_AllHomes))+
  geom_histogram()

na_count <- function(x){sum(is.na(x))}

names_of_columns_to_keep <- unlist(apply(X=state_new_york,FUN = na_count,MARGIN = 2))
names_of_columns_to_keep <- names_of_columns_to_keep[names_of_columns_to_keep==0]
names_of_columns_to_keep <- names(names_of_columns_to_keep)

names_of_columns_to_keep

## [1] "Date"                "RegionName"          "ZHVI_2bedroom"      
## [4] "ZHVI_3bedroom"       "ZHVI_5BedroomOrMore" "ZHVI_BottomTier"

As you can see, the ZHVI values are right-skewed. It means we might need to transform this variable and make it more normally distributed before modeling.

2.5.2 ZHVI value- Growth by Year

attach(state_new_york)

state_new_york$Year<-year(ymd(state_new_york$Date))
MSP <- aggregate(ZHVI_AllHomes ~Year, state_new_york, mean)

sl1 <-ggplot(MSP, aes(x=as.factor(Year), y=ZHVI_AllHomes))+
  geom_line(color= "green", aes(group=1), size=1.5)+
  geom_point(color= "green", size = 3.5, alpha=0.5)+
  labs(title="The Growth of Sale Prices by year", x=NULL, y="ZHVI All Homes Price")+
  theme( plot.title=element_text(vjust=3, size=15) ) + theme_minimal()
sl1

MSP$rate = c(0, 100*diff(MSP$ZHVI_AllHomes)/MSP[-nrow(MSP),]$ZHVI_AllHomes)

sl2 <-ggplot(MSP, aes(x=as.factor(Year), y=rate))+
  geom_line(color= "gray50", aes(group=1), size=1)+ labs(title="Change rate of ZHVI All Homes Price", x="Year", y="rate of change")+
  geom_hline(yintercept = 0, color = "green" )+
  theme(plot.title=element_text(size=15))+ theme_minimal()

grid.arrange(sl1,sl2)

2.6 Multi-parameter visualisation

2.6.1 Correlation plot; ZHVI_All_Homes, Sales_Counts, price and counts

subset(data, select=c(Sale_Counts, Sale_Prices, ZHVI_AllHomes))%>%
    na.omit()%>% cor()%>% 
    corrplot.mixed(lower = "number", upper = "ellipse", tl.cex=0.7,
                   cl.ratio=0.2, cl.cex=0.6, tl.col="black")

We can guess that house value for sale and rent will go the same way.

2.6.2 Correlation plot; ZHVI_All_Homes,ZRI and Days

subset(data, select=c(DaysOnZillow_AllHomes, ZHVI_AllHomes, ZRI_AllHomes))%>%
    na.omit()%>%
    cor()%>%
    corrplot.mixed(lower = "number", upper = "ellipse", tl.cex=0.6, cl.ratio=0.2,
                   cl.cex=0.6, tl.col="black")

2.6.3 Days on Zillow

Changes of House values- Let’s have a look, how these values change according to the Days on Zillow

aggregate(ZHVI_AllHomes ~DaysOnZillow_AllHomes, data, mean)%>%
  ggplot(aes(x=DaysOnZillow_AllHomes, y=ZHVI_AllHomes))+
  geom_line(col=wes_palette("GrandBudapest2")[1])+
  labs(title="House Sale Value Decreases", x=NULL, y=NULL)+
    stat_smooth(method="loess", color="#0F327B", alpha=0.3)+
  theme_minimal()

aggregate(ZHVI_AllHomes ~DaysOnZillow_AllHomes, data, mean)%>%
  ggplot(aes(x=DaysOnZillow_AllHomes, y=ZHVI_AllHomes))+
  geom_line(col=wes_palette("GrandBudapest2")[4])+
  labs(title="House Rent Value Decrease", x=NULL, y=NULL)+
   stat_smooth(method="loess", color="#0F327B", alpha=0.3)+
  theme_minimal()

As we can see, both values tend to decrease with the passage of days.

3 Forecasting using an ARIMA model

Steps to be followed for ARIMA modeling:

Exploratory analysis
Fit the model
Diagnostic measures

3.1 Behaviour of the Series

# Trend chart
monthly_price_plot = ggplot(state_new_york)+
  geom_line(aes(Year,ZHVI_AllHomes))+
  xlab("Date")+ylab("ZHVI value for New York Housing")+labs(title = 'ZHVI value for New York Housing',subtitle = 'January 2008 - January 2016')+theme_bw()

#monthly_price_plot

monthly_price_plot + 
  geom_smooth(aes(Year,ZHVI_AllHomes),method='lm',color='orange')

The trend chart indicates a drop in the ZHVI value from year 2008 this is because of the impact of global recession. The housing value started to increase from 2013 and we can see it kept on increasing.

We need to understand the three components of a time series data. I used the following R code to find out the components of this time series:

train2 = na.omit(with(data,data.frame( ds = Date, y= log(ZHVI_AllHomes), rn = RegionName)))
m_train = aggregate(y ~ds, train2, mean)
NY<-ts(train2$y[train2$rn=="NewYork"], start=c(1996,04), freq=12)
plot(decompose(NY), color)

Here we get 4 components:

Observed – the actual data plot Trend – the overall upward or downward movement of the data points Seasonal – any monthly/yearly pattern of the data points Random – unexplainable part of the data Observing these 4 graphs closely, we can find out if the data satisfies all the assumptions of ARIMA modeling, mainly, stationarity and seasonality.

Various plots and functions that help in detecting seasonality:

A seasonal subseries plot

Multiple box plot
Auto correlation plot
ndiffs() is used to determine the number of first differences required to make the time series non-seasonal

library(forecast)
NY_adjusted <- NY- (decompose(NY))$seasonal
ndiffs(NY_adjusted)

## [1] 2

NY_adjusted

##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1996                            12.71835 12.71735 12.71655 12.71538 12.70861
## 1997 12.67761 12.67041 12.66378 12.65242 12.63977 12.62817 12.61706 12.60180
## 1998 12.53003 12.51731 12.51084 12.50698 12.50360 12.50155 12.49986 12.49913
## 1999 12.47925 12.47583 12.47249 12.46811 12.46346 12.45900 12.45571 12.45258
## 2000 12.43812 12.43653 12.43500 12.43087 12.42686 12.42144 12.41643 12.41192
## 2001 12.39851 12.39811 12.40018 12.40005 12.40088 12.40150 12.40175 12.40252
## 2002 12.40591 12.40836 12.41328 12.41477 12.41558 12.41781 12.42046 12.42284
## 2003 12.42736 12.43056 12.43619 12.43962 12.44158 12.44296 12.44475 12.44748
## 2004 12.45773 12.46007 12.46441 12.46811 12.47307 12.47708 12.47955 12.48373
## 2005 12.49659 12.49884 12.50307 12.50623 12.51173 12.51814 12.52522 12.53358
## 2006 12.60721 12.61588 12.61911 12.62122 12.62542 12.62981 12.63315 12.63685
##           Sep      Oct      Nov      Dec
## 1996 12.69821 12.69150 12.68866 12.68386
## 1997 12.58321 12.56830 12.55685 12.54438
## 1998 12.49872 12.49639 12.48984 12.48340
## 1999 12.44894 12.44528 12.44241 12.43965
## 2000 12.40725 12.40505 12.40416 12.40133
## 2001 12.40356 12.40423 12.40375 12.40421
## 2002 12.42551 12.42737 12.42728 12.42650
## 2003 12.45051 12.45313 12.45458 12.45652
## 2004 12.48972 12.49377 12.49548 12.49619
## 2005 12.54400 12.55567 12.56980 12.58932
## 2006

NY

##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1996                            12.71830 12.71770 12.71710 12.71650 12.70897
## 1997 12.67826 12.67106 12.66318 12.65236 12.64013 12.62872 12.61818 12.60216
## 1998 12.53069 12.51796 12.51024 12.50692 12.50395 12.50209 12.50098 12.49949
## 1999 12.47991 12.47648 12.47189 12.46805 12.46381 12.45955 12.45683 12.45293
## 2000 12.43877 12.43718 12.43440 12.43081 12.42721 12.42199 12.41755 12.41227
## 2001 12.39917 12.39876 12.39958 12.39999 12.40123 12.40205 12.40287 12.40287
## 2002 12.40656 12.40901 12.41268 12.41471 12.41593 12.41836 12.42159 12.42320
## 2003 12.42802 12.43121 12.43560 12.43956 12.44193 12.44351 12.44588 12.44784
## 2004 12.45839 12.46071 12.46381 12.46805 12.47342 12.47763 12.48067 12.48408
## 2005 12.49725 12.49949 12.50247 12.50618 12.51209 12.51869 12.52634 12.53394
## 2006 12.60786 12.61653 12.61851 12.62116 12.62577 12.63036 12.63428 12.63721
##           Sep      Oct      Nov      Dec
## 1996 12.69710 12.68973 12.68819 12.68417
## 1997 12.58211 12.56654 12.55638 12.54469
## 1998 12.49762 12.49463 12.48937 12.48370
## 1999 12.44784 12.44351 12.44193 12.43996
## 2000 12.40615 12.40328 12.40369 12.40164
## 2001 12.40246 12.40246 12.40328 12.40451
## 2002 12.42440 12.42561 12.42681 12.42681
## 2003 12.44941 12.45137 12.45410 12.45683
## 2004 12.48862 12.49200 12.49500 12.49650
## 2005 12.54290 12.55391 12.56933 12.58963
## 2006

3.2 Transformation for Variance Stationarity

ZHVIdata_boxcox = state_new_york %>%
  mutate(avg_ZHVI_boxcox = forecast::BoxCox(ZHVI_AllHomes,lambda='auto'))
ZHVIdata_boxcox %>%
  ggplot()+
      geom_line(aes(Year,avg_ZHVI_boxcox))+
      theme_bw()+
      ggtitle("ZHVI value over Time (Box-Cox Transformation)")+
      ylab("ZHVI value (Box-Cox Tranformation)")+
      xlab("Date")

After the box cox transformation to resolve any variance non-stationary behavior, the graph reveals no change visually. Hence, it can be concluded that the time series was originally variance stationary and does not require any transformation for variance stabilization.

ZHVIdata_diff = state_new_york%>%
  mutate(avg_ZHVI_boxcox = forecast::BoxCox(ZHVI_AllHomes,lambda='auto')) %>%
  mutate(ZHVI_diff_box = avg_ZHVI_boxcox - lag(avg_ZHVI_boxcox)) %>%
  mutate(ZHVI_diff = ZHVI_AllHomes - lag(ZHVI_AllHomes))
ZHVIdata_diff %>%
  ggplot()+
      geom_line(aes(ZHVI_AllHomes,ZHVI_diff))+
      theme_bw()+
      ggtitle("ZHVI (First Difference)")+
      ylab("Avg ZHVI(Difference))")+
       xlab("Date")

The transformation of average price using the first difference seems to have brought the mean of the time series to 0. The next step is to test original as well as transformed variable to ascertain stationarity/non-stationarity. For this we will be using the Augmented Dicky-Fuller test and the KPSS test.

The autocorrelation function (acf()) gives the autocorrelation at all possible lags. The autocorrelation at lag 0 is included by default which always takes the value 1 as it represents the correlation between the data and themselves.

3.3 ADF and KPSS Tests for Stationarity

library(tseries)


adf.test(NY)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  NY
## Dickey-Fuller = -1.1151, Lag order = 4, p-value = 0.9167
## alternative hypothesis: stationary

kpss.test(NY)

## 
##  KPSS Test for Level Stationarity
## 
## data:  NY
## KPSS Level = 0.76198, Truncation lag parameter = 4, p-value = 0.01

For the original price variable, the ADF test has showed a p value greater than 0.05 and the KPSS test value has a p value less than 0.05. This indicates that the ZHVI value is not mean stationary.

par(mfrow = c(1,2))
acf(NY,lag.max=40)
pacf(NY,lag.max=40)

As we can infer from the acf graph above, the autocorrelation continues to exponentially decrease as the lag increases, confirming that there is no linear association between observations separated by larger lags.

pacf() at lag k is autocorrelation function which describes the correlation between all data points that are exactly k steps apart- after accounting for their correlation with the data between those k steps. It helps to identify the number of autoregression (AR) coefficients(p-value) in an ARIMA model.

3.4 ARIMA

To determine whether it is a mixed model and the order of the auto-regression and moving average elements that best fits the time series, we try a few different variants as below.

fitARIMA <- arima(NY, order=c(1,1,1),seasonal = list(order = c(1,0,0), period = 12),method="ML")
library(lmtest)
coeftest(fitARIMA)

## 
## z test of coefficients:
## 
##      Estimate Std. Error z value  Pr(>|z|)    
## ar1  0.852615   0.045789 18.6206 < 2.2e-16 ***
## ma1  0.664270   0.052247 12.7141 < 2.2e-16 ***
## sar1 0.376175   0.105187  3.5763 0.0003485 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(fitARIMA)

##          2.5 %    97.5 %
## ar1  0.7628706 0.9423589
## ma1  0.5618685 0.7666711
## sar1 0.1700133 0.5823377

3.5 Choosing the best model

I picked the model with the lowest AIC values.

mod1 = arima(NY,order=c(0,1,1))
mod2 = arima(NY,order=c(0,1,0))
mod3 = arima(NY,order=c(1,1,0))
summary(mod1)

## 
## Call:
## arima(x = NY, order = c(0, 1, 1))
## 
## Coefficients:
##          ma1
##       0.8836
## s.e.  0.0276
## 
## sigma^2 estimated as 1.232e-05:  log likelihood = 524.16,  aic = -1044.32
## 
## Training set error measures:
##                         ME        RMSE         MAE          MPE      MAPE
## Training set -0.0002381121 0.003676344 0.002659507 -0.001881159 0.0212097
##                  MASE      ACF1
## Training set 0.589279 0.6812555

summary(mod2)

## 
## Call:
## arima(x = NY, order = c(0, 1, 0))
## 
## 
## sigma^2 estimated as 3.851e-05:  log likelihood = 454.25,  aic = -906.51
## 
## Training set error measures:
##                        ME        RMSE         MAE          MPE       MAPE
## Training set -0.000546988 0.006284733 0.004578796 -0.004329216 0.03650425
##                  MASE      ACF1
## Training set 1.014544 0.9006195

summary(mod3)

## 
## Call:
## arima(x = NY, order = c(1, 1, 0))
## 
## Coefficients:
##          ar1
##       0.9303
## s.e.  0.0301
## 
## sigma^2 estimated as 4.757e-06:  log likelihood = 582.91,  aic = -1161.83
## 
## Training set error measures:
##                       ME        RMSE         MAE          MPE       MAPE
## Training set 8.11102e-05 0.002452145 0.001594108 0.0006579105 0.01270284
##                   MASE      ACF1
## Training set 0.3532137 0.2761833

We will run an auto.arima to help determine best model. Auto ARIMA says MA3 performs the best, with 2 orders of differencing Note that KPSS test suggested that a second difference may be preferred as well (still non-stationary after one difference)

mod_auto = auto.arima(NY_adjusted,stepwise=FALSE,approximation=FALSE,stationary=FALSE,max.order = 6, max.p = 7,max.q = 5)
summary(mod_auto)

## Series: NY_adjusted 
## ARIMA(2,2,0)(1,0,1)[12] 
## 
## Coefficients:
##          ar1      ar2    sar1     sma1
##       0.5601  -0.5034  0.8616  -0.6187
## s.e.  0.0810   0.0782  0.0969   0.1589
## 
## sigma^2 = 5.661e-06:  log likelihood = 608.54
## AIC=-1207.08   AICc=-1206.57   BIC=-1193.02
## 
## Training set error measures:
##                         ME        RMSE         MAE           MPE       MAPE
## Training set -5.713919e-05 0.002321414 0.001381287 -0.0004299887 0.01101248
##                    MASE       ACF1
## Training set 0.02616705 -0.0994609

Auto.arima above also reveals that the ARIMA(2,2,0) model is the best fit based on lowest AIC and BIC value. Therefore, we proceed with the model ARIMA(2,2,0) for residual and in-sample performance testing.

3.6 Residuals and In-Sample Performance Test

best_mod = arima(NY_adjusted,order=c(2,2,0))
checkresiduals(best_mod)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,2,0)
## Q* = 14.512, df = 22, p-value = 0.8823
## 
## Model df: 2.   Total lags used: 24

3.7 Box-Ljung test

It is a test of independence at all lags up to the one specified. Instead of testing randomness at each distinct lag, it tests the “overall” randomness based on a number of lags It is applied to the residuals of a fitted ARIMA model, not the original series, and in such applications the hypothesis actually being tested is that the residuals from the ARIMA model have no autocorrelation.

R code to obtain the box test results:

checkresiduals(mod_auto)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,2,0)(1,0,1)[12]
## Q* = 16.137, df = 20, p-value = 0.7081
## 
## Model df: 4.   Total lags used: 24

Box.test(mod_auto$residuals,type = 'Ljung-Box')

## 
##  Box-Ljung test
## 
## data:  mod_auto$residuals
## X-squared = 1.2665, df = 1, p-value = 0.2604

Box.test(mod_auto$residuals,type = 'Ljung-Box',lag=5)

## 
##  Box-Ljung test
## 
## data:  mod_auto$residuals
## X-squared = 6.6455, df = 5, p-value = 0.2484

par(mfrow=c(1,2))
resid = best_mod$resid
acf(resid,lag.max=20)
pacf(resid,lag.max=20)

The residuals of the selected ARIMA model visually resemble a white noise process. However, from the histogram, the residuals do not exactly follow a normal distribution.

The ACF and PACF plots show that the autocorrelation function for the series is generally very close to 0.

3.8 Forecasting using an ARIMA model

The parameters of that ARIMA model can be used as a predictive model for making forecasts for future values of the time series once the best-suited model is selected for time series data. There is a function called predict() which is used for predictions from the results of various model fitting functions. It takes an argument n.ahead() specifying how many time steps ahead to predict.

model<-predict(mod_auto,n.ahead = 24)
library(forecast)
futurVal <- forecast(mod_auto,level=c(99.5))
plot(futurVal)

The forecasts are shown as a blue line, with the 80% prediction intervals as a dark shaded area, and the 95% prediction intervals as a light shaded area.

The predicted values are increasing a little over the years. Thus, the conclusion is that the projected values look reasonable in the period which we have forecast for.

3.9 Cross Validation and Error Metrics- ARIMA

library(data.table)
xlist <-c(1,13,25,37,49,61,73,85,97)

out_list = lapply(xlist,function(x){
  train =state_new_york[x:(x+59),]
  print(train)
  test = state_new_york[(x+60):(x+60+11),]
  print(test)
 
  best_mod_train = arima(NY_adjusted,order=c(2,2,0))
  out_data = data.frame(
    train_thru = x,
    time_ahead = 1:12,
    time_point = (x+59)+ 1:12,
    actual = test$ZHVI_AllHomes,
    naive_fcst = train$ZHVI_AllHomes[x],
    
    test_pred= predict(best_mod_train,12)$pred,
   
    error = test$ZHVI_AllHomes - predict(best_mod_train,12)$pred
  )
 })
out_data = rbindlist(out_list)

out_data%>% ggplot()+
geom_point(aes(time_point,actual)) +
  geom_point(aes(time_point,test_pred,colour = factor(train_thru)))+xlab("Time Ahead") +theme_bw()

I calculated the in sample values as the rolling windows values are not accurate.

in_sample_pred = mod_auto$residuals+NY

RMSE = sqrt(mean((in_sample_pred - NY)^2,na.rm=T))
RMSE

## [1] 0.002321414

MAE = mean(abs(in_sample_pred - NY))
MAE

## [1] 0.001381287

MAPE = mean(abs((in_sample_pred - NY)/NY))
MAPE

## [1] 0.0001101245

4 Forecasting using a PROPHET model

The purpose of using Prophet is to: * Identify seasonal patterns in the data * Model “change points” — or periods of significant structural change in the data * Forecast future ZHVI values in NY using seasonal and change point parameters

In this regard, Prophet can potentially produce superior results to more traditional time series models such as ARIMA by identifying structural breaks in a time series and making forecasts by taking change points as well as seasonality patterns into account.


## Model Building


```r
train2 = na.omit(with(data,data.frame( ds = Date, y= log(ZHVI_AllHomes), rn = RegionName)))


n_train<-subset(train2, rn=="NewYork")

train = n_train %>% 
  filter(ds<ymd("2014-01-31"))

test = n_train %>%
  filter(ds>=ymd("2014-01-31"))

model = prophet(train,weekly.seasonality=FALSE, daily.seasonality=FALSE)

The model is then used to predict 24 months forward, with the predictions compared to the test set (actual values).

future = make_future_dataframe(model,periods = 24, freq="month")

# Forecasting
forecast = predict(model,future)

plot(model,forecast)+
ylab("ZHVI value for New York Housing")+ xlab("Date")+labs(title = 'Forecast: ZHVI value for New York Housing',subtitle = 'January 2008 - January 2016')+theme_bw()

4.1 Visualizing the Components

Very similar to general time series decomposition

prophet_plot_components(model,forecast)

4.2 Plotting Changepoints

plot(model,forecast)+add_changepoints_to_plot(model)+xlab("Date")+ylab("ZHVI value for New York Housing")+labs(title = 'Forecast: ZHVI value for New York Housing',subtitle = 'January 2008 - January 2016')+theme_bw()

4.3 Saturation Point

The model doesn’t need to consider the minimum/maximum point.

4.4 Seasonality

Prophet can handle multiple types of seasonality simultaneously

For example – day of week AND yearly seasonality

Yearly seasonality estimated as smooths, we can control “wiggliness”

prophet_plot_components(model,forecast)

4.5 Types- Additive and multiplicative seasonality

Additive Seasonality

Additive seasonality means there aren’t any changes to widths or heights of seasonal periods over time.

additive = prophet(train)

## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.

## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.

add_fcst = predict(additive,future)

plot(additive,add_fcst)+
ylim(12.3,12.8)+xlab("Date")+ylab("ZHVI value for New York Housing")+labs(title = 'Forecast: ZHVI value for New York Housing',subtitle = 'Additive Seasonality Model')+theme_bw()

prophet_plot_components(additive,add_fcst)

Multiplicative Seasonality

Multiplicative seasonality means there are changes to widths or heights of seasonal periods over time.

multi = prophet(train,seasonality.mode = 'multiplicative')

## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.

## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.

multi_fcst = predict(multi,future)

plot(multi,multi_fcst)+ylim(12.3,12.8)+xlab("Date")+ylab("ZHVI value for New York Housing")+labs(title = 'Forecast: ZHVI value for New York Housing',subtitle = 'Multiplicative Seasonality Model')+theme_bw()

prophet_plot_components(multi, multi_fcst)

4.6 Comparison of Additive versus Multiplicative Fit for Seasonality

mod1 = prophet(train,seasonality.mode='additive')

## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.

## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.

forecast1 = predict(mod1)
df_cv1 <- cross_validation(mod1, horizon=365,period = 365,initial = 4*365,units='days')

## Making 2 forecasts with cutoffs between 2012-01-01 and 2012-12-31

metrics1 = performance_metrics(df_cv1) %>% 
  mutate(model = 'mod1')

mod2 = prophet(train,seasonality.mode='multiplicative')

## Disabling weekly seasonality. Run prophet with weekly.seasonality=TRUE to override this.
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.

forecast2 = predict(mod2)
df_cv2 <- cross_validation(mod2, horizon=365,period = 365,initial = 4*365,units='days')

## Making 2 forecasts with cutoffs between 2012-01-01 and 2012-12-31

metrics2 = performance_metrics(df_cv2) %>% 
  mutate(model = "mod2")
metrics1 %>% 
bind_rows(metrics2) %>% 
ggplot()+
geom_line(aes(horizon,rmse,color=model)) +labs(title = 'RMSE Comparison for Additive and Multiplicative Model Fits',subtitle ='ZHVI value for New York Housing') + theme_bw()

## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

The predictions are closely following the actual values. From the plots it is clear that there is no seasonality.

4.7 Assessing Prophet Model

We calculate the error metrics to assess model performance.

forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.Date(ds)) %>% 
  filter(ds>=ymd("2014-01-31"))

RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))

MAE = mean(abs(test$y - forecast_metric_data$yhat))

MAPE = mean(abs((test$y - forecast_metric_data$yhat)/test$y))

print(paste("RMSE:",round(RMSE,2)))

## [1] "RMSE: 0.09"

print(paste("MAE:",round(MAE,2)))

## [1] "MAE: 0.07"

print(paste("MAPE:",round(MAPE,2)))

## [1] "MAPE: 0.01"

RMSE-0.09, MAE-0.07 and MAPE-0.01 are the values for the model.

The Prophet model provides an acceptable fit for the time series data, even though the frequency is at month level. The model has acceptable ranges of mean absolute percent error and rmse and has no seasonality effect, thus enabling more accurate forecasts.

4.8 Cross-Validation with Prophet

We’ve gone over the basics of each of the models as well as assessing the model fit, let’s compare how well the models predict future Housing. This cross validation is performed by assigning a rolling window in our time series. We split this window into two pieces, the “initial” time period and the “horizon”. We fit our model using the initial time period and compare our prediction of the horizon to its actual values. I picked the RMSE as my loss function in evaluating predictive performance.

A typical comparison is to compute the RMSE for each of the days in your horizon by combining all the differences between ‘y’ and ‘yhat’ from each of your rolling validations:

df.cv <- cross_validation(model, initial = 2*365, period = 365, horizon =365 , units = 'days')

## Making 4 forecasts with cutoffs between 2010-01-01 and 2012-12-31

## n.changepoints greater than number of observations. Using 22

head(df.cv)

##          y         ds     yhat yhat_lower yhat_upper     cutoff
## 1 12.49762 2010-01-31 12.50480   12.49859   12.51179 2010-01-01
## 2 12.49463 2010-02-28 12.53064   12.50755   12.55425 2010-01-01
## 3 12.48937 2010-03-31 12.48530   12.43736   12.52846 2010-01-01
## 4 12.48370 2010-04-30 12.42613   12.35321   12.49757 2010-01-01
## 5 12.47991 2010-05-31 12.35893   12.24819   12.46315 2010-01-01
## 6 12.47648 2010-06-30 12.33158   12.17839   12.48075 2010-01-01

unique(df.cv$cutoff)

## [1] "2010-01-01 GMT" "2011-01-01 GMT" "2012-01-01 GMT" "2012-12-31 GMT"

df.cv %>% 
  ggplot()+
  geom_point(aes(ds,y)) +
  geom_point(aes(ds,yhat,color=factor(cutoff)))+
  theme_bw()+
  xlab("Date")+
  ylab("ZHVI value")+
  scale_color_discrete(name = 'Cutoff')+
  theme(legend.position = "None")+
  labs(
    title = 'Forecast: ZHVI value for New York Housing ',
    subtitle = 'Cross Validation'
  )

Cross-Val Actual vs Pred

df.cv %>% 
  ggplot()+
  geom_point(aes(ds,y)) +
  geom_point(aes(ds,yhat,color=factor(cutoff)))+
  theme_bw()+
  xlab("Date")+
  ylab("ZHVI value for NY Housing")+
  scale_color_discrete(name = 'Cutoff')

4.9 Error Metrics

performance_metrics(df.cv)

##     horizon          mse       rmse         mae        mape        mdape
## 1   31 days 0.0001359082 0.01165797 0.009103807 0.000731372 0.0005576369
## 2   58 days 0.0003927932 0.01981901 0.015140145 0.001213968 0.0005893709
## 3   59 days 0.0003675844 0.01917249 0.014337326 0.001150119 0.0007183601
## 4   89 days 0.0009083815 0.03013937 0.019171640 0.001541431 0.0005624404
## 5   90 days 0.0009152155 0.03025253 0.019380475 0.001558253 0.0006218309
## 6  119 days 0.0018396065 0.04289063 0.033576909 0.002695907 0.0027648629
## 7  120 days 0.0018613754 0.04314366 0.034469471 0.002767850 0.0028968471
## 8  150 days 0.0047052348 0.06859471 0.050427795 0.004047070 0.0031201213
## 9  151 days 0.0047329537 0.06879647 0.051550584 0.004137550 0.0032444769
## 10 180 days 0.0058226772 0.07630647 0.052863429 0.004242421 0.0024943579
## 11 181 days 0.0058786975 0.07667266 0.055125564 0.004424739 0.0026463867
## 12 211 days 0.0048497738 0.06964032 0.052980908 0.004254677 0.0030662331
## 13 212 days 0.0048967530 0.06997680 0.053879069 0.004326881 0.0032364178
## 14 242 days 0.0050495423 0.07106013 0.056031370 0.004501323 0.0037208178
## 15 243 days 0.0050686195 0.07119424 0.056705520 0.004555583 0.0037476571
## 16 272 days 0.0052211294 0.07225738 0.057343788 0.004608028 0.0037559968
## 17 273 days 0.0052491075 0.07245072 0.058389183 0.004692251 0.0037571342
## 18 303 days 0.0055618857 0.07457805 0.059522083 0.004784551 0.0037267251
## 19 304 days 0.0056444492 0.07512955 0.061595540 0.004951589 0.0037628171
## 20 333 days 0.0056945320 0.07546212 0.060290013 0.004847134 0.0033217254
## 21 334 days 0.0058022483 0.07617249 0.062323642 0.005010890 0.0034190997
## 22 364 days 0.0060310664 0.07765994 0.062355625 0.005014448 0.0032138761
## 23 365 days 0.0062223591 0.07888193 0.065312807 0.005252647 0.0035028637
##           smape coverage
## 1  0.0007317284     0.00
## 2  0.0012130232     0.25
## 3  0.0011491212     0.50
## 4  0.0015443720     0.75
## 5  0.0015612161     0.75
## 6  0.0027018468     0.75
## 7  0.0027738604     0.75
## 8  0.0040622635     0.50
## 9  0.0041528339     0.75
## 10 0.0042612381     1.00
## 11 0.0044437382     1.00
## 12 0.0042703561     1.00
## 13 0.0043427121     1.00
## 14 0.0045176614     1.00
## 15 0.0045719831     1.00
## 16 0.0046249333     1.00
## 17 0.0047092465     1.00
## 18 0.0048025739     1.00
## 19 0.0049698803     1.00
## 20 0.0048655945     1.00
## 21 0.0050297006     1.00
## 22 0.0050340129     1.00
## 23 0.0052728341     1.00

plot_cross_validation_metric(df.cv, metric = 'rmse')

plot_cross_validation_metric(df.cv, metric = 'mape')

plot_cross_validation_metric(df.cv, metric = 'mae')

5 Conclusion

Out of all the models, I would probably decide on prophet model in forecasting future data.To further improve the results we would need more historical data. The problem statement of this particular project was accomplished using the forecasting methods-ARIMA and PROPHET.

THANK YOU

Forecasting and EDA on NY-ZHVI_AllHomes

Neeyati Anand