Forecasting Zillow property prices

1. Introduction

1.1 Zillow’s Business Model

Zillow Group, Inc., or simply Zillow, is a leading real estate and rental marketplace founded in 2006 and headquartered in Seattle. Zillow facilitates the buying, selling, renting, financing, and remodeling of properties, and maintains a living database of more than 110 million U.S. homes, including:

  • Homes for sale

  • Homes for rent

  • Homes not currently on the market

Zillow also lists prices for properties that are not on the market. These estimates, popularly known as Zestimates, are Zillow’s estimates of a home’s market value. The Zestimate incorporates public and user-submitted data, taking into account home facts, location, and market conditions.

1.2 What is the need of the hour?

The Zestimate is marketed as a tool designed to take the mystery out of real estate for consumers who would otherwise have to rely on brokers and guesswork. However, a Zestimate is a single point-in-time value, which by itself is a weak basis for a purchase decision.

To make an informed purchase decision, investors and buyers need to be aware of past trends and how those trends might shape the market going forward. The purpose of this project is to address this need by leveraging the information at hand to evaluate property prices up to the current time point and to forecast the anticipated market trends for a reasonable period into the future.

1.3 How do we go about it?

The database at our disposal contains zip-code-level data for properties in the US. The price points are updated only through 2017.

Our objective is to create a scalable product to forecast the prices of these properties through 2020 (with scope for additional forecasts further into the future). For the purpose of this project, we concentrate specifically on the 2-bedroom properties in New York City (NYC).

The ensuing report showcases the results for three randomly selected zip codes in NYC (10013, 10011, 10003) and gives a step-by-step walkthrough of all the processes, from cleaning and manipulating the data to forecasting and performing adequacy checks. An R Shiny app was also created to give investors a better visual and functional experience.


2. Original Data Source

The dataset has been sourced from data.world.

The dataset covers over 8,900 zip codes across the US. In this project, we concentrate only on the 2-bedroom properties listed under New York City, which sets the project scope to 25 zip codes.

# Loading the required libraries --------------------------------------------

library(fpp)          # forecasting tools; also attaches the forecast and tseries packages
library(tidyr)        # data reshaping (gather/spread)
library(dplyr)        # data manipulation
library(astsa)        # applied statistical time series analysis
library(data.table)
library(DT)
library(kableExtra)   # HTML table styling
library(knitr)        # kable()
library(stringr)
library(formattable)
library(tidyverse)
library(tidyselect)



# Load data ---------------------------------------------------------------

zillow.data <- read.csv("Zip_Zhvi_2bedroom.csv", header = TRUE)

3. Data Cleaning and Transformation

3.1 Data Cleaning

  • Upon careful consideration of the data, we decided to use data points only from 2006 onwards.

  • Checking for NA values among the selected 25 zip codes led to the removal of the zip code for Downtown Brooklyn (11201).

# Subsetting data  --------------------------------------------------------

zillow.data <- subset(zillow.data, City=='New York')
zillow.data <- subset(zillow.data, RegionID!=62012)  # Downtown Brooklyn (11201); removed since it has numerous NA's


# wide --> long -----------------------------------------------------------

zillow.long <- gather(zillow.data, Year, Price,
                      -c("RegionID","RegionName","City","State","Metro","CountyName","SizeRank"),
                      factor_key = TRUE)


# wrangling date column ---------------------------------------------------

# read.csv prefixes the date column names with "X" (e.g. "X2006.01"); drop it
zillow.long$Year <- substring(zillow.long$Year, 2)
zillow.long$month <- substr(zillow.long$Year, 6, nchar(zillow.long$Year))  # characters after "YYYY."
zillow.long$Year <- substr(zillow.long$Year, 1, 4)

# build a proper Date column, pinning each observation to the 1st of the month
zillow.long$day <- 1
zillow.long$date <- as.Date(with(zillow.long, paste(Year, month, day, sep="-")), "%Y-%m-%d")

drops <- c("Year","month","day")
zillow.long<-zillow.long[ , !(names(zillow.long) %in% drops)]


zillow.long$RegionID<-as.factor(zillow.long$RegionID)
zillow.long$RegionName<-as.factor(zillow.long$RegionName)
zillow.df<-zillow.long


# Sorting data ------------------------------------------------------------

zillow.df<-zillow.df[order(zillow.df$SizeRank),]


# Subsetting to consider data from 2006 ---------------------------------------

zillow.df<-zillow.df %>%
  group_by(RegionID) %>% 
  filter(date >= as.Date("2006-01-01"))

# Retaining only the required cols ----------------------------------------

zillow_subset<-zillow.df[,c('RegionName','Price','date')]
zillow_subset<-spread(zillow_subset,RegionName,Price)

3.2 Data Transformation

The data is then converted to a time series object, with each column representing the monthly median price series of a particular zip code. Further analysis is done using the “fpp”, “astsa”, and “tseries” packages.

Given below is a snapshot of the first three months of the series (shown transposed here for readability).

# Converting to time series --------------------------------------------------

zillow.forecast <- ts(zillow_subset, frequency = 12, start = 2006)

zillow.forecast <- zillow.forecast[,-1]   # drop the date column; one column per zip code remains

n <- dim(zillow.forecast)[2]   # number of zip-code series
zillow.all <- zillow.forecast  # can add all columns


# Printing the table ----------------------------------------------------------


kable(head(zillow.forecast,3), format = "html") %>%
        kable_styling(bootstrap_options = "striped") %>%
        column_spec(2, width = "12em")
RegionName    2006-01    2006-02    2006-03
10003         1398900    1431700    1377700
10011         1438800    1473100    1444500
10013         1702200    1701700    1730800
10014         1569000    1592600    1578800
10021         1138000    1154500    1154500
10022         1256800    1289300    1324300
10023         1333000    1356400    1362000
10025          854000     834800     821700
10028         1262800    1292600    1284900
10036         1109800    1131300    1149700
10128          966800     972300     978900
10303          267100     269500     272700
10304          267900     270000     270900
10305          341300     343200     346700
10306          321600     324600     326100
10308          355900     356700     357100
10309          336500     338100     340300
10312          321500     322600     323700
10314          302200     303900     305400
11215          498100     508200     514900
11217          509200     510000     511100
11231          501400     508400     513200
11234          383600     393900     405000
11434          323000     329000     336500

4. Exploratory Data Analysis

This section investigates the key insights and trends in three sample zip codes (10013, 10011, 10003), which further buttress the methodology used to fit an adequate time series model to every zip code in the dataset.

4.1 Identifying Homogeneous Non-Stationarity in the Data

The time series plot below indicates that the mean and variance of the median property price series vary over time and differ by zip code.

# Taking only 3 random zipcodes -------------------------------------------

zillow.3 <- zillow.forecast[, c(1,2,3)]   # columns 1-3 correspond to zip codes 10003, 10011 and 10013

plot(zillow.3)

The ACF and PACF plots are investigated and an ADF test is conducted to confirm this hypothesis. The plots in the next section are the ACF and PACF plots for the three sample zip codes, which further indicate the non-stationarity of the time series data.

Augmented Dickey-Fuller Test

Null Hypothesis : The time series is non-stationary

Alternate Hypothesis: The time series is stationary

# ADF test on the 3 sample zip codes -----------------------------------------

adf1 <- adf.test(zillow.3[,1])   # 10003
adf2 <- adf.test(zillow.3[,2])   # 10011
adf3 <- adf.test(zillow.3[,3])   # 10013

p_value       <- c(adf1$p.value, adf2$p.value, adf3$p.value)
lag_order     <- c(adf1$parameter, adf2$parameter, adf3$parameter)
dickey_fuller <- c(adf1$statistic, adf2$statistic, adf3$statistic)
zip_code      <- c("10003","10011","10013")

dickey_fuller <- data.frame(zip_code, dickey_fuller, lag_order, p_value)

kable(dickey_fuller, format = "html") %>%
        kable_styling(bootstrap_options = "striped") %>%
        column_spec(2, width = "12em")
zip_code   dickey_fuller   lag_order   p_value
10003      -1.2362355      5           0.8942708
10011      -0.7648497      5           0.9625075
10013      -2.0185709      5           0.5687385

At the 5% significance level, we fail to reject the null hypothesis for all three zip codes. Hence we conclude that the time series data is non-stationary.

4.2 Identifying Seasonality in the Time Series Data

ACF and PACF Plots

The ACF and PACF plots of the zip codes show a seasonal pattern across time lags. This indicates that a non-seasonal ARIMA model would not capture the seasonal component in the time series. Hence a seasonal ARIMA (SARIMA) model is needed to accurately forecast the median housing prices.

A key insight is that while the seasonal patterns in all three zip codes recur every 12 lags, the pattern itself differs across zip codes. This indicates that a single SARIMA model would not be an adequate fit for all zip codes in the dataset.

# Plotting ACF -------------------------------------------

par(mfrow = c(2,2))

acf(zillow.3[,1], main = "ACF of zip code: 10003")
acf(zillow.3[,2], main = "ACF of zip code: 10011")
acf(zillow.3[,3], main = "ACF of zip code: 10013")

# Plotting PACF -------------------------------------------
par(mfrow = c(2,2))
pacf(zillow.3[,1], main = "PACF of zip code: 10003")
pacf(zillow.3[,2], main = "PACF of zip code: 10011")
pacf(zillow.3[,3], main = "PACF of zip code: 10013")


5. Modeling the Seasonal Time Series Data

The following section looks into identifying the best-fit SARIMA model to predict the median housing price for each zip code. The fitted model is then tested for adequacy and the methodology is scaled to all zip codes in the dataset.

5.1 Identifying SARIMA Model Hyperparameters Using Grid Search Algorithm

The exploratory data analysis section reveals that each zip code follows a seasonal pattern every 12 periods. Hence S (number of time steps for a single seasonal period) is kept at 12. A grid search algorithm is created to identify the

  • optimal difference order (d),

  • autoregressive order (p), and

  • moving average order (q) of the data.

Further, the grid search algorithm also identifies the

  • optimal seasonal autoregressive order (P),

  • seasonal moving average order (Q), and

  • seasonal difference order (D).

The model hyperparameters that produce the lowest AIC value are chosen as the best-fit model for a particular zip code. The model selected through the grid search algorithm is then benchmarked against the auto.arima model based on the respective AIC values.

We observe that the grid search algorithm produces a better-fitting model than the auto.arima model for all three zip codes; a minimal sketch of the grid search approach follows.
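The grid search implementation itself is not shown in this report. The sketch below is a minimal illustration of the approach described above, not the actual implementation; it reuses the Arima() settings that appear later in this report (lambda = 0, no constant), and the helper name best_sarima and the grid bounds are illustrative assumptions.

# Grid search over SARIMA orders (illustrative sketch) ------------------------

best_sarima <- function(y, max.p = 2, max.d = 2, max.q = 3,
                        max.P = 1, max.D = 1, max.Q = 1) {
  # the seasonal period S = 12 is taken from the monthly frequency of y
  best <- list(aic = Inf, fit = NULL)
  for (p in 0:max.p) for (d in 0:max.d) for (q in 0:max.q)
    for (P in 0:max.P) for (D in 0:max.D) for (Q in 0:max.Q) {
      # fit each candidate (p,d,q)(P,D,Q)[12] model; skip ones that fail
      fit <- tryCatch(Arima(y, order = c(p, d, q), seasonal = c(P, D, Q),
                            include.constant = FALSE, lambda = 0),
                      error = function(e) NULL)
      if (!is.null(fit) && fit$aic < best$aic) best <- list(aic = fit$aic, fit = fit)
    }
  best$fit   # model with the lowest AIC
}

# Example usage: best_sarima(zillow.all[,1]) returns the best fit for 10003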

5.2 Residual Analysis and Model Adequacy Test

The best-fit model identified through the grid search algorithm is then tested for adequacy using residual analysis and the Ljung-Box test. The Ljung-Box test results for the three sample zip codes are shown below.

Zip code 10003

ts_10003 <- zillow.all[,1]
fit10003 <- Arima(ts_10003, order = c(2,2,3), seasonal = c(1,0,0),
                  include.constant = FALSE, lambda = 0)   # lambda = 0 fits the model on the log scale

checkresiduals(fit10003)  #diagnostics

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,2,3)(1,0,0)[12]
## Q* = 10.373, df = 18, p-value = 0.9191
## 
## Model df: 6.   Total lags used: 24

Zip code 10011

ts_10011 <- zillow.all[,2]
fit10011 <- Arima(ts_10011, order = c(2,2,3), seasonal = c(1,0,0),
                  include.constant = FALSE, lambda = 0)

checkresiduals(fit10011)  #diagnostics

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,2,3)(1,0,0)[12]
## Q* = 14.758, df = 18, p-value = 0.6785
## 
## Model df: 6.   Total lags used: 24

Zip code 10013

ts_10013 <- zillow.all[,3]
fit10013 <- Arima(ts_10013, order = c(2,2,3), seasonal = c(1,0,0),
                  include.constant = FALSE, lambda = 0)

checkresiduals(fit10013)  #diagnostics

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,2,3)(1,0,0)[12]
## Q* = 34.388, df = 18, p-value = 0.01128
## 
## Model df: 6.   Total lags used: 24

The Ljung-Box test supports the adequacy of the SARIMA models for zip codes 10003 and 10011 (p-values well above 0.05). For zip code 10013, however, the p-value (0.011) falls below 0.05, indicating some remaining autocorrelation in the residuals; its forecasts should therefore be read with more caution. The adequacy testing process is then scaled to all zip codes in the dataset (each with its own best-fit SARIMA model).

5.3 Forecasting zip code wise median property price using the selected SARIMA model

As we can observe, the forecasted median property price trends differ significantly across zip codes. Each model is forecast 36 months ahead, generating the median property price for each zip code through April 2020; a sketch of extracting the April 2020 point forecast follows the plots below.

Forecast for 10003

fit10003 %>% forecast(h=36) %>% autoplot()

Forecast for 10011

fit10011 %>% forecast(h=36) %>% autoplot()

Forecast for 10013

fit10013 %>% forecast(h=36) %>% autoplot()
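To report the single April 2020 figure for a zip code, the point forecast at that horizon can be extracted from the forecast object. This is a minimal sketch, assuming the series ends in April 2017 so that the 36-month horizon lands exactly on April 2020; the variable name is illustrative.

# Extracting the April 2020 point forecast (illustrative) ---------------------

fc10003 <- forecast(fit10003, h = 36)
window(fc10003$mean, start = c(2020, 4))   # median price point forecast, April 2020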


6. Summary of the best fit SARIMA model for all zip codes in the dataset

The table below summarizes the best-fit SARIMA model generated using the grid search and compares it with the model generated using auto.arima in R. The model highlighted in green gives the lowest AIC value. A sketch of how such a comparison table can be built is given below.
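The code that assembles this comparison is not reproduced here; the following is a minimal sketch of how such a table could be built, assuming the best_sarima() helper sketched in Section 5.1. The data frame and its column names are illustrative.

# Building the AIC comparison table (illustrative sketch) ---------------------

comparison <- data.frame(zip_code = colnames(zillow.all),
                         grid_aic = NA_real_, auto_aic = NA_real_)
for (i in seq_len(ncol(zillow.all))) {
  y <- zillow.all[, i]
  comparison$grid_aic[i] <- best_sarima(y)$aic   # grid-search model AIC
  comparison$auto_aic[i] <- auto.arima(y)$aic    # auto.arima benchmark AIC
}
comparison$lowest_aic <- ifelse(comparison$grid_aic <= comparison$auto_aic,
                                "grid search", "auto.arima")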


7. The app

An R Shiny app was developed that lets the client self-serve the delivery of key insights. Please find the link to the app here. A minimal sketch of such an app is given below.
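The app's source code is not included in this report. The following is a minimal sketch of an app of this kind, assuming the zillow.all series object from Section 3.2 and the packages loaded earlier are available; every input name, default value, and layout choice here is an illustrative assumption, not the actual app.

# Minimal Shiny app sketch (illustrative) -------------------------------------

library(shiny)

ui <- fluidPage(
  titlePanel("NYC 2-Bedroom Median Price Forecasts"),
  sidebarLayout(
    sidebarPanel(
      selectInput("zip", "Zip code", choices = colnames(zillow.all)),
      numericInput("p", "AR order (p)", value = 2, min = 0, max = 3),
      numericInput("d", "Difference order (d)", value = 2, min = 0, max = 2),
      numericInput("q", "MA order (q)", value = 3, min = 0, max = 3)
    ),
    mainPanel(plotOutput("fcPlot"))
  )
)

server <- function(input, output) {
  output$fcPlot <- renderPlot({
    y   <- zillow.all[, input$zip]                      # selected zip code series
    fit <- Arima(y, order = c(input$p, input$d, input$q),
                 seasonal = c(1, 0, 0), include.constant = FALSE, lambda = 0)
    autoplot(forecast(fit, h = 36))                     # 36-month-ahead forecast
  })
}

# shinyApp(ui, server)   # uncomment to launch the app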


8. Summary of the Analysis

The analysis forecasts the median housing price for zip codes in New York City. The initial exploratory data analysis rendered insights into the non-stationary and seasonal nature of the time series data. It also showed that a single seasonal ARIMA (SARIMA) model would not be an adequate fit for all zip codes. Hence a grid search algorithm was used to find the optimal SARIMA hyperparameters for each zip code. Each model was then fit and tested for adequacy. The resulting models performed at par with or better than the auto.arima models. Additionally, the models were used to forecast the median housing price by zip code for April 2020. Finally, an R Shiny app was developed that helps clients self-serve the delivery of key insights. The app can accommodate different hyperparameter values to generate price forecasts.
