Zillow Group, Inc., or simply Zillow, is a leading real estate and rental marketplace, launched in 2006 and headquartered in Seattle. Zillow supports buying, selling, renting, financing, and remodeling of properties, and maintains a living database of more than 110 million U.S. homes, including:
- Homes for sale
- Homes for rent
- Homes not currently on the market
Zillow also lists prices for properties that are not on the market. These estimates, popularly known as Zestimates, represent Zillow’s estimate of a home’s market value. The Zestimate incorporates public and user-submitted data, taking into account home facts, location, and market conditions.
The Zestimate is marketed as a tool designed to take the mystery out of real estate for consumers who would otherwise have to rely on brokers and guesswork. However, a Zestimate is a single point-in-time value, which is a weak basis for a purchase decision on its own.
To make an informed purchase decision, investors and buyers need to be aware of past trends and how those trends might shape the market going forward. The purpose of this project is to address this need by leveraging the information at hand to evaluate property prices up to the current time point and to understand anticipated market trends for a reasonable period into the future.
The database at our disposal contains zip-code-level data for properties in the US, with price points updated only through 2017.
Our objective is to create a scalable product to forecast the prices of these properties through 2020 (with scope for additional forecasts further into the future). For the purpose of this project, we concentrate specifically on 2-bedroom properties in New York City (NYC).
The ensuing report showcases the results for 3 randomly selected zip codes in NYC (10013, 10011, 10003) and provides a step-by-step walkthrough of all the processes, from cleaning and manipulating the data to forecasting and performing adequacy checks. An R Shiny app was also created to give investors a better visual and functional experience.
The dataset has been sourced from data.world.
The dataset covers over 8,900 zip codes across the US. In this project, we concentrate only on the 2-bedroom properties listed under New York City, which sets the project scope to 25 zip codes.
# Loading the required libraries
library(fpp)          # also attaches forecast and tseries (Arima, adf.test)
library(tidyr)
library(dplyr)
library(astsa)
library(data.table)
library(DT)
library(kableExtra)
library(knitr)
library(stringr)
library(formattable)
library(tidyverse)
library(tidyselect)
# Load data ---------------------------------------------------------------
zillow.data <- read.csv("Zip_Zhvi_2bedroom.csv", header=T)
Upon careful consideration of the data, we decided to use data points from 2006 onwards only.
Checking for NA values among the selected 25 zip codes led to the removal of the zip code for Downtown Brooklyn (11201), leaving 24 zip codes for analysis.
# Subsetting data --------------------------------------------------------
zillow.data <- subset(zillow.data, City=='New York')
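# Illustrative NA check (an assumption about layout: the first 7 columns are
# metadata and the rest are monthly prices, per the gather() call below) --
# a tally like this is what motivates dropping Downtown Brooklyn (11201)
na.counts <- rowSums(is.na(zillow.data[, -(1:7)]))
names(na.counts) <- zillow.data$RegionName
head(sort(na.counts, decreasing = TRUE), 3)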
zillow.data <- subset(zillow.data, RegionID!=62012) #removed since it has numerous NA's
# wide --> long -----------------------------------------------------------
zillow.long <- gather(zillow.data, Year, Price,
                      -c("RegionID", "RegionName", "City", "State",
                         "Metro", "CountyName", "SizeRank"),
                      factor_key = TRUE)
# Wrangling date column -----------------------------------------------------
# Gathered column names look like "X2006.01": strip the leading "X", then
# split out year and month and build a proper Date (first day of each month)
zillow.long$Year <- substring(zillow.long$Year, 2)
zillow.long$month <- substr(zillow.long$Year, 6, nchar(zillow.long$Year))
zillow.long$Year <- substr(zillow.long$Year, 1, 4)
zillow.long$day <- 1
zillow.long$date <- as.Date(with(zillow.long, paste(Year, month, day, sep = "-")), "%Y-%m-%d")
drops <- c("Year", "month", "day")
zillow.long <- zillow.long[, !(names(zillow.long) %in% drops)]
zillow.long$RegionID<-as.factor(zillow.long$RegionID)
zillow.long$RegionName<-as.factor(zillow.long$RegionName)
zillow.df<-zillow.long
# Sorting data ------------------------------------------------------------
zillow.df<-zillow.df[order(zillow.df$SizeRank),]
# Subsetting to consider data from 2006 ---------------------------------------
zillow.df <- zillow.df %>%
  group_by(RegionID) %>%
  filter(date >= as.Date("2006-01-01"))
# Retaining only the required cols ----------------------------------------
zillow_subset<-zillow.df[,c('RegionName','Price','date')]
zillow_subset<-spread(zillow_subset,RegionName,Price)
After selecting the relevant columns, the data is converted to a time series object in which each column represents the series for a particular zip code. Further analysis is done using the “fpp” and “astsa” packages.
Given below is a snapshot of the resulting series.
# Converting to time series -------------------------------------------------
zillow.forecast <- ts(zillow_subset, frequency = 12, start = 2006)
zillow.forecast <- zillow.forecast[, -1]   # drop the date column
n <- dim(zillow.forecast)[2]               # number of zip-code series
zillow.all <- zillow.forecast              # all series, used downstream
# Printing the table
kable(head(zillow.forecast, 3), format = "html") %>%
  kable_styling(bootstrap_options = "striped") %>%
  column_spec(2, width = "12em")
| 10003 | 10011 | 10013 | 10014 | 10021 | 10022 | 10023 | 10025 | 10028 | 10036 | 10128 | 10303 | 10304 | 10305 | 10306 | 10308 | 10309 | 10312 | 10314 | 11215 | 11217 | 11231 | 11234 | 11434 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1398900 | 1438800 | 1702200 | 1569000 | 1138000 | 1256800 | 1333000 | 854000 | 1262800 | 1109800 | 966800 | 267100 | 267900 | 341300 | 321600 | 355900 | 336500 | 321500 | 302200 | 498100 | 509200 | 501400 | 383600 | 323000 |
| 1431700 | 1473100 | 1701700 | 1592600 | 1154500 | 1289300 | 1356400 | 834800 | 1292600 | 1131300 | 972300 | 269500 | 270000 | 343200 | 324600 | 356700 | 338100 | 322600 | 303900 | 508200 | 510000 | 508400 | 393900 | 329000 |
| 1377700 | 1444500 | 1730800 | 1578800 | 1154500 | 1324300 | 1362000 | 821700 | 1284900 | 1149700 | 978900 | 272700 | 270900 | 346700 | 326100 | 357100 | 340300 | 323700 | 305400 | 514900 | 511100 | 513200 | 405000 | 336500 |
This section investigates key insights and trends in three sample zip codes (10013, 10011, 10003), which motivate the methodology used to fit an adequate time series model for every zip code in the dataset.
The time series plot below indicates that both the mean and the variance of the series (median property cost) vary by zip code.
# Taking the three sample zip codes (10003, 10011, 10013) --------------------
zillow.3 <- zillow.forecast[, c(1, 2, 3)]
plot(zillow.3)
The ACF and PACF plots are investigated, and an ADF test is conducted to confirm the hypothesis of non-stationarity. The plots in the next section show the ACF and PACF for the three sample zip codes, which further indicate the non-stationarity of the series.
Null hypothesis: the time series is non-stationary.
Alternative hypothesis: the time series is stationary.
# ADF tests on the three sample zip codes ------------------------------------
adf.results <- lapply(1:3, function(i) adf.test(zillow.3[, i]))
dickey_fuller <- data.frame(
  zip_code      = c("10003", "10011", "10013"),
  dickey_fuller = sapply(adf.results, function(x) x$statistic),
  lag_order     = sapply(adf.results, function(x) x$parameter),
  p_value       = sapply(adf.results, function(x) x$p.value)
)
kable(dickey_fuller, format = "html") %>%
  kable_styling(bootstrap_options = "striped") %>%
  column_spec(2, width = "12em")
| zip_code | dickey_fuller | lag_order | p_value |
|---|---|---|---|
| 10003 | -1.2362355 | 5 | 0.8942708 |
| 10011 | -0.7648497 | 5 | 0.9625075 |
| 10013 | -2.0185709 | 5 | 0.5687385 |
At the 95% confidence level, we fail to reject the null hypothesis for all three zip codes (all p-values exceed 0.05). We therefore conclude that the series are non-stationary.
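As an illustrative follow-up (not part of the original pipeline), one can difference a series and re-run the ADF test, or let forecast::ndiffs() suggest a difference order:
# ADF test on the first difference of zip code 10003 (illustrative)
adf.test(diff(zillow.3[, 1]))
# Estimated number of differences needed, per forecast::ndiffs()
ndiffs(zillow.3[, 1])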
ACF and PACF Plots
The ACF and PACF plots of the zip codes show a seasonal pattern across time lags. This indicates that a non-seasonal ARIMA model would not capture the seasonal component in the series; a seasonal ARIMA (SARIMA) model is therefore needed to accurately forecast the median housing prices.
A key insight is that while the seasonal pattern recurs every 12 lags in all three zip codes, the pattern itself differs by zip code. This indicates that a single SARIMA specification would not be an adequate fit for all zip codes in the dataset.
# Plotting ACF -------------------------------------------
par(mfrow = c(2,2))
acf(zillow.3[,1], main = "ACF of zip code: 10003")
acf(zillow.3[,2], main = "ACF of zip code: 10011")
acf(zillow.3[,3], main = "ACF of zip code: 10013")
# Plotting PACF -------------------------------------------
par(mfrow = c(2,2))
pacf(zillow.3[,1], main = "PACF of zip code: 10003")
pacf(zillow.3[,2], main = "PACF of zip code: 10011")
pacf(zillow.3[,3], main = "PACF of zip code: 10013")
The following section looks into identifying the best fit SARIMA model to predict the median housing price of a zip code. The fitted model is then tested for adequacy and the methodology is scaled to all zip codes in the data set.
The exploratory data analysis reveals that each zip code follows a seasonal pattern every 12 periods, so S (the number of time steps in a single seasonal period) is kept at 12. A grid search algorithm is created to identify the optimal

- difference order (d),
- autoregressive order (p), and
- moving average order (q),

along with the optimal seasonal

- autoregressive order (P),
- moving average order (Q), and
- difference order (D).

The hyperparameter combination that produces the lowest AIC value is chosen as the best fit model for a particular zip code; a sketch of such a grid search is given below. The model selected through the grid search is then benchmarked against the auto.arima model based on their respective AIC values.
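The sketch below is illustrative only: the search ranges and the helper name grid.search.sarima are assumptions (the exact grid used in the project is not shown here), and Arima() comes from the forecast package, which fpp attaches.
# Illustrative SARIMA grid search; keeps the hyperparameters with lowest AIC
grid.search.sarima <- function(y, S = 12,
                               p = 0:2, d = 0:2, q = 0:3,
                               P = 0:1, D = 0:1, Q = 0:1) {
  grid <- expand.grid(p = p, d = d, q = q, P = P, D = D, Q = Q)
  best.fit <- NULL
  best.aic <- Inf
  for (i in seq_len(nrow(grid))) {
    g <- grid[i, ]
    fit <- tryCatch(
      Arima(y, order = c(g$p, g$d, g$q),
            seasonal = list(order = c(g$P, g$D, g$Q), period = S),
            include.constant = FALSE, lambda = 0),
      error = function(e) NULL)  # skip combinations that fail to converge
    if (!is.null(fit) && fit$aic < best.aic) {
      best.aic <- fit$aic        # keep the lowest-AIC fit seen so far
      best.fit <- fit
    }
  }
  best.fit
}
# e.g. grid.search.sarima(zillow.all[, 1]) for zip code 10003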
We observe that the grid search produces a better-fitting model than auto.arima for all three zip codes.
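For reference, the auto.arima benchmark for a single zip code can be obtained as below (a sketch; lambda = 0 is assumed here to match the log transform used in the fitted models shown in the next section):
# auto.arima benchmark for zip code 10003 (illustrative)
auto10003 <- auto.arima(zillow.all[, 1], lambda = 0)
AIC(auto10003)   # compare against the AIC of the grid search model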
The best fit model identified through the grid search is then tested for adequacy using residual analysis and the Ljung-Box test. The Ljung-Box test results for the three sample zip codes are shown below.
ts_10003 <-zillow.all[,1]
fit10003 <- Arima(ts_10003, order=c(2,2,3), seasonal=c(1,0,0),include.constant = FALSE,lambda=0)
checkresiduals(fit10003) #diagnostics
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,2,3)(1,0,0)[12]
## Q* = 10.373, df = 18, p-value = 0.9191
##
## Model df: 6. Total lags used: 24
ts_10011 <-zillow.all[,2]
fit10011 <- Arima(ts_10011, order=c(2,2,3), seasonal=c(1,0,0),include.constant = FALSE,lambda=0)
checkresiduals(fit10011) #diagnostics
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,2,3)(1,0,0)[12]
## Q* = 14.758, df = 18, p-value = 0.6785
##
## Model df: 6. Total lags used: 24
ts_10013 <-zillow.all[,3]
fit10013 <- Arima(ts_10013, order=c(2,2,3), seasonal=c(1,0,0),include.constant = FALSE,lambda=0)
checkresiduals(fit10013) #diagnostics
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,2,3)(1,0,0)[12]
## Q* = 34.388, df = 18, p-value = 0.01128
##
## Model df: 6. Total lags used: 24
The Ljung-Box tests support the adequacy of the models for zip codes 10003 and 10011 (large p-values, so the residuals show no significant autocorrelation). For zip code 10013, the p-value of 0.011 indicates some remaining residual autocorrelation at the 5% level, although the model passes at the 1% level. The adequacy testing process is then scaled to all zip codes in the dataset (each with its own best fit SARIMA model).
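A compact way to scale this check is sketched below; fits is an assumed list holding the best fit model for each zip code, and lag = 24 with fitdf = 6 mirrors the checkresiduals() output above (fitdf should match each model's parameter count).
# Ljung-Box p-values across all fitted models (illustrative sketch)
sapply(fits, function(f)
  Box.test(residuals(f), lag = 24, type = "Ljung-Box", fitdf = 6)$p.value)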
As the plots below show, the forecasted median property price trends differ significantly across zip codes. The point forecast for each zip code is generated for April 2020.
fit10003 %>% forecast(h=36) %>% autoplot()
fit10011 %>% forecast(h=36) %>% autoplot()
fit10013 %>% forecast(h=36) %>% autoplot()
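To extract the April 2020 point forecast from one of these fits (a sketch; it assumes the observed series ends in 2017, so that a 36-month horizon covers April 2020):
# April 2020 point forecast for zip code 10003 (illustrative)
fc10003 <- forecast(fit10003, h = 36)
window(fc10003$mean, start = c(2020, 4), end = c(2020, 4))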
The table below summarizes the best fit SARIMA model generated through the grid search and compares it with the model generated using auto.arima in R. The model highlighted in green has the lowest AIC value.
An R Shiny app was developed that helps the client self-serve the delivery of key insights. Please find the link to the app here
The analysis forecasts the median housing price for 2-bedroom properties across New York City zip codes. The initial exploratory data analysis rendered insights into the non-stationary and seasonal nature of the series. We also found that a single SARIMA specification would not be an adequate fit for all zip codes, so a grid search algorithm was used to find the optimal SARIMA hyperparameters for each zip code. Each model was then fit and tested for adequacy. The resulting models performed on par with or better than the auto.arima models. The models were used to forecast the median housing price by zip code through April 2020. Finally, an R Shiny app was developed that helps clients self-serve the delivery of key insights; the app can accommodate different hyperparameter values to generate price forecasts.