Zillow Analysis

12/8/2019

Zillow Price Analysis

Business Context: Zillow is an American online real estate marketplace that aims to empower consumers with information, helping them make informed decisions about buying, selling, and renting properties. Our aim is to predict the selling price of properties using the listings available on Zillow across multiple neighborhoods.

Problem Description: A property owner who wants to sell often faces the following issues:
- They may not be aware of the current price of the property.
- They may not know the prices of comparable properties in the area.

Original data analytics plan

Use the GetDeepSearchResults API, which requires a property's address plus its zip code or city/state, to get ZPIDs (Zillow Property IDs)
Use the GetDeepComps API to get additional ZPIDs and the property details
Use Zillow website to get address details as a secondary data source
Data cleaning using tidyr and/or dplyr
Exploratory data analysis
Models planned - Linear Regression, Decision Tree and Random Forest

Summary of Peer Comments

Inputs with high correlations (e.g., square footage and number of rooms) may cause multicollinearity - We checked for multicollinearity with the Variance Inflation Factor (VIF) function and removed the features it flagged
Strict API limits may make it difficult to collect all the data in a single file, because a single API call returns 25 records - We passed around 1,000 ZPIDs to GetDeepComps iteratively to collect the roughly 25,000 records required for the dataset
Is there a plan to improve the RMSE of the machine learning models? - To improve RMSE, we tried cross-validation and hyperparameter tuning, including tuning the parameters of the tree models

Do you plan to use models other than Linear Regression to compare results? - We used GLM, GBM, Random Forest, and Deep Learning
Are you planning to consider factors such as zip code to determine the price? - Zip code is not used directly as a predictor, but in our EDA we compared multiple zip codes to understand how property prices vary across areas
Which factors, such as timeline, will be used to predict the selling price of a property? - zest_monthlychange, zest_percentile, bedrooms, bathrooms, finishedSqFt, lastSoldPrice, lotSizeSqFt

How will you deal with outlier values? - We filtered out illogical or rare values, such as a zestimate greater than 10,000,000 or more than 5 bedrooms
Are you planning to consider the historic prices of a property in the prediction? - We used the last sold price of each property in our analysis
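The outlier rule mentioned in the answers above can be sketched in base R as follows. This is a minimal illustration, not our exact cleaning script: the data frame `props` and its values are made up, and only the column names `zestimate` and `bedrooms` come from the dataset.

```r
# Hypothetical sample; only the column names come from the dataset.
props <- data.frame(
  zestimate = c(650000, 12500000, 980000, 1500000),
  bedrooms  = c(3, 4, 7, 5)
)

# Keep rows at or below both thresholds quoted in the text.
# Rows with a missing value in either column are dropped as well.
keep <- !is.na(props$zestimate) & !is.na(props$bedrooms) &
  props$zestimate <= 10000000 & props$bedrooms <= 5
props_clean <- props[keep, ]

nrow(props_clean)
```

Filtering on explicit thresholds keeps the rule easy to audit, at the cost of excluding large or very expensive properties from the model's scope.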

Data summary

The dataset used in this assignment was collected from external data sources. To obtain valid New York addresses, we used the New York City PLUTO database.
The addresses were then used to fetch ZPIDs (Zillow Property IDs) through the GetDeepSearchResults API. The collected ZPIDs were then passed to the GetDeepComps API to fetch the final dataset.
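The second step of this pipeline, one request per ZPID with the results stacked into a single table, might look roughly like the sketch below. `fetch_comps` is a hypothetical stand-in that returns fake rows so the example runs offline; it is not the Zillow API, and the 25-record batch size is taken from the peer-comment discussion above.

```r
# Hypothetical stand-in for a GetDeepComps request: returns `count`
# fake comparable records for one ZPID. A real version would call the API.
fetch_comps <- function(zpid, count = 25) {
  data.frame(
    source_zpid = rep(zpid, count),
    comp_rank   = seq_len(count),
    stringsAsFactors = FALSE
  )
}

# Call the fetcher once per ZPID and stack the batches into one data frame.
collect_comps <- function(zpids, fetcher = fetch_comps) {
  batches <- lapply(zpids, function(id) {
    tryCatch(fetcher(id), error = function(e) NULL)  # skip failed requests
  })
  do.call(rbind, batches)
}

zpids <- c("1001", "1002", "1003", "1004")  # made-up IDs
comps <- collect_comps(zpids)
dim(comps)
```

Wrapping each request in `tryCatch` lets a long run continue when a single ZPID fails, which matters when around 1,000 requests are issued in sequence.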

The final dataset contains 24,482 observations of Zillow properties. There are 28 features describing these properties, including address, zipcode, city, state, latitude, longitude, region name, region id, type, Zestimate, Zest_lastupdated, zest_monthlyChange, Zest_percentile, Zestimate_low, Zestimate_high, compsScore, bathrooms, bedrooms, finishedSqFt, lastSoldPrice, lotSizeSqFt, taxAssessment, taxAssessmentYear, totalRooms, YearBuilt.
The final dataset was converted to CSV and saved in our Git repository.

Data Exploration

Number of Properties vs zestimate values for the properties

Relationship between the square feet and zestimate value

Relationship between distance and zestimate value

Property price distribution with respect to the number of bedrooms

Property price distribution with respect to property size

Comparing property prices across different areas of central Brooklyn, filtered by each area's zip codes
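A zip-code comparison of this kind can be sketched with a grouped summary in base R. The listings and zip codes below are made up purely for illustration; only the column names `zipcode` and `zestimate` come from the dataset.

```r
# Made-up listings; zipcode and zestimate are column names from the dataset.
listings <- data.frame(
  zipcode   = c("11216", "11216", "11238", "11238", "11238"),
  zestimate = c(900000, 1100000, 1200000, 1400000, 1600000)
)

# Median zestimate per zip code, sorted from most to least expensive.
by_zip <- aggregate(zestimate ~ zipcode, data = listings, FUN = median)
by_zip <- by_zip[order(-by_zip$zestimate), ]
by_zip
```

The median is used here rather than the mean because a few very expensive properties can otherwise dominate an area's summary.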

Machine Learning Procedure

Simple Linear Regression

lm_model1 <- lm(zestimate ~ zestimate_low+zestimate_high+
                  zest_monthlychange+zest_percentile+
                  compscore+bedrooms+bathrooms+
                  finishedSqFt+lastSoldPrice+lotSizeSqFt , 
                data = train.data)

Checking for multicollinearity

car::vif(lm_model1)

zestimate_low and zestimate_high show high Variance Inflation Factor values (VIF > 5), which indicates that they are strongly collinear with the other predictors. We therefore remove these two features and use the remaining features for modeling and prediction.
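To make the VIF threshold concrete, the sketch below computes VIF by hand on synthetic data: each predictor is regressed on all the others, and VIF is 1 / (1 - R²) of that regression. The variables `x1`, `x2`, `x3` and the helper `vif_manual` are illustrative names, not part of our analysis; for numeric predictors the values should agree with `car::vif`.

```r
set.seed(1)
n <- 200
# Synthetic predictors: x2 is almost a copy of x1, x3 is independent.
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

round(vif_manual(X), 1)
```

Here `x1` and `x2` land far above the cutoff of 5 while `x3` stays close to 1, the same pattern that flagged zestimate_low and zestimate_high.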

GLM : Generalized Linear Model

glm <- h2o.glm(family= "gaussian", x= x, y=y, 
               training_frame=train_h2o, lambda = 0, 
               compute_p_values = TRUE)

H2ORegressionMetrics: glm

RMSE: 332260.8
RMSLE: 0.2483794
R^2 : 0.7604126
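The RMSE and RMSLE values reported for each model follow the usual definitions, sketched below in base R. The helper names and the sample prices are ours, chosen for illustration. RMSE is measured in dollars, while RMSLE works on log(1 + value) and so reflects relative error.

```r
# Root mean squared error, in the units of the target (dollars here).
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Root mean squared logarithmic error: compares log(1 + value), so it
# measures relative rather than absolute differences.
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(actual) - log1p(predicted))^2))
}

# Made-up prices: both predictions are off by 100,000 dollars.
actual    <- c(500000, 5000000)
predicted <- c(600000, 5100000)

rmse(actual, predicted)
rmsle(actual, predicted)
```

Both predictions miss by the same dollar amount, but the miss on the cheaper property contributes far more to RMSLE. This is why the two metrics can rank models differently.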

Random forest - 5-Fold Cross-validation

rf <- h2o.randomForest(x = x,
                       y = y,
                       training_frame = train_h2o,
                       ntrees = 50,
                       nfolds = 5,
                       fold_assignment = "Modulo",
                       keep_cross_validation_predictions = TRUE,
                       seed = 1)

H2ORegressionMetrics: drf

RMSE: 146563.2
RMSLE: 0.05751715
R^2 : 0.9533818
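The 5-fold setup above can be illustrated without H2O. As we understand the "Modulo" option, each row is assigned to a fold by its row index modulo the number of folds, which gives a deterministic, evenly sized split. The sketch below uses synthetic data and swaps in a plain `lm` for the random forest to stay dependency-free.

```r
set.seed(1)
n <- 100
# Synthetic data standing in for the training frame.
dat <- data.frame(x = runif(n))
dat$y <- 3 * dat$x + rnorm(n, sd = 0.2)

nfolds <- 5
# "Modulo"-style assignment: row i goes to fold (i - 1) mod nfolds.
fold <- (seq_len(n) - 1) %% nfolds

# Fit on four folds, score on the held-out fold, repeat for each fold.
cv_rmse <- sapply(0:(nfolds - 1), function(k) {
  fit  <- lm(y ~ x, data = dat[fold != k, ])
  pred <- predict(fit, newdata = dat[fold == k, ])
  sqrt(mean((dat$y[fold == k] - pred)^2))
})

table(fold)
mean(cv_rmse)
```

Because the assignment depends only on row order, every model trained with the same setting sees identical folds, which keeps cross-validated predictions comparable across models.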

GBM

gbm <- h2o.gbm(x = x, y = y, training_frame = train_h2o)

H2ORegressionMetrics: gbm

RMSE: 88505.25
RMSLE: 0.07396035
R^2 : 0.9806906

GBM with parameters

gbm2 <- h2o.gbm(
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  ntrees = 10000,
  learn_rate=0.01,
  stopping_rounds = 5, 
  stopping_tolerance = 1e-4, stopping_metric = "deviance",
  sample_rate = 0.8,
  col_sample_rate = 0.8,
  seed = 1234,
  score_tree_interval = 10
)

H2ORegressionMetrics: gbm

RMSE: 66803.47
RMSLE: 0.04603479
R^2 : 0.9889991
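The tuned GBM relies on early stopping (stopping_rounds = 5, stopping_tolerance = 1e-4) to decide how many of the 10,000 allowed trees to build. The sketch below is a simplified version of a moving-average stopping rule for a lower-is-better metric. The function `should_stop` and the metric histories are hypothetical, and H2O's actual implementation may differ in detail.

```r
# Simplified early-stopping check for a metric where lower is better.
# history: validation metric after each scoring event (oldest first).
should_stop <- function(history, rounds = 5, tolerance = 1e-4) {
  if (length(history) < 2 * rounds) return(FALSE)  # not enough history yet
  # Moving averages over windows of `rounds` consecutive scores.
  ma <- stats::filter(history, rep(1 / rounds, rounds), sides = 1)
  ma <- as.numeric(ma[!is.na(ma)])
  recent    <- tail(ma, rounds)           # last `rounds` averages
  reference <- ma[length(ma) - rounds]    # average just before them
  # Stop when the best recent average improves on the reference
  # by less than the relative tolerance.
  min(recent) > reference * (1 - tolerance)
}

improving <- 100 * 0.9^(0:19)                 # keeps dropping by 10%
plateau   <- c(100 * 0.9^(0:9), rep(35, 10))  # flat after ten events

should_stop(improving)
should_stop(plateau)
```

Averaging over several scoring events makes the rule less sensitive to noise in a single validation score than comparing consecutive values directly.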

Deep Learning Model 1

m1 <- h2o.deeplearning(
  model_id = "dl_model_first",
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o, 
  epochs = 10
)

H2ORegressionMetrics: deeplearning

RMSE: 468974.5
RMSLE: 0.2071632
R^2 : 0.45784

Deep Learning Model 2

m2 <- h2o.deeplearning(
  model_id = "dl_model_faster",
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  hidden = c(32,32,32),
  epochs = 1000000, 
  score_validation_samples = 10000, 
  stopping_metric = "deviance", 
  stopping_rounds = 2, 
  stopping_tolerance = 0.01 
)

H2ORegressionMetrics: deeplearning

RMSE: 100355
RMSLE: 0.06083302
R^2 : 0.97517

Deep Learning Model 3

m3 <- h2o.deeplearning(
  model_id="dl_model_tuned",
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  overwrite_with_best_model = FALSE, 
  hidden = c(50,50,50), 
  epochs = 10,
  score_validation_samples = 100, 
  score_duty_cycle = 0.025, 
  adaptive_rate = FALSE, 
  rate = 0.01,
  rate_annealing = 2e-6,
  momentum_start = 0.2, 
  momentum_stable = 0.4,
  momentum_ramp = 1e7,
  l1 = 1e-5, 
  l2 = 1e-5,
  max_w2 = 10 
)

H2ORegressionMetrics: deeplearning

RMSE: 636930.5
RMSLE: 0.4893421
R^2: -0.01363
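A negative R² may look like an error, but it follows from the standard definition: R² = 1 - SS_res / SS_tot, which drops below zero whenever a model's squared error exceeds that of always predicting the mean. The sketch below shows this on made-up prices; the helper `r_squared` is ours.

```r
# R^2 = 1 - SS_res / SS_tot: the share of variance explained relative
# to always predicting the mean of the actual values.
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

actual <- c(400000, 600000, 800000, 1000000)  # made-up prices

baseline <- r_squared(actual, rep(mean(actual), 4))            # mean baseline
poor     <- r_squared(actual, c(900000, 500000, 900000, 500000))

baseline
poor
```

The mean baseline scores zero and the poor predictions score below it. By that reading, Deep Learning Model 3 performs slightly worse than a constant prediction on the evaluated data.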

Grid Search

hyper_params <- list(
  hidden = list(c(20, 20), c(50, 50), c(30, 30, 30), c(25, 25, 25, 25)),
  input_dropout_ratio = c(0, 0.05),
  l1 = seq(0, 1e-4, 1e-6),
  l2 = seq(0, 1e-4, 1e-6)
)
# Random search limited to at most 360 seconds or 100 models
search_criteria <- list(
  strategy = "RandomDiscrete", max_runtime_secs = 360, max_models = 100,
  seed = 1234567, stopping_rounds = 5, stopping_tolerance = 1e-2
)
dl_random_grid <- h2o.grid(
  algorithm = "deeplearning", grid_id = "dl_grid_random",
  training_frame = train_h2o, validation_frame = valid_h2o,
  x = x, y = y,
  epochs = 1, stopping_metric = "deviance",
  stopping_tolerance = 1e-2, stopping_rounds = 2,
  score_validation_samples = 10000,
  score_duty_cycle = 0.025,
  max_w2 = 10, hyper_params = hyper_params,
  search_criteria = search_criteria
)
# Rank the trained models by RMSLE, best (lowest) first
grid <- h2o.getGrid("dl_grid_random", sort_by = "rmsle", decreasing = FALSE)

The deep learning model with hidden layers [25,25,25,25] gives the best result in the grid, with the lowest RMSLE.
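The search space in this section is far too large to train exhaustively within the time budget, which is why a random strategy is used. The sketch below counts the full grid and draws a few random candidates in base R. It is a simplification: `draw_combo` is a hypothetical helper that samples each hyperparameter independently and may repeat a combination, which H2O's RandomDiscrete strategy need not do.

```r
# Same search space as the grid search in this section.
hyper_params <- list(
  hidden = list(c(20, 20), c(50, 50), c(30, 30, 30), c(25, 25, 25, 25)),
  input_dropout_ratio = c(0, 0.05),
  l1 = seq(0, 1e-4, 1e-6),
  l2 = seq(0, 1e-4, 1e-6)
)

# Size of the full Cartesian grid.
n_combos <- prod(lengths(hyper_params))

# Draw one random value per hyperparameter, independently.
draw_combo <- function(params) {
  lapply(params, function(values) values[[sample.int(length(values), 1)]])
}

set.seed(1234567)
candidates <- replicate(5, draw_combo(hyper_params), simplify = FALSE)

n_combos
str(candidates[[1]])
```

With at most 100 models allowed, the search covers only a small fraction of the grid, so the best configuration found is a good candidate rather than a guaranteed optimum.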

Result Summary

results<-data.frame(Model_Name = c("GLM", "Random Forest", "GBM",
                                   "GBM with Parameters","Deep Learning 1", 
                                   "Deep Learning 2","Deep Learning 3" ),
   RSquare = c(0.760, 0.953, 0.980, 0.988, 0.457, 0.975, -0.013),
   RMSE= c(332260.8, 146563.2, 88505.25, 66803.47, 468974.5, 100355, 636930.5),
   RMSLE = c(0.248, 0.057, 0.073, 0.046, 0.207, 0.060, 0.489))

library(kableExtra)
kable(results) %>% kable_styling(bootstrap_options = "striped") %>%
  row_spec(4, bold = TRUE, background = "#baeeb9")

Model_Name	RSquare	RMSE	RMSLE
GLM	0.760	332260.80	0.248
Random Forest	0.953	146563.20	0.057
GBM	0.980	88505.25	0.073
GBM with Parameters	0.988	66803.47	0.046
Deep Learning 1	0.457	468974.50	0.207
Deep Learning 2	0.975	100355.00	0.060
Deep Learning 3	-0.013	636930.50	0.489

Comparing the results of all the models above, the best predictions come from “GBM with parameters”. Its R² of 0.989, RMSE of 66,803.47, and RMSLE of 0.046 are better than those of any other model used to predict the property prices.