12/8/2019

Zillow Price Analysis

Business Context: Zillow is an American online real estate marketplace that aims to give consumers the information they need to make informed decisions about buying, selling and renting properties. Our aim is to predict the selling price of properties in multiple neighborhoods using the listings available on Zillow.

Problem Description: A property owner who wants to sell often faces the following issues:

  • They may not be aware of the current price of the property
  • They may not know the prices of comparable properties in the area

Original data analytics plan

  • Use the GetDeepSearchResults API, which takes the address and the zipcode or city/state of a property, to get ZPIDs (Zillow Property IDs)
  • Use the GetDeepComps API to get multiple ZPIDs and the property details
  • Use Zillow website to get address details as a secondary data source
  • Data cleaning using tidyr and/or dplyr
  • Exploratory data analysis
  • Models planned - Linear Regression, Decision Tree and Random Forest

Summary of Peer Comments

  • May run into multicollinearity because your inputs may have high correlations (e.g. square footage and number of rooms) - We checked for multicollinearity with the VIF (Variance Inflation Factor) function and removed the affected features
  • Tough API regulations may make it difficult to get all the data into a single file, because a single API call returns only 25 records - We passed around 1,000 ZPIDs to GetDeepComps in a loop to collect the roughly 25,000 records required for the dataset
  • Any plan to improve the RMSE score for your machine learning models? - To improve RMSE, we tried cross-validation and hyperparameter tuning, including tuning the parameters of the tree models

  • Do you plan to use models other than Linear Regression to compare results? - We used GLM, GBM, Random Forest and Deep Learning models
  • Are you planning to consider factors such as zip code to determine the price? - Zip code is not used directly as a predictor, but in our EDA we compared multiple zip codes to understand how property prices vary across areas
  • What factors, such as timeline, will be used to predict the selling price of a property? - We used zest_monthlychange, zest_percentile, bedrooms, bathrooms, finishedSqFt, lastSoldPrice and lotSizeSqFt

  • How will you deal with outlier values? - We filtered out illogical or rare values, such as properties with a zestimate greater than 10,000,000 or more than 5 bedrooms
  • Are you planning to consider the historic prices of a property while doing the prediction? - We used the last sold price of each property in our analysis
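
The outlier rule described above can be sketched in base R as follows. The toy data frame and its values are invented for illustration; only the column names `zestimate` and `bedrooms` and the two thresholds come from this report.

```r
# Toy data standing in for the real dataset (values are invented).
props <- data.frame(
  zestimate = c(650000, 12500000, 980000, 720000),
  bedrooms  = c(3, 4, 7, 5)
)

# Keep rows with a zestimate of at most 10,000,000 and at most 5 bedrooms.
clean <- subset(props, zestimate <= 1e7 & bedrooms <= 5)
nrow(clean)
```

Note that `subset()` also drops rows where the condition evaluates to NA, so properties with a missing zestimate or bedroom count would be removed by the same step.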

Data summary

  • The dataset used in this assignment was built from external data sources. To collect valid New York addresses, we used the New York City PLUTO database.

  • The addresses were then used to fetch ZPIDs (Zillow Property IDs) through the GetDeepSearchResults API. The collected ZPIDs were then passed to the GetDeepComps API to fetch the final dataset.

  • The final dataset contains 24,482 observations of Zillow properties. There are 28 features describing these properties, including address, zipcode, city, state, latitude, longitude, region name, region id, type, Zestimate, Zest_lastupdated, zest_monthlyChange, Zest_percentile, Zestimate_low, Zestimate_high, compsScore, bathrooms, bedrooms, finishedSqFt, lastSoldPrice, lotSizeSqFt, taxAssessment, taxAssessmentYear, totalRooms, YearBuilt.

  • The final dataset was converted to CSV and saved in our Git repository.
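
The collection loop can be sketched roughly as below. `fetch_comps()` is a hypothetical stand-in for one GetDeepComps request (the real call needs an API key and network access), and the ZPIDs are invented. The sketch only shows how 25-record batches are stacked into a single table.

```r
# Hypothetical stand-in for one GetDeepComps request: returns `count`
# comparable properties for a seed ZPID. The contents are placeholders.
fetch_comps <- function(zpid, count = 25) {
  data.frame(
    seed_zpid = zpid,
    comp_rank = seq_len(count),
    stringsAsFactors = FALSE
  )
}

seed_zpids <- c("1001", "1002", "1003", "1004")  # invented IDs

# One request per seed ZPID, then stack the batches row-wise.
batches  <- lapply(seed_zpids, fetch_comps)
all_rows <- do.call(rbind, batches)

dim(all_rows)
```

With around 1,000 seed ZPIDs the same loop would yield roughly 25,000 rows. Different seeds may return the same comparable property, so a deduplication step such as `unique()` may be needed on real data.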

Data Exploration

Number of Properties vs zestimate values for the properties

Relationship between the square feet and zestimate value

Relationship between distance and zestimate value

Property price distribution with respect to the number of bedrooms

Property price distribution with respect to property size

Comparing property prices across different areas of central Brooklyn, filtered by the zipcodes of each area

Machine Learning Procedure

Simple Linear Regression

# Fit a linear model of zestimate on the candidate predictors
lm_model1 <- lm(zestimate ~ zestimate_low + zestimate_high +
                  zest_monthlychange + zest_percentile +
                  compscore + bedrooms + bathrooms +
                  finishedSqFt + lastSoldPrice + lotSizeSqFt,
                data = train.data)

To check for multicollinearity

car::vif(lm_model1)

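zestimate_low and zestimate_high show high Variance Inflation Factor values (i.e. VIF > 5), which indicates that they are strongly collinear with the other predictors. They are also the low and high ends of the Zestimate valuation range, so they depend directly on the value we are predicting. Hence, we remove these two features and use the remaining features for modeling and prediction.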
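
For reference, the VIF of a predictor j is 1 / (1 - R^2_j), where R^2_j is obtained by regressing predictor j on all the other predictors. The sketch below computes it by hand on simulated data. The column names are borrowed from this report, but the values and the `vif_manual()` helper are illustrative assumptions, not the project's data or code.

```r
set.seed(1)
n <- 200
finishedSqFt   <- rnorm(n, mean = 1500, sd = 400)
bedrooms       <- rnorm(n, mean = 3, sd = 1)
zestimate_low  <- 800000 + 100 * finishedSqFt + rnorm(n, sd = 150000)
zestimate_high <- 1.1 * zestimate_low + rnorm(n, sd = 5000)
X <- data.frame(finishedSqFt, bedrooms, zestimate_low, zestimate_high)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the rest.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(X)
round(vifs, 1)
names(vifs)[vifs > 5]   # predictors flagged by the VIF > 5 rule
```

Because the two simulated bounds are almost linear functions of each other, each is nearly perfectly explained by the other and both receive a very large VIF, while the unrelated predictors stay close to 1.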

GLM : Generalized Linear Model

# Gaussian GLM; lambda = 0 turns off regularization so that
# p-values can be computed
glm <- h2o.glm(family = "gaussian", x = x, y = y,
               training_frame = train_h2o, lambda = 0,
               compute_p_values = TRUE)

H2ORegressionMetrics: glm

  • RMSE: 332260.8
  • RMSLE: 0.2483794
  • R^2 : 0.7604126
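
The three metrics reported for each model can be written out in a few lines of base R. This is a sketch using the standard textbook definitions; H2O's internal computation may differ in detail, and the prices below are invented.

```r
rmse  <- function(actual, pred) sqrt(mean((actual - pred)^2))
rmsle <- function(actual, pred) sqrt(mean((log1p(actual) - log1p(pred))^2))
r2    <- function(actual, pred) {
  1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
}

actual <- c(500000, 750000, 1200000)   # invented prices
pred   <- c(520000, 700000, 1250000)   # invented predictions

c(RMSE = rmse(actual, pred), RMSLE = rmsle(actual, pred), R2 = r2(actual, pred))

# A model that predicts worse than the mean of the actual values
# gets a negative R^2.
r2(actual, rev(pred))
```

RMSE is expressed in the same units as the price, so it is dominated by expensive properties, whereas RMSLE measures relative error and treats a 10% miss on a cheap and on an expensive property alike.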

Random forest - 5-Fold Cross-validation

rf <- h2o.randomForest(x = x,
                       y = y,
                       training_frame = train_h2o,
                       ntrees = 50,
                       nfolds = 5,
                       fold_assignment = "Modulo",
                       keep_cross_validation_predictions = TRUE,
                       seed = 1)

H2ORegressionMetrics: drf

  • RMSE: 146563.2
  • RMSLE: 0.05751715
  • R^2 : 0.9533818
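
With `fold_assignment = "Modulo"`, rows are assigned to folds by their row index modulo `nfolds`, so the split is deterministic and the folds are evenly sized. The sketch below reproduces that idea in base R. It uses `lm()` as a stand-in for the random forest and simulated data, both of which are assumptions made only to keep the example self-contained.

```r
set.seed(1)
n <- 103
d <- data.frame(finishedSqFt = runif(n, 500, 3000))
d$zestimate <- 300 * d$finishedSqFt + rnorm(n, sd = 50000)

nfolds <- 5
fold <- (seq_len(n) - 1) %% nfolds   # 0,1,2,3,4,0,1,... by row position

# Train on four folds, score on the held-out fold, repeat for each fold.
cv_rmse <- sapply(0:(nfolds - 1), function(k) {
  fit  <- lm(zestimate ~ finishedSqFt, data = d[fold != k, ])
  pred <- predict(fit, newdata = d[fold == k, ])
  sqrt(mean((d$zestimate[fold == k] - pred)^2))
})

table(fold)     # fold sizes differ by at most one row
mean(cv_rmse)   # cross-validated RMSE
```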

GBM

gbm <- h2o.gbm(x = x, y = y, training_frame = train_h2o)

H2ORegressionMetrics: gbm

  • RMSE: 88505.25
  • RMSLE: 0.07396035
  • R^2 : 0.9806906

GBM with parameters

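gbm2 <- h2o.gbm(
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  ntrees = 10000,               # upper bound; early stopping may end training sooner
  learn_rate = 0.01,
  stopping_rounds = 5,          # early stopping settings
  stopping_tolerance = 1e-4,
  stopping_metric = "deviance",
  sample_rate = 0.8,            # fraction of rows sampled per tree
  col_sample_rate = 0.8,        # fraction of columns sampled per split
  seed = 1234,
  score_tree_interval = 10      # score the validation set every 10 trees
)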

H2ORegressionMetrics: gbm

  • RMSE: 66803.47
  • RMSLE: 0.04603479
  • R^2 : 0.9889991
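
The early stopping settings used here (`stopping_rounds = 5`, `stopping_tolerance = 1e-4`, scoring every 10 trees) end training once the validation deviance stops improving. The sketch below illustrates the idea with a simplified rule: stop when the best value in the last `rounds` scoring events fails to beat the earlier best by a relative `tol`. H2O's own rule compares moving averages of the metric, so this is an approximation for intuition only, and the `stop_event()` helper and deviance values are invented.

```r
# Returns the index of the scoring event at which training would stop,
# or NA if the rule never fires. Lower metric values are better.
stop_event <- function(metric, rounds = 5, tol = 1e-4) {
  for (i in seq_along(metric)) {
    if (i <= rounds) next
    best_before <- min(metric[1:(i - rounds)])
    best_recent <- min(metric[(i - rounds + 1):i])
    # Stop when the recent window fails to beat the earlier best by `tol`.
    if (best_recent > best_before * (1 - tol)) return(i)
  }
  NA_integer_
}

# Invented validation deviance, scored every 10 trees: improves, then plateaus.
deviance <- c(100, 80, 65, 55, 50, 48, 47.5, 47.5, 47.6, 47.5, 47.6, 47.7, 47.6)
ev <- stop_event(deviance)
ev * 10   # trees built at the stopping point, given score_tree_interval = 10
```

This is why `ntrees = 10000` acts only as an upper bound: with a small learning rate the model keeps adding trees until the validation metric flattens out.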

Deep Learning Model 1

m1 <- h2o.deeplearning(
  model_id = "dl_model_first",
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o, 
  epochs = 10
)

H2ORegressionMetrics: deeplearning

  • RMSE: 468974.5
  • RMSLE: 0.2071632
  • R^2 : 0.45784

Deep Learning Model 2

m2 <- h2o.deeplearning(
  model_id = "dl_model_faster",
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  hidden = c(32,32,32),
  epochs = 1000000, 
  score_validation_samples = 10000, 
  stopping_metric = "deviance", 
  stopping_rounds = 2, 
  stopping_tolerance = 0.01 
)

H2ORegressionMetrics: deeplearning

  • RMSE: 100355
  • RMSLE: 0.06083302
  • R^2 : 0.97517

Deep Learning Model 3

m3 <- h2o.deeplearning(
  model_id="dl_model_tuned",
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  overwrite_with_best_model = FALSE,
  hidden = c(50,50,50), 
  epochs = 10,
  score_validation_samples = 100, 
  score_duty_cycle = 0.025, 
  adaptive_rate = FALSE,
  rate = 0.01,
  rate_annealing = 2e-6,
  momentum_start = 0.2, 
  momentum_stable = 0.4,
  momentum_ramp = 1e7,
  l1 = 1e-5, 
  l2 = 1e-5,
  max_w2 = 10 
)

H2ORegressionMetrics: deeplearning

  • RMSE: 636930.5
  • RMSLE: 0.4893421
  • R^2: -0.01363

Grid Search

Result Summary

results <- data.frame(
  Model_Name = c("GLM", "Random Forest", "GBM",
                 "GBM with Parameters", "Deep Learning 1",
                 "Deep Learning 2", "Deep Learning 3"),
  RSquare = c(0.760, 0.953, 0.980, 0.988, 0.457, 0.975, -0.013),
  RMSE = c(332260.8, 146563.2, 88505.25, 66803.47, 468974.5, 100355, 636930.5),
  RMSLE = c(0.248, 0.057, 0.073, 0.046, 0.207, 0.060, 0.489))

library(knitr)       # kable()
library(kableExtra)  # kable_styling(), row_spec()
# Render the table and highlight row 4 (GBM with Parameters)
kable(results) %>%
  kable_styling(bootstrap_options = "striped") %>%
  row_spec(4, bold = TRUE, background = "#baeeb9")

Model_Name            RSquare       RMSE   RMSLE
GLM                     0.760  332260.80   0.248
Random Forest           0.953  146563.20   0.057
GBM                     0.980   88505.25   0.073
GBM with Parameters     0.988   66803.47   0.046
Deep Learning 1         0.457  468974.50   0.207
Deep Learning 2         0.975  100355.00   0.060
Deep Learning 3        -0.013  636930.50   0.489

Comparing the results of all the models above, we find that the best predictions come from the model “GBM with parameters”. In the summary table it achieves R^2 = 0.988, RMSE = 66803.47 and RMSLE = 0.046, which is better than any of the other models used for predicting the property prices.
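
The same comparison can be made programmatically rather than by eye. The following is a small sketch that ranks the models by the RMSE values reported in this document. The choice of RMSE as the ranking metric is an assumption for illustration.

```r
results <- data.frame(
  Model_Name = c("GLM", "Random Forest", "GBM", "GBM with Parameters",
                 "Deep Learning 1", "Deep Learning 2", "Deep Learning 3"),
  RMSE = c(332260.8, 146563.2, 88505.25, 66803.47, 468974.5, 100355, 636930.5),
  stringsAsFactors = FALSE
)

# Rank models from lowest to highest RMSE and pick the first.
ranked <- results[order(results$RMSE), ]
best   <- ranked$Model_Name[1]
best
```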