Zillow House Price Prediction and Time Series Forecasting for Investment Opportunities

In this project, we intend to cover the following topics as part of our analysis: (1) the house price index transitions across states are generally correlated; (2) since we have house price data for multiple years across states, we can use it to identify trends in the House Price Index over time and, from those, identify the states that are highly correlated with one another.

Every time there is a drop in the correlation, it gives rise to investment opportunities.
Once trends and investment opportunities have been identified in the historical values, the same analysis can be applied to the forecasted values, giving us an advance idea of suitable investment opportunities.

Let’s load a few required libraries to run our data processing and fit our machine learning models.

library(tibble)
library(knitr)
library(dplyr)
library(tidyverse)
library(reshape)
library(lubridate)
library(zoo)
library(forecast)
library(plotly)
library(ggfortify)
library(timeSeries)
library(RColorBrewer)
library(corrplot)
library(caret)
library(gbm)
library(h2o)
h2o.init(nthreads = -1)
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     /var/folders/0y/khcr0pd55bx60q1zzqy4d0wh0000gn/T//RtmpcuCoG5/h2o_lalithab_started_from_r.out
##     /var/folders/0y/khcr0pd55bx60q1zzqy4d0wh0000gn/T//RtmpcuCoG5/h2o_lalithab_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: .. Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 seconds 360 milliseconds 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.22.1.1 
##     H2O cluster version age:    4 months and 3 days !!! 
##     H2O cluster name:           H2O_started_from_R_lalithab_lzn763 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.78 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.5.2 (2018-12-20)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 3 days)!
## Please download and install the latest version from http://h2o.ai/download/

### Reading the input data

zill <- as_tibble(read.csv("State_MedianValuePerSqft_AllHomes.csv", stringsAsFactors = FALSE))
df <- read.csv('./ZillowApi_MasterData.csv')

#### Dimension of the dataset for time series analysis

dim(zill)
## [1]  50 279

#### Dimension of the dataset for house price predictions

dim(df)
## [1] 38021    21

#### Basic data manipulation and null validations

We’ll follow tidy data principles to clean and transpose our data in order to structure it to better fit the algorithms.

# Strip the leading "X" that read.csv prepends to the year-month column names
names(zill)[4:279] <- gsub("X", "", colnames(zill[,4:279]))
sum(is.na(zill))
## [1] 107

We’ll create a dataframe with the required columns: date and median.

zill_df <- zill %>% gather(date, median, 4:279, na.rm = TRUE)
zill_df <- arrange(zill_df, date)

Converting the entire dataset to a time series object. This allows us to perform time series analysis on the data. For that, we’ll leverage the zoo library to convert the date column to a Date type.

zill_df$date <- gsub("[:.:]", "-", zill_df$date)   # "1996.04" -> "1996-04"
zill_df$date <- as.Date(as.yearmon(zill_df$date))  # -> Date, first day of the month
head(zill_df)
## # A tibble: 6 x 5
##   RegionID RegionName   SizeRank date       median
##      <int> <chr>           <int> <date>      <int>
## 1        9 California          1 1996-04-01    103
## 2       54 Texas               2 1996-04-01     53
## 3       43 New York            3 1996-04-01     72
## 4       14 Florida             4 1996-04-01     55
## 5       21 Illinois            5 1996-04-01     86
## 6       47 Pennsylvania        6 1996-04-01     57

Filtering the data to create the time series object and checking for nulls that might have been introduced along the way.

zill_ts_df <- zill_df %>% select(RegionName,date, median)
sum(is.na(zill_ts_df))
## [1] 0

Let’s get into the exploratory data analysis.

The plot below shows the transition of the median value per square foot across all the states in the USA. From it we can see that house prices per square foot across the states follow each other and are highly correlated. There are strong trends which coincide with known major events that impact housing prices in general. The trends in the past can be used to forecast future values by incorporating the trend and seasonality of the data, and this can be used to identify suitable investment opportunities.

pg <- plot_ly(zill_df, x = ~date, y = ~median, color = ~RegionName, type = 'scatter', mode = 'lines')
pg
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
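The palette warnings appear because Set2 only has 8 colors while we are plotting 50 states. As a hedged aside (not part of the original analysis), one way to avoid them is to interpolate a larger palette with colorRampPalette and pass it to plot_ly explicitly:

# Hypothetical fix: build a palette large enough for all states instead of recycling Set2
many_cols <- colorRampPalette(brewer.pal(8, "Set2"))(length(unique(zill_df$RegionName)))
plot_ly(zill_df, x = ~date, y = ~median, color = ~RegionName, colors = many_cols,
        type = 'scatter', mode = 'lines')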

Next, we’ll visualize a correlation matrix. This will help us to identify how correlated the prices across all the states are.

zill_allstate <- spread(zill_ts_df, RegionName, median)
#check for nulls
sum(is.na(zill_allstate))
## [1] 107
#log transformation
zill_allstate[,2:51] <- log(zill_allstate[,2:51]+1)  # log-transform all 50 state columns (column 1 is the date)



head(zill_allstate)
## # A tibble: 6 x 51
##   date       Alabama Alaska Arizona Arkansas California Colorado
##   <date>       <dbl>  <dbl>   <dbl>    <dbl>      <dbl>    <dbl>
## 1 1996-04-01    3.81   4.28    4.13     3.69       4.64     4.47
## 2 1996-05-01    3.81   4.29    4.13     3.69       4.64     4.47
## 3 1996-06-01    3.81   4.29    4.13     3.69       4.64     4.47
## 4 1996-07-01    3.81   4.29    4.13     3.69       4.64     4.47
## 5 1996-08-01    3.81   4.29    4.13     3.69       4.64     4.47
## 6 1996-09-01    3.83   4.30    4.14     3.66       4.64     4.47
## # … with 44 more variables: Connecticut <dbl>, Delaware <dbl>, `District
## #   of Columbia` <dbl>, Florida <dbl>, Georgia <dbl>, Hawaii <dbl>,
## #   Idaho <dbl>, Illinois <dbl>, Indiana <dbl>, Iowa <dbl>, Kansas <dbl>,
## #   Kentucky <dbl>, Louisiana <dbl>, Maine <dbl>, Maryland <dbl>,
## #   Massachusetts <dbl>, Michigan <dbl>, Minnesota <dbl>,
## #   Mississippi <dbl>, Missouri <dbl>, Montana <dbl>, Nebraska <dbl>,
## #   Nevada <dbl>, `New Hampshire` <dbl>, `New Jersey` <dbl>, `New
## #   Mexico` <dbl>, `New York` <dbl>, `North Carolina` <dbl>, `North
## #   Dakota` <dbl>, Ohio <dbl>, Oklahoma <dbl>, Oregon <dbl>,
## #   Pennsylvania <dbl>, `Rhode Island` <dbl>, `South Carolina` <dbl>,
## #   `South Dakota` <dbl>, Tennessee <dbl>, Texas <dbl>, Utah <dbl>,
## #   Virginia <dbl>, Washington <dbl>, `West Virginia` <dbl>,
## #   Wisconsin <dbl>, Wyoming <int>
#drop nulls
zill_allstate <- zill_allstate %>% drop_na()
sum(is.na(zill_allstate))
## [1] 0
cor <- cor(zill_allstate[,-1])
corrplot(cor, method="shade",
         tl.col = "black", tl.srt = 45, tl.cex = 0.75,cl.cex = 0.75,
         sig.level = 0.01, insig = "pch", pch.cex = 0.8
         )
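To operationalize the idea that a drop in correlation flags an investment opportunity, here is a minimal sketch (not part of the original analysis) of a 24-month rolling correlation between the month-over-month changes of two states, using the log-transformed zill_allstate frame from above; dips in this curve mark periods where the two markets diverge:

# Hypothetical sketch: 24-month rolling correlation of monthly changes, Georgia vs Indiana
ga_in <- as.matrix(zill_allstate[, c("Georgia", "Indiana")])
ga_in_chg <- diff(ga_in)   # month-over-month changes in log median value
roll_cor <- rollapply(ga_in_chg, width = 24,
                      FUN = function(w) cor(w[, 1], w[, 2]),
                      by.column = FALSE, align = "right")
plot(roll_cor, type = "l", xlab = "Window end (months since start)",
     ylab = "Rolling correlation, GA vs IN")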

Filtering the data further to extract the time series transition for Georgia

zillow_ga <- zill_ts_df[zill_ts_df$RegionName == 'Georgia',]
zillow_ga$median <- log10(zillow_ga$median)
# head(zillow_ga)
# nrow(zillow_ga)

Filtering the data further to extract the time series transition for Indiana

zillow_in <- zill_ts_df[zill_ts_df$RegionName == 'Indiana',]
zillow_in$median <- log10(zillow_in$median)
# head(zillow_in)

Basic Time Series Analysis for Georgia to Date

ga_ts <- ts(zillow_ga$median, start = as.yearmon(zillow_ga$date[1]), freq = 12)
head(ga_ts)
##           Apr      May      Jun      Jul      Aug      Sep
## 1996 1.707570 1.707570 1.707570 1.707570 1.707570 1.716003
class(ga_ts)
## [1] "ts"
start(ga_ts)
## [1] 1996    4
end(ga_ts)
## [1] 2019    3
frequency(ga_ts)
## [1] 12
summary(ga_ts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.708   1.806   1.875   1.866   1.924   2.009

Basic Time Series Analysis for Indiana to Date

in_ts <- ts(zillow_in$median, start = as.yearmon(zillow_in$date[1]), freq = 12)
head(in_ts)
##           Apr      May      Jun      Jul      Aug      Sep
## 1996 1.785330 1.785330 1.778151 1.770852 1.770852 1.763428
class(in_ts)
## [1] "ts"
start(in_ts)
## [1] 1996    4
end(in_ts)
## [1] 2019    3
frequency(in_ts)
## [1] 12
summary(in_ts)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.763   1.833   1.845   1.850   1.882   1.978
sum(is.na(ga_ts))
## [1] 0
sum(is.na(in_ts))
## [1] 0

Time Series Plot up to the Current Date for Georgia

p1 <- plot(ga_ts)

Time Series Plot up to the Current Date for Indiana

p2 <- plot(in_ts)
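As a hedged aside (not part of the original analysis), a seasonal decomposition makes the trend and seasonality that the ARIMA models will have to capture explicit:

# Decompose the Georgia series into trend, seasonal and remainder components
plot(stl(ga_ts, s.window = "periodic"))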

Calculating and plotting the autocorrelation function (ACF) and partial autocorrelation function (PACF) for Georgia. These help us determine the parameters needed to fit the time series data to an ARIMA (AutoRegressive Integrated Moving Average) model.

acf(ga_ts)

acf(diff(diff(log(ga_ts)))) #q=2, d=2

pacf(diff(diff(log(ga_ts)))) #p=1

Calculating and plotting the autocorrelation function and partial autocorrelation function for Indiana

acf(in_ts)

acf(diff(diff(log(in_ts)))) #q=1, d=2

pacf(diff(diff(log(in_ts)))) #p=1, 
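As a hedged cross-check on the orders read off the ACF/PACF plots (not in the original analysis), the forecast package can also suggest the differencing orders directly:

# Suggested non-seasonal differencing order d for each state
ndiffs(ga_ts)
ndiffs(in_ts)
# Suggested seasonal differencing order D
nsdiffs(ga_ts)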

Let’s look at the monthly price variation across the dataset for Georgia and Indiana.

boxplot(ga_ts ~ cycle(ga_ts), xlab = "Month", ylab = "Median", main = "Monthly median values for GA, 1996-2019")

boxplot(in_ts ~ cycle(in_ts), xlab = "Month", ylab = "Median", main = "Monthly median values for IN, 1996-2019")

We can turn trace on for auto.arima so that we can see the candidate models it evaluates and justify our hyperparameter selection; a sketch of this call is shown below.

### What is ARIMA?

ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a class of model that captures a suite of different standard temporal structures in time series data.

When we fit the model we use the auto.arima() function. auto.arima() takes into account the AIC and BIC values of each candidate to determine the best combination of parameters. AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are estimators used to compare models; the lower these values, the better the model.

Creating the auto ARIMA models for Georgia and Indiana

set.seed(41)
gamodel <- auto.arima(ga_ts)
inmodel <- auto.arima(in_ts)
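A minimal sketch of the trace idea mentioned above (an assumed call, not what was run here): with trace = TRUE, auto.arima prints every candidate model along with its AICc, which documents why the final combination of parameters was selected.

# Hypothetical call: print each candidate model and its AICc
auto.arima(ga_ts, trace = TRUE, stepwise = TRUE, approximation = FALSE)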

Summary of the model for Georgia

summary(gamodel)
## Series: ga_ts 
## ARIMA(0,2,2)(1,0,0)[12] 
## 
## Coefficients:
##           ma1     ma2    sar1
##       -1.2933  0.4755  0.0521
## s.e.   0.0602  0.0595  0.0626
## 
## sigma^2 estimated as 9.316e-06:  log likelihood=1199.01
## AIC=-2390.03   AICc=-2389.88   BIC=-2375.58
## 
## Training set error measures:
##                        ME        RMSE         MAE         MPE      MAPE
## Training set 4.052283e-05 0.003024377 0.002531881 0.002843619 0.1362667
##                    MASE       ACF1
## Training set 0.08739804 0.02491236

Summary of the model for Indiana

summary(inmodel)
## Series: in_ts 
## ARIMA(2,2,3)(2,0,1)[12] 
## 
## Coefficients:
##          ar1     ar2      ma1     ma2     ma3     sar1    sar2    sma1
##       0.2162  0.6027  -1.4073  0.1232  0.2996  -0.0633  0.0615  0.0815
## s.e.  0.4537  0.3231   0.4593  0.3517  0.2090   0.9779  0.0669  0.9778
## 
## sigma^2 estimated as 1.183e-05:  log likelihood=1168.74
## AIC=-2319.47   AICc=-2318.79   BIC=-2286.96
## 
## Training set error measures:
##                        ME        RMSE         MAE        MPE      MAPE
## Training set 0.0002333671 0.003375994 0.002653448 0.01290224 0.1437155
##                   MASE         ACF1
## Training set 0.1836091 -0.005631734

For both states, the AIC and BIC values are quite low, so we can say the models fit well.

Fitting ARIMA for Georgia

We’ll fit an ARIMA model to the Georgia series with the p, d, q values found from our auto.arima model:

p=0; d=2; q=2.

fit_ga = arima(ga_ts, c(0,2,2), seasonal = list(order = c(0,2,2), period = 12))
fit_ga
## 
## Call:
## arima(x = ga_ts, order = c(0, 2, 2), seasonal = list(order = c(0, 2, 2), period = 12))
## 
## Coefficients:
##           ma1     ma2     sma1    sma2
##       -1.3109  0.4975  -1.9362  0.9630
## s.e.   0.0729  0.0720   0.1300  0.1301
## 
## sigma^2 estimated as 9.922e-06:  log likelihood = 1036.73,  aic = -2063.46
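As a hedged diagnostic aside (not in the original analysis), the residuals of the fitted model can be checked for leftover autocorrelation before trusting the forecasts:

# Ljung-Box test and residual plots for the Georgia fit
checkresiduals(fit_ga)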

Fitting ARIMA for Indiana

We’ll fit an ARIMA model to the Indiana series with the p, d, q values found from our auto.arima model:

p=2; d=2; q=3.

fit_in = arima(in_ts, c(2,2,3), seasonal = list(order = c(2,2,3), period = 12))
fit_in
## 
## Call:
## arima(x = in_ts, order = c(2, 2, 3), seasonal = list(order = c(2, 2, 3), period = 12))
## 
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
##           ar1      ar2     ma1      ma2      ma3     sar1    sar2     sma1
##       -1.8119  -0.8583  0.6798  -0.7913  -0.5795  -0.4977  0.0076  -1.1989
## s.e.   0.0703   0.0681  0.1027   0.0602   0.1013      NaN  0.0833      NaN
##          sma2    sma3
##       -0.1519  0.3775
## s.e.      NaN     NaN
## 
## sigma^2 estimated as 1.397e-05:  log likelihood = 1006.76,  aic = -1991.53

Predictions for Georgia

Now, we’ll use the fitted ARIMA model to forecast the next 10 years for Georgia using the n.ahead parameter.

pred_ga = predict(fit_ga, n.ahead = 10*12)
pred_ga$pred
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2019                            2.011295 2.015141 2.017369 2.021534
## 2020 2.039541 2.040290 2.045311 2.047479 2.052107 2.054802 2.059637
## 2021 2.081615 2.082898 2.088545 2.091400 2.096883 2.100115 2.105693
## 2022 2.132071 2.133959 2.140304 2.143916 2.150326 2.154168 2.160559
## 2023 2.191765 2.194331 2.201444 2.205887 2.213294 2.217817 2.225094
## 2024 2.261558 2.264872 2.272826 2.278170 2.286646 2.291922 2.300155
## 2025 2.342306 2.346441 2.355306 2.361623 2.371239 2.377339 2.386601
## 2026 2.434867 2.439894 2.449743 2.457104 2.467933 2.474929 2.485290
## 2027 2.540101 2.546091 2.556995 2.565472 2.577584 2.585547 2.597079
## 2028 2.658865 2.665890 2.677920 2.687584 2.701052 2.710054 2.722828
## 2029 2.792017 2.800148 2.813377                                    
##           Aug      Sep      Oct      Nov      Dec
## 2019 2.023286 2.026247 2.029999 2.031994 2.033514
## 2020 2.061983 2.065767 2.070190 2.072749 2.074838
## 2021 2.108703 2.113382 2.118548 2.121743 2.124471
## 2022 2.164305 2.169951 2.175931 2.179833 2.183273
## 2023 2.229647 2.236331 2.243197 2.247877 2.252100
## 2024 2.305587 2.313381 2.321204 2.326734 2.331812
## 2025 2.392984 2.401958 2.410810 2.417262 2.423266
## 2026 2.492694 2.502922 2.512874 2.520319 2.527321
## 2027 2.605578 2.617129 2.628253 2.636763 2.644834
## 2028 2.732492 2.745439 2.757806 2.767452 2.776664
## 2029

Predictions for Indiana

Now, we’ll use the fitted ARIMA model to forecast the next 10 years for Indiana using the n.ahead parameter.

pred_in = predict(fit_in, n.ahead = 10*12)
pred_in$pred
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2019                            1.981140 1.986097 1.989811 1.994894
## 2020 2.019090 2.023464 2.025875 2.031306 2.035750 2.040759 2.046119
## 2021 2.074169 2.078365 2.081861 2.087318 2.092934 2.098080 2.104539
## 2022 2.135892 2.140759 2.144473 2.150859 2.156822 2.162860 2.169799
## 2023 2.204880 2.210294 2.214448 2.221529 2.228139 2.234801 2.242519
## 2024 2.281334 2.287263 2.291886 2.299698 2.306923 2.314292 2.322724
## 2025 2.365381 2.371898 2.376965 2.385554 2.393396 2.401493 2.410664
## 2026 2.457276 2.464367 2.469921 2.479281 2.487775 2.496602 2.506538
## 2027 2.557196 2.564890 2.570936 2.581095 2.590248 2.599831 2.610540
## 2028 2.665350 2.673661 2.680220 2.691191 2.701024 2.711375 2.722879
## 2029 2.781939 2.790884 2.797972                                    
##           Aug      Sep      Oct      Nov      Dec
## 2019 1.996840 2.002128 2.008421 2.010642 2.012837
## 2020 2.048613 2.054697 2.061296 2.064480 2.066417
## 2021 2.107171 2.114231 2.121407 2.125113 2.127494
## 2022 2.173100 2.180746 2.188771 2.192874 2.195738
## 2023 2.246240 2.254740 2.263427 2.268114 2.271353
## 2024 2.326973 2.336243 2.345702 2.350900 2.354609
## 2025 2.415441 2.425524 2.435743 2.441501 2.445666
## 2026 2.511856 2.522768 2.533767 2.540094 2.544740
## 2027 2.616422 2.628176 2.639974 2.646888 2.652031
## 2028 2.729337 2.741952 2.754565 2.762083 2.767739
## 2029

Converting the logarithmic predictions back into absolute values by taking the inverse of log10.

pred1_ga = round(10^pred_ga$pred,0)
pred2_in = round(10^pred_in$pred,0)
fit1_ga = round(10^ga_ts)
fit2_in = round(10^in_ts)
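As an aside not in the original analysis, predict() on an ARIMA fit also returns standard errors, so approximate 95% forecast intervals on the original scale can be sketched alongside the point forecasts:

# Hypothetical sketch: approximate 95% interval for the Georgia forecast
upper_ga <- round(10^(pred_ga$pred + 1.96 * pred_ga$se), 0)
lower_ga <- round(10^(pred_ga$pred - 1.96 * pred_ga$se), 0)
head(cbind(lower = lower_ga, forecast = pred1_ga, upper = upper_ga))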

Let’s plot the current and forecasted values from our models.

ts.plot(fit1_ga, pred1_ga, fit2_in,pred2_in,
        col = c(2:1,4:1), lwd = 2)

Let’s visualize the same graph interactively using the plotly library.

df_fit1_ga <- fortify(timeSeries::as.timeSeries(fit1_ga))
df_pred1_ga <- fortify(timeSeries::as.timeSeries(pred1_ga))
df_fit2_in <- fortify(timeSeries::as.timeSeries(fit2_in))
df_pred2_in <- fortify(timeSeries::as.timeSeries(pred2_in))
dfc <- cbind(df_fit1_ga,df_fit2_in)

colnames(dfc) <- c("index_ga", "data_ga", "index_in", "data_in")
dfc <- dfc %>% select(index_ga,data_ga, data_in)


dfc1 <- cbind(df_pred1_ga,df_pred2_in)


colnames(dfc1) <- c("index_ga", "data_ga", "index_in", "data_in")
dfc1 <- dfc1 %>% select(index_ga,data_ga, data_in)

df_main <- rbind(dfc, dfc1)
plot1 <- plot_ly(dfc,x = ~index_ga, y = ~data_ga, name = 'Current Median GA', type = 'scatter', mode = 'lines',
                 line = list(color = 'rgb(205, 12, 24)', width = 4) ) %>%
  add_trace(y = ~data_in, name = 'Median IN', line = list(color = 'rgb(22, 96, 167)', width = 4))


plot2 <- plot_ly(dfc1,x = ~index_ga, y = ~data_ga, name = 'Forecast Median GA', type = 'scatter', mode = 'lines',
                 line = list(color = 'rgb(205, 12, 24)', width = 4) ) %>%
  add_trace(y = ~data_in, name = 'Forecast Median IN', line = list(color = 'rgb(22, 96, 167)', width = 4))


plot_main <- subplot(plot1, plot2)
plot_main
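Since the investment idea rests on the two markets diverging, a hedged follow-up sketch (not part of the original analysis) is to plot the GA/IN ratio over both history and forecast; a widening ratio points to one state outpacing the other:

# Hypothetical sketch: ratio of GA to IN median value, historical and forecasted
df_main$ratio <- df_main$data_ga / df_main$data_in
plot_ly(df_main, x = ~index_ga, y = ~ratio, type = 'scatter', mode = 'lines',
        name = 'GA / IN median ratio')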

Predicting House Price Index

Now, let’s build models to predict the house price index.

summary(df)
##                   address.street  address.zipcode      address.city  
##  2828 Routh St STE 100   :   92   75201  :   98   Brooklyn   :  691  
##  1330 Campus Pkwy        :   31   11215  :   82   Chicago    :  543  
##  134 Creek Shoals Dr     :   30   8701   :   54   New York   :  541  
##  36 Mill Plain Rd STE 310:   30   30005  :   46   Los Angeles:  391  
##  Byron Ave # NA          :   30   11222  :   45   Houston    :  378  
##  123 N Main St           :   25   (Other):37692   (Other)    :35463  
##  (Other)                 :37783   NA's   :    4   NA's       :   14  
##                   useCode      taxAssessmentYear taxAssessment      
##  SingleFamily         :27817   Min.   :1999      Min.   :1.000e+00  
##  Condominium          : 2460   1st Qu.:2017      1st Qu.:1.418e+05  
##  MultiFamily2To4      : 1560   Median :2018      Median :2.615e+05  
##  Apartment            : 1183   Mean   :2018      Mean   :1.607e+06  
##  VacantResidentialLand: 1070   3rd Qu.:2018      3rd Qu.:4.757e+05  
##  Unknown              :  891   Max.   :2020      Max.   :1.460e+09  
##  (Other)              : 3040   NA's   :5402      NA's   :5768       
##    yearBuilt     lotSizeSqFt         finishedSqFt       bathrooms      
##  Min.   :1111   Min.   :1.000e+00   Min.   :      1   Min.   :  0.010  
##  1st Qu.:1958   1st Qu.:6.990e+03   1st Qu.:   1336   1st Qu.:  1.750  
##  Median :1983   Median :1.307e+04   Median :   1930   Median :  2.000  
##  Mean   :1975   Mean   :7.835e+05   Mean   :   5940   Mean   :  2.385  
##  3rd Qu.:2000   3rd Qu.:4.356e+04   3rd Qu.:   2856   3rd Qu.:  3.000  
##  Max.   :2019   Max.   :2.147e+09   Max.   :3634314   Max.   :150.000  
##  NA's   :7222   NA's   :7512        NA's   :6294      NA's   :9469     
##     bedrooms       zestimate.amount.text localRealEstate.name
##  Min.   :  0.000   Min.   :     7016     Downtown  :  318    
##  1st Qu.:  2.000   1st Qu.:   207133     Oak Lawn  :  132    
##  Median :  3.000   Median :   341025     Austin    :  101    
##  Mean   :  2.848   Mean   :   686347     Lakewood  :   95    
##  3rd Qu.:  4.000   3rd Qu.:   575498     Park Slope:   86    
##  Max.   :231.000   Max.   :407419190     (Other)   :36087    
##  NA's   :7602      NA's   :7390          NA's      : 1202    
##    localRealEstate.type    lastSoldDate   lastSoldPrice.text 
##  city        :24541     6/27/2012:   35   Min.   :        1  
##  neighborhood:12278     4/26/1996:   30   1st Qu.:   137000  
##  NA's        : 1202     1/20/2004:   27   Median :   246500  
##                         7/13/2018:   26   Mean   :   756839  
##                         9/2/2009 :   26   3rd Qu.:   442250  
##                         (Other)  :22400   Max.   :150000000  
##                         NA's     :15477   NA's   :15758      
##  valuationRange.low valuationRange.high zestimate.amount.text.1
##  Mode:logical       Mode:logical        Min.   :     7016      
##  NA's:38021         NA's:38021          1st Qu.:   207133      
##                                         Median :   341025      
##                                         Mean   :   686347      
##                                         3rd Qu.:   575498      
##                                         Max.   :407419190      
##                                         NA's   :7390           
##  localRealEstate.name.1  localRealEstate.type.1
##  Downtown  :  318       city        :24541     
##  Oak Lawn  :  132       neighborhood:12278     
##  Austin    :  101       NA's        : 1202     
##  Lakewood  :   95                              
##  Park Slope:   86                              
##  (Other)   :36087                              
##  NA's      : 1202
dim(df)
## [1] 38021    21

Data Cleaning & Preprocessing

df.clean <- df %>%
  select(useCode, taxAssessment, yearBuilt, lotSizeSqFt, finishedSqFt, bathrooms, bedrooms, lastSoldPrice.text) %>%
  drop_na() %>%
  mutate(yearOld = 2019 - yearBuilt) %>%   # age of the property as of 2019
  select(useCode, taxAssessment, yearOld, lotSizeSqFt, finishedSqFt, bathrooms, bedrooms, lastSoldPrice.text)

summary(df.clean$lastSoldPrice.text)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        1   142000   240000   402642   409400 97500000
# Trim extreme values: keep rows below the 75th percentile of sale price
df.clean <- df.clean[ df.clean$lastSoldPrice.text < quantile(df.clean$lastSoldPrice.text , 0.75 ) , ]

summary(df.clean$yearOld)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   20.00   35.00   43.83   61.00  319.00
# Trim extreme values: keep rows below the 75th percentile of property age
df.clean <- df.clean[ df.clean$yearOld < quantile(df.clean$yearOld , 0.75 ) , ]

ggplot(df.clean, aes(x =yearOld, y=lastSoldPrice.text)) + geom_point()

ggplot(df.clean, aes(x =useCode, y=lastSoldPrice.text)) + geom_boxplot()

cor <- cor(df.clean[,-1])

#correlation Matrix
corrplot(cor, method="shade",
         tl.col = "black", tl.srt = 45, tl.cex = 0.75,cl.cex = 0.75,
         sig.level = 0.01, insig = "pch", pch.cex = 0.8)

# One-hot encode the categorical useCode column with caret's dummyVars
df.cleanDum <- dummyVars("~.", data = df.clean) 
trnf <- data.frame(predict(df.cleanDum, newdata = df.clean))

Let’s set a seed and split the data into training and testing sets.

set.seed(123)
ind = createDataPartition(trnf$lastSoldPrice.text, p =0.8, list = FALSE)
trnf.train <- trnf[ind,]
trnf.test <- trnf[-ind,]
anyNA(trnf)
## [1] FALSE

Linear Regression

lmMod <- lm(lastSoldPrice.text ~., data=trnf.train)
summary(lmMod)
## 
## Call:
## lm(formula = lastSoldPrice.text ~ ., data = trnf.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -530782  -54314    -381   63834  284666 
## 
## Coefficients: (3 not defined because of singularities)
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1.567e+05  4.133e+04   3.790 0.000152 ***
## useCode.Apartment             -5.649e+04  4.476e+04  -1.262 0.206901    
## useCode.Condominium           -3.097e+04  4.150e+04  -0.746 0.455612    
## useCode.Cooperative                   NA         NA      NA       NA    
## useCode.Duplex                -9.093e+04  4.576e+04  -1.987 0.046966 *  
## useCode.Miscellaneous         -5.973e+04  4.707e+04  -1.269 0.204529    
## useCode.Mobile                -1.263e+05  4.228e+04  -2.986 0.002833 ** 
## useCode.MultiFamily2To4       -7.307e+04  4.246e+04  -1.721 0.085271 .  
## useCode.MultiFamily5Plus      -5.740e+04  5.810e+04  -0.988 0.323199    
## useCode.Quadruplex            -1.502e+04  7.695e+04  -0.195 0.845240    
## useCode.SingleFamily          -5.710e+04  4.113e+04  -1.388 0.165061    
## useCode.Townhouse             -4.543e+04  4.164e+04  -1.091 0.275325    
## useCode.Triplex               -1.405e+05  7.692e+04  -1.827 0.067752 .  
## useCode.Unknown                       NA         NA      NA       NA    
## useCode.VacantResidentialLand         NA         NA      NA       NA    
## taxAssessment                  1.486e-01  7.297e-03  20.357  < 2e-16 ***
## yearOld                       -4.720e+02  7.695e+01  -6.134 9.02e-10 ***
## lotSizeSqFt                   -6.641e-05  1.039e-04  -0.639 0.522712    
## finishedSqFt                  -4.827e+00  1.347e+00  -3.583 0.000342 ***
## bathrooms                      2.191e+04  1.872e+03  11.704  < 2e-16 ***
## bedrooms                       1.044e+04  1.547e+03   6.745 1.65e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91760 on 6918 degrees of freedom
## Multiple R-squared:  0.1781, Adjusted R-squared:  0.1761 
## F-statistic: 88.21 on 17 and 6918 DF,  p-value: < 2.2e-16
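The coefficients reported as NA (“not defined because of singularities”) are expected: dummyVars emits a full set of indicator columns, so some of them are constant or linearly dependent once the intercept is included. A quick hedged check (not in the original) of the columns with no variation in this split:

# Dummy columns that are constant in the training split
names(trnf.train)[sapply(trnf.train, function(col) length(unique(col)) == 1)]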

Gradient Boosting, using all available cores

gbm.fit <- gbm(
  formula = lastSoldPrice.text ~ .,
  distribution = "gaussian",
  data = trnf.train,
  n.trees = 10000,
  interaction.depth = 1,
  shrinkage = 0.001,
  cv.folds = 5,
  n.cores = NULL, # NULL lets gbm detect and use the available cores
  verbose = FALSE
)  
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution =
## distribution, : variable 3: useCode.Cooperative has no variation.
## Warning in gbm.fit(x = x, y = y, offset = offset, distribution =
## distribution, : variable 13: useCode.Unknown has no variation.

Get MSE and compute RMSE

min_MSE <- which.min(gbm.fit$cv.error)
sqrt(gbm.fit$cv.error[min_MSE])
## [1] 80594.81

Plot loss function as a result of n trees added to the ensemble

gbm.perf(gbm.fit, method = "cv")
## [1] 10000
pred <- predict(gbm.fit, n.trees = 10000, trnf.test)

RMSE(pred, trnf.test$lastSoldPrice.text)
## [1] 81252
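A small hedged variation (not part of the original run): predict at the CV-selected number of trees rather than hard-coding 10,000. Here gbm.perf already selects the full 10,000 trees, but in general the CV-selected iteration is the fairer choice for the test RMSE.

# Use the iteration chosen by cross-validation for prediction
best_iter <- gbm.perf(gbm.fit, method = "cv", plot.it = FALSE)
pred_best <- predict(gbm.fit, trnf.test, n.trees = best_iter)
RMSE(pred_best, trnf.test$lastSoldPrice.text)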

H2O GLM

# Training data: Separate into x and y tibbles
x_train_tbl <- trnf.train %>% select(-lastSoldPrice.text)
y_train_tbl <- trnf.train %>% select(lastSoldPrice.text)

# Testing data: held out for evaluation
x_test_tbl <- trnf.test %>% select(-lastSoldPrice.text)
y_test_tbl <- trnf.test %>% select(lastSoldPrice.text)

Pushing data into H2O

data_h2o <- as.h2o(
  bind_cols(y_train_tbl, x_train_tbl),
  destination_frame= "train.hex")
new_data_h2o <- as.h2o(
  x_test_tbl,
  destination_frame= "test.hex") 

Partition the data into training, validation and test sets

splits <- h2o.splitFrame(data = data_h2o,
                         ratios = c(0.7, 0.15), # 70/15/15 split
                         seed = 1234)
train_h2o <- splits[[1]] # 70% of the training data
valid_h2o <- splits[[2]] # 15% validation split
test_h2o <- splits[[3]] # 15% held-out split
y <- "lastSoldPrice.text"
x <- setdiff(names(train_h2o),y)

Fitting GLM

grid <- h2o.glm(
  x = x,
  y = y,
  
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  
  nfolds = 5,
  alpha = 0.01,
  lambda = 0.0001,
  early_stopping = TRUE
  )
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Dropping bad and constant columns: [useCode.Cooperative, useCode.Triplex, useCode.Unknown].
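To put the GLM on the same footing as the GBM above, a hedged way to score it on the held-out split (not shown in the original run) is:

# Hypothetical evaluation: RMSE of the H2O GLM on the held-out split
perf <- h2o.performance(grid, newdata = test_h2o)
h2o.rmse(perf)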

There is considerable scope for optimization in this work. The data is skewed and fairly sparse. The current models predict house prices with relatively high RMSE values, and they can be tuned further to improve accuracy.

Among the models above, GBM performed best, giving the lowest RMSE. GBM is well suited to a Gaussian-distributed target, and the distribution family (Gaussian, Poisson, binomial, etc.) can always be specified when fitting the GBM model.

Shutdown H2O

h2o.shutdown(prompt = F)
## [1] TRUE