Title: Zillow Analysis

Team Members: Akansha Jain, Ranjani Anjur Venkatraman, Swapna Bandaru

Business Context: Zillow is an American online real estate database that aims to give consumers the information they need to make informed decisions when buying, selling and renting properties. Our aim is to predict the selling price of properties using the listings available on Zillow for multiple neighborhoods.

Problem Description: A property owner who wants to sell may face the following issues: (1) they may not be aware of the current price of the property; (2) they may not know the prices of comparable properties in the area.

URL link : https://github.com/RanjaniAnjurVenkatraman/Zillow_Data/blob/master/zillow_final.zip

knitr::opts_chunk$set(warning=FALSE, message=FALSE)

Data Summary

The dataset used in this assignment was assembled from external data sources. To collect valid New York addresses, we used the New York City PLUTO database. The addresses were then used to fetch ZPIDs (Zillow Property IDs) through the GetDeepSearchResults API. Next, the collected ZPIDs were passed to the GetDeepComps API to fetch the final dataset. The final dataset contains 24,482 observations of Zillow properties. There are 28 columns describing these properties, including address, zipcode, city, state, latitude, longitude, region name, region id, type, Zestimate, Zest_lastupdated, zest_monthlyChange, Zest_percentile, Zestimate_low, Zestimate_high, compsScore, bathrooms, bedrooms, finishedSqFt, lastSoldPrice, lotSizeSqFt, taxAssessment, taxAssessmentYear, totalRooms, YearBuilt. The final dataset was converted to CSV and saved in our Git repository, available at https://github.com/RanjaniAnjurVenkatraman/Zillow_Data/blob/master/zillow_final.zip.

The analysis begins with loading the dataset and the required libraries in R.

library(dplyr)
library(tidyverse)
library(knitr)
library(ggthemes)
library(caret)
library(h2o)
library(ggplot2)
library(geosphere)
library(kableExtra)

Data load

We use the data uploaded to our Git repository. We download the zip file from the repository link into a temporary file and then read the CSV inside it into R for analysis.

temp <- tempfile()
download.file("https://raw.githubusercontent.com/RanjaniAnjurVenkatraman/Zillow_Data/master/zillow_final.zip",temp)
zdata <- read.csv(unz(temp, "zillow.csv"), encoding="UTF-8", na.strings=c("","NA"), stringsAsFactors = F)
unlink(temp)

str(zdata)

## 'data.frame':    24482 obs. of  28 variables:
##  $ X                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ zpid              : int  2096383529 121711481 245067526 245014557 245012575 30566078 30566079 68312584 300077093 68312616 ...
##  $ address           : chr  NA "183 Plymouth St PENTHOUSE N" "130 Furman St APT 408" "51 Jay St APT 1M" ...
##  $ zipcode           : int  NA 11201 11201 11201 11201 11201 11201 11201 11201 11201 ...
##  $ city              : chr  NA "Brooklyn" "Brooklyn" "Brooklyn" ...
##  $ state             : chr  NA "NY" "NY" "NY" ...
##  $ lat               : num  NA 40.7 40.7 40.7 40.7 ...
##  $ long              : num  NA -74 -74 -74 -74 ...
##  $ region_name       : chr  NA "DUMBO" "Brooklyn Heights" "DUMBO" ...
##  $ region_id         : int  NA 270841 403122 270841 403122 270841 270841 270841 403122 270841 ...
##  $ type              : chr  NA "neighborhood" "neighborhood" "neighborhood" ...
##  $ zestimate         : int  NA 4108248 4903925 2062475 2376619 3321319 4405368 2351179 2879316 2168759 ...
##  $ zest_lastupdated  : chr  NA "12/1/2019" "12/1/2019" "12/1/2019" ...
##  $ zest_monthlychange: int  NA 19592 52400 -6919 -8769 -13587 -18049 -9633 -54455 -7262 ...
##  $ zest_percentile   : int  NA 97 97 83 88 95 97 88 0 85 ...
##  $ zestimate_low     : int  NA 3738506 4413533 1918102 2186489 3055613 4052939 2163085 2591384 2016946 ...
##  $ zestimate_high    : int  NA 4313660 5443357 2206848 2566749 3620238 4801851 2562785 3080868 2320572 ...
##  $ compscore         : int  NA NA 5 11 11 4 3 10 8 11 ...
##  $ bathrooms         : num  NA NA 4 3 1 2.5 3.5 3 NA 3 ...
##  $ bedrooms          : int  NA NA 3 2 1 2 3 3 0 3 ...
##  $ finishedSqFt      : int  NA 2156 4163 1451 1633 2198 2592 1700 1746 1656 ...
##  $ lastSoldDate      : chr  NA "5/22/2014" "7/2/2019" "7/9/2019" ...
##  $ lastSoldPrice     : num  NA 3082039 4291240 2078000 2400000 ...
##  $ lotSizeSqFt       : int  NA NA NA NA NA NA NA NA 12876 NA ...
##  $ taxAssessment     : num  NA 752407 1078954 248039 425741 ...
##  $ taxAssessmentYear : int  NA 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
##  $ totalRooms        : int  NA NA NA 4 NA NA NA NA NA NA ...
##  $ yearBuilt         : int  NA 1920 2015 2016 2015 1913 1916 1916 2012 1916 ...

rmarkdown::paged_table(zdata)

Checking the data types of the columns

sapply(zdata,class)

##                  X               zpid            address 
##          "integer"          "integer"        "character" 
##            zipcode               city              state 
##          "integer"        "character"        "character" 
##                lat               long        region_name 
##          "numeric"          "numeric"        "character" 
##          region_id               type          zestimate 
##          "integer"        "character"          "integer" 
##   zest_lastupdated zest_monthlychange    zest_percentile 
##        "character"          "integer"          "integer" 
##      zestimate_low     zestimate_high          compscore 
##          "integer"          "integer"          "integer" 
##          bathrooms           bedrooms       finishedSqFt 
##          "numeric"          "integer"          "integer" 
##       lastSoldDate      lastSoldPrice        lotSizeSqFt 
##        "character"          "numeric"          "integer" 
##      taxAssessment  taxAssessmentYear         totalRooms 
##          "numeric"          "integer"          "integer" 
##          yearBuilt 
##          "integer"

Data Wrangling

Converting the required columns to numeric

zdata$zpid<-as.numeric(zdata$zpid)
zdata$zestimate<-as.numeric(zdata$zestimate)
zdata$zestimate_low<-as.numeric(zdata$zestimate_low)
zdata$zestimate_high<-as.numeric(zdata$zestimate_high)
zdata$lotSizeSqFt<-as.numeric(zdata$lotSizeSqFt)
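
The five conversions above can also be written as a single assignment over a vector of column names. The sketch below uses a small made-up data frame (`toy`, a stand-in for `zdata`, with only two of the columns) and is only meant to illustrate the pattern. It also prints the largest value an R integer can hold, which is one reason to store large IDs and dollar amounts as numeric.

```r
# Toy data frame standing in for zdata (values chosen for illustration only)
toy <- data.frame(
  zpid      = c(30566078L, 121711481L),
  zestimate = c(3321319L, 4108248L),
  city      = c("Brooklyn", "Brooklyn"),
  stringsAsFactors = FALSE
)

# Convert several integer columns to numeric (double) in one step
num_cols <- c("zpid", "zestimate")
toy[num_cols] <- lapply(toy[num_cols], as.numeric)

# Confirm the new classes; city is left untouched
sapply(toy, class)

# R integers are 32-bit, so this is the largest value they can represent
.Machine$integer.max
```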

summary(zdata)

##        X              zpid             address             zipcode     
##  Min.   :    1   Min.   :3.000e+07   Length:24482       Min.   : 7030  
##  1st Qu.: 6121   1st Qu.:6.491e+07   Class :character   1st Qu.:11201  
##  Median :12242   Median :8.907e+07   Mode  :character   Median :11201  
##  Mean   :12242   Mean   :2.918e+08                      Mean   :11179  
##  3rd Qu.:18362   3rd Qu.:2.161e+08                      3rd Qu.:11216  
##  Max.   :24482   Max.   :2.147e+09                      Max.   :11385  
##                                                         NA's   :651    
##      city              state                lat             long       
##  Length:24482       Length:24482       Min.   :40.59   Min.   :-74.09  
##  Class :character   Class :character   1st Qu.:40.68   1st Qu.:-73.99  
##  Mode  :character   Mode  :character   Median :40.69   Median :-73.98  
##                                        Mean   :40.69   Mean   :-73.98  
##                                        3rd Qu.:40.70   3rd Qu.:-73.97  
##                                        Max.   :40.80   Max.   :-73.86  
##                                        NA's   :651     NA's   :651     
##  region_name          region_id          type             zestimate      
##  Length:24482       Min.   :  5837   Length:24482       Min.   : 173321  
##  Class :character   1st Qu.:270816   Class :character   1st Qu.: 748146  
##  Mode  :character   Median :272902   Mode  :character   Median : 992678  
##                     Mean   :290966                      Mean   :1187194  
##                     3rd Qu.:273903                      3rd Qu.:1411105  
##                     Max.   :403223                      Max.   :9217860  
##                     NA's   :651                         NA's   :1782     
##  zest_lastupdated   zest_monthlychange zest_percentile zestimate_low    
##  Length:24482       Min.   :-1249670   Min.   : 0.00   Min.   : 159455  
##  Class :character   1st Qu.:   -4801   1st Qu.:10.00   1st Qu.: 695776  
##  Mode  :character   Median :   -3022   Median :34.00   Median : 915005  
##                     Mean   :   -3700   Mean   :38.23   Mean   :1088827  
##                     3rd Qu.:   -2015   3rd Qu.:62.00   3rd Qu.:1282860  
##                     Max.   :  953637   Max.   :99.00   Max.   :8664788  
##                     NA's   :2116       NA's   :651     NA's   :1782     
##  zestimate_high      compscore        bathrooms         bedrooms     
##  Min.   : 186192   Min.   : 0.000   Min.   : 0.500   Min.   : 0.000  
##  1st Qu.: 804710   1st Qu.: 5.000   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median :1074137   Median : 7.000   Median : 1.000   Median : 2.000  
##  Mean   :1294075   Mean   : 7.629   Mean   : 1.596   Mean   : 1.867  
##  3rd Qu.:1539740   3rd Qu.:10.000   3rd Qu.: 2.000   3rd Qu.: 2.000  
##  Max.   :9770932   Max.   :18.000   Max.   :12.000   Max.   :18.000  
##  NA's   :1782      NA's   :1786     NA's   :4311     NA's   :3772    
##   finishedSqFt    lastSoldDate       lastSoldPrice       lotSizeSqFt    
##  Min.   :     1   Length:24482       Min.   :      10   Min.   :     1  
##  1st Qu.:   750   Class :character   1st Qu.:  743000   1st Qu.:  1800  
##  Median :  1019   Mode  :character   Median :  960000   Median :  2260  
##  Mean   :  2613                      Mean   : 1135915   Mean   : 10488  
##  3rd Qu.:  1455                      3rd Qu.: 1369500   3rd Qu.:  4257  
##  Max.   :883265                      Max.   :62742225   Max.   :980100  
##  NA's   :5670                        NA's   :1362       NA's   :17439   
##  taxAssessment      taxAssessmentYear   totalRooms       yearBuilt   
##  Min.   :    1000   Min.   :2007      Min.   : 1.000   Min.   :1833  
##  1st Qu.:  150802   1st Qu.:2018      1st Qu.: 3.000   1st Qu.:1910  
##  Median :  223129   Median :2018      Median : 4.000   Median :1931  
##  Mean   : 1192765   Mean   :2017      Mean   : 4.407   Mean   :1952  
##  3rd Qu.:  586000   3rd Qu.:2018      3rd Qu.: 5.000   3rd Qu.:2005  
##  Max.   :81260000   Max.   :2019      Max.   :24.000   Max.   :2019  
##  NA's   :8516       NA's   :5376      NA's   :17244    NA's   :3660
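
The NA's rows in the summary show that missingness varies widely by column; for example, lotSizeSqFt is missing in 17,439 of 24,482 rows (about 71%). A compact way to see this per column is sketched below on a made-up data frame (`toy` is hypothetical, not the Zillow data).

```r
# Made-up data with missing values in every column
toy <- data.frame(
  bedrooms    = c(1, NA, 3, 2),
  lotSizeSqFt = c(NA, NA, 1800, NA),
  zestimate   = c(900000, 750000, NA, 1200000)
)

# Count and share of missing values per column
na_count <- colSums(is.na(toy))
na_share <- round(na_count / nrow(toy), 2)

# Show the columns with the most missing values first
na_table <- data.frame(na_count, na_share)
na_table[order(-na_table$na_count), ]
```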

To arrive at a final dataset, we perform the following data cleaning steps:
1) Keep only unique observations
2) Keep only rows where the valuation (zestimate) and bedrooms are non-missing
3) Remove outliers:
- Remove homes with more than 5 bedrooms
- Remove homes valued above $10 million
- Remove neighborhoods with fewer than 10 observations

# Keep the first row for each zpid so that every property appears only once
zdata<-zdata[!duplicated(zdata$zpid),]
zclean<-zdata[!is.na(zdata$bedrooms)&!is.na(zdata$zestimate),]
zclean<-filter(zclean,bedrooms<=5)
zclean<-filter(zclean,zestimate<=10000000)
neighborhoods<-c("Canarsie","Williamsburg","Ocean Hill","Brownsville","Chelsea","Vinegar Hill","Garment District","Greenwood","Red Hook","Wingate")
zclean<-zclean[!(zclean$region_name %in% neighborhoods),]

rmarkdown::paged_table(zclean)
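
The list of small neighborhoods above is typed in by hand. As a hedged alternative, the same kind of list could be derived from the counts themselves. The sketch below does this on made-up data; `toy`, `min_obs` and `small_regions` are hypothetical names, and the result on the real data depends on which filters have already been applied.

```r
# Made-up data: three neighborhoods with 12, 3 and 10 rows
toy <- data.frame(
  region_name = c(rep("DUMBO", 12), rep("Red Hook", 3), rep("Gowanus", 10)),
  zestimate   = seq(500000, by = 10000, length.out = 25)
)

min_obs <- 10  # threshold from the cleaning rule: fewer than 10 observations

# Count rows per neighborhood and collect the names below the threshold
region_n <- table(toy$region_name)
small_regions <- names(region_n[region_n < min_obs])

# Drop every row that belongs to a small neighborhood
toy_clean <- toy[!(toy$region_name %in% small_regions), ]
table(toy_clean$region_name)
```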

Exploration and Discussion

Checking price variability based on distance

Since most of our addresses are in New York and the surrounding area, we use Midtown Manhattan as the reference point of our analysis. We calculate the distance of every property from Midtown Manhattan to compare how prices vary as the distance changes. Using the latitude and longitude in the dataset, we compute the straight-line ("as the crow flies") distance with the distGeo function from the geosphere package in R (https://cran.r-project.org/web/packages/geosphere/geosphere.pdf). Midtown Manhattan reference coordinates: (ref_lat=40.75047, ref_long=-73.98961). After the calculation, we convert the distance from metres to miles.

ref_lat=40.75047
ref_long=-73.98961
ref_loc<-c(ref_long,ref_lat)
zclean$ref_long<-ref_long
zclean$ref_lat<-ref_lat

zclean$distance <- distGeo(zclean[,c('long','lat')], zclean[,c('ref_long','ref_lat')],a=6378137,f=1/298.257223563)
zclean$distance<-zclean$distance*0.000621371
rmarkdown::paged_table(zclean)
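
As a rough cross-check of the distance column, the sketch below computes a great-circle distance with the haversine formula in base R. This is not what distGeo does internally (distGeo works on the WGS84 ellipsoid, while the sketch assumes a sphere with a mean radius of 6,371 km), so small differences are expected. The function name `haversine_miles` is ours, and the coordinates are those of 130 Furman St APT 408 from the data shown in this report.

```r
# Great-circle distance on a sphere, returned in miles.
# Assumption: spherical Earth with mean radius 6371 km (distGeo uses an ellipsoid).
haversine_miles <- function(lon1, lat1, lon2, lat2, radius_m = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  dist_m <- 2 * radius_m * asin(pmin(1, sqrt(a)))
  dist_m * 0.000621371  # metres to miles, same factor as above
}

# Property at (long, lat) = (-73.99640, 40.70043) versus the Midtown reference
d <- haversine_miles(-73.99640, 40.70043, -73.98961, 40.75047)
d
```

The result should land close to the roughly 3.47 miles that distGeo reports for this property.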

We now look at the variation in price by region name, bedroom count and bathroom count.

zclean %>% count(region_name) %>% arrange(desc(n))

## # A tibble: 51 x 2
##    region_name            n
##    <chr>              <int>
##  1 Fort Greene         3480
##  2 Brooklyn Heights    3431
##  3 Park Slope          2499
##  4 DUMBO               2058
##  5 Downtown            1961
##  6 Clinton Hill        1234
##  7 Bedford Stuyvesant   697
##  8 Prospect Heights     666
##  9 Gowanus              572
## 10 Boerum Hill          359
## # ... with 41 more rows

zclean %>% count(bedrooms)

## # A tibble: 6 x 2
##   bedrooms     n
##      <int> <int>
## 1        0  1008
## 2        1  8398
## 3        2  5881
## 4        3  1929
## 5        4   540
## 6        5   672

zclean %>% count(bathrooms)%>% arrange(desc(n))

## # A tibble: 12 x 2
##    bathrooms     n
##        <dbl> <int>
##  1       1   10540
##  2       2    5210
##  3       1.5   763
##  4       3     752
##  5       4     486
##  6      NA     376
##  7       2.5   154
##  8       5     123
##  9       3.5    17
## 10       7       4
## 11       6       2
## 12       0.5     1

summarize(group_by(zclean,region_name),med=median(zestimate),avg=mean(zestimate))%>% arrange(desc(med))

## # A tibble: 51 x 3
##    region_name                              med      avg
##    <chr>                                  <dbl>    <dbl>
##  1 Flatlands                           4569779  4569779 
##  2 Midtown                             2930558  2930733 
##  3 Tribeca                             2850148  2850148 
##  4 Carroll Gardens                     2647610  2450236.
##  5 Inwood                              2038615  2038615 
##  6 Windsor Terrace                     1606190. 1540155.
##  7 Carnegie Hill                       1575946  1575946 
##  8 Sheepshead Bay                      1483645  1322685.
##  9 Columbia Street Waterfront District 1391213  1876804.
## 10 Boerum Hill                         1362111  1661109.
## # ... with 41 more rows

summarize(group_by(zclean,bedrooms),med=median(zestimate),avg=mean(zestimate))%>% arrange(desc(med))

## # A tibble: 6 x 3
##   bedrooms     med      avg
##      <int>   <dbl>    <dbl>
## 1        4 2038330 2338327.
## 2        5 1694277 2088816.
## 3        3 1569250 1748148.
## 4        2 1194324 1231649.
## 5        1  777033  836998.
## 6        0  579750  959679.

Data Exploration/Visualization

Plot1

The plot below shows the zestimate values for the properties.

options(scipen = 999)
plot<-ggplot(data=zclean,aes(x=zestimate))
plot+geom_histogram(color = 'blue', binwidth = 100000)+ggtitle("Zillow House Prices")

Plot importance

From the above distribution, we can infer that most of the properties are priced between $100,000 and $2,500,000.

Plot2

The plot below shows the relationship between lot size in square feet (lotSizeSqFt, restricted to lots under 10,000 sq ft) and the zestimate value of the properties.

zclean$bedrooms<-as.factor(zclean$bedrooms)
zclean$bathrooms<-as.factor(zclean$bathrooms)

plot<-ggplot(data=zclean[zclean$lotSizeSqFt<10000&!is.na(zclean$lotSizeSqFt),],aes(x=lotSizeSqFt, y=zestimate))
plot+geom_point(aes(color=bedrooms))+ggtitle("Zestimate vs Sq.Ft")+ylab("Property Value in $")+scale_y_continuous(breaks=seq(0,10000000,2000000))

Plot importance

From the above distribution, we can infer that most of the properties have a lot size between 1,250 and 2,500 sq ft; the same range also contains the widest mix of bedroom counts.

Plot3

The below plot shows the relationship between distance and zestimate value.

plot<-ggplot(data=zclean,aes(x=distance, y=zestimate))
plot+geom_point(aes(color=bedrooms))+ggtitle("Zestimate vs Distance")+ylab("Property Value in $")+xlab("Distance in miles")+scale_y_continuous(breaks=seq(0,10000000,2000000))

Plot importance

From the above distribution, we can infer that most of the properties lie 3 to 6 miles from Midtown Manhattan, and most of these are priced below $2 million. The lower-priced properties mostly have 0 to 3 bedrooms. We can also see that, as the number of bedrooms increases, the property value tends to increase.

Plot4

In the below plot we represent property price distribution with respect to the number of bedrooms via box plots.

ggplot(zclean, aes(y=zestimate, x=bedrooms)) + geom_boxplot(aes(color=bedrooms))+ylab("Property Value in $")+xlab("Bedrooms")+ggtitle("Box Plots of Property Value")

Plot importance

From the above distribution, we can infer that the property price generally increases with the number of bedrooms. One exception is surprising: the median price of a 4-bedroom property is higher than that of a 5-bedroom property. This could be because many 4-bedroom properties are located in more expensive areas, or because some other factor influences the prices.

Plot5

In the plot below we show the distribution of property size (lot size in square feet) by number of bedrooms via box plots.

ggplot(zclean[zclean$lotSizeSqFt<10000&!is.na(zclean$lotSizeSqFt),], aes(y=lotSizeSqFt, x=bedrooms)) + geom_boxplot(aes(color=bedrooms))+ylab("Size in Square Feet")+xlab("Bedrooms")+ggtitle("Box Plots of Property Size")

Plot importance

From the above distribution, we can see that 75-100% of the properties with 0, 3, 4 and 5 bedrooms have a size of less than 2,500 square feet. However, more than 75% of the properties with 1 and 2 bedrooms have a size of more than 2,500 square feet. This shows that our data is skewed, which may affect the predictions made by our models.

Plot6

In the plot below we compare property prices across different areas of central Brooklyn by filtering on the areas' zip codes.

zclean1 <- zclean
head(zclean1)

##   X      zpid                 address zipcode     city state      lat
## 1 3 245067526   130 Furman St APT 408   11201 Brooklyn    NY 40.70043
## 2 4 245014557        51 Jay St APT 1M   11201 Brooklyn    NY 40.70341
## 3 5 245012575    90 Furman St APT 801   11201 Brooklyn    NY 40.70154
## 4 6  30566078        1 Main St APT 3A   11201 Brooklyn    NY 40.70354
## 5 7  30566079        1 Main St APT 3B   11201 Brooklyn    NY 40.70354
## 6 8  68312584 70 Washington St APT 8D   11201 Brooklyn    NY 40.70208
##        long      region_name region_id         type zestimate
## 1 -73.99640 Brooklyn Heights    403122 neighborhood   4903925
## 2 -73.98628            DUMBO    270841 neighborhood   2062475
## 3 -73.99587 Brooklyn Heights    403122 neighborhood   2376619
## 4 -73.99023            DUMBO    270841 neighborhood   3321319
## 5 -73.99023            DUMBO    270841 neighborhood   4405368
## 6 -73.98994            DUMBO    270841 neighborhood   2351179
##   zest_lastupdated zest_monthlychange zest_percentile zestimate_low
## 1        12/1/2019              52400              97       4413533
## 2        12/1/2019              -6919              83       1918102
## 3        12/1/2019              -8769              88       2186489
## 4        12/1/2019             -13587              95       3055613
## 5        12/1/2019             -18049              97       4052939
## 6        12/1/2019              -9633              88       2163085
##   zestimate_high compscore bathrooms bedrooms finishedSqFt lastSoldDate
## 1        5443357         5         4        3         4163     7/2/2019
## 2        2206848        11         3        2         1451     7/9/2019
## 3        2566749        11         1        1         1633    4/16/2019
## 4        3620238         4       2.5        2         2198    1/11/2019
## 5        4801851         3       3.5        3         2592    1/10/2019
## 6        2562785        10         3        3         1700    1/10/2019
##   lastSoldPrice lotSizeSqFt taxAssessment taxAssessmentYear totalRooms
## 1       4291240          NA       1078954              2018         NA
## 2       2078000          NA        248039              2018          4
## 3       2400000          NA        425741              2018         NA
## 4       3355000          NA        302233              2018         NA
## 5       4450000          NA        466341              2018         NA
## 6       2375000          NA        372245              2018         NA
##   yearBuilt  ref_long  ref_lat distance
## 1      2015 -73.98961 40.75047 3.470907
## 2      2016 -73.98961 40.75047 3.252032
## 3      2015 -73.98961 40.75047 3.391957
## 4      1913 -73.98961 40.75047 3.238517
## 5      1916 -73.98961 40.75047 3.238517
## 6      1916 -73.98961 40.75047 3.338864

zclean1<-filter(zclean1,zipcode %in% c("11212", "11216", "11233", "11238"))

zclean1$zipcode<-as.factor(zclean1$zipcode)  
plot5<-ggplot(data = zclean1,aes(x = zipcode, y = zestimate)) + 
  geom_point() +
  labs(x = "zip", y = "price")

plot5

Plot importance

From the above distribution, we can see that property prices are highest in zip code 11238 compared to the other areas. This zip code also has more properties than the others; one possible reason is that it is a preferred and safe residential area, which pushes prices up.

ML Procedure

modeldf <- zclean 
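# bedrooms and bathrooms were converted to factors for plotting; go through
# as.character() to recover the original values rather than the level codes
modeldf$bathrooms<-as.numeric(as.character(modeldf$bathrooms))
modeldf$bedrooms<-as.numeric(as.character(modeldf$bedrooms))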

modeldf <- modeldf %>% mutate_if(., is.numeric, ~replace(., is.na(.), 0))
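
Because bedrooms and bathrooms were turned into factors for the plots, the conversion back to numeric deserves some care: calling as.numeric() directly on a factor returns its internal level codes, while going through as.character() first returns the original values. A minimal illustration on made-up values (the vector `baths` is hypothetical):

```r
# A factor built from numeric values; its levels are "0.5", "1", "2.5"
baths <- factor(c(1, 2.5, 0.5, 2.5))

# Direct conversion returns the position of each value among the levels
codes <- as.numeric(baths)

# Converting through character returns the values themselves
values <- as.numeric(as.character(baths))

data.frame(baths, codes, values)
```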

Splitting data into training and testing sets

set.seed(123)
training.samples <- modeldf$zestimate %>%createDataPartition(p = 0.8, list = FALSE)
train.data  <- modeldf[training.samples, ]
test.data <- modeldf[-training.samples, ]
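
For a numeric outcome, caret's documentation describes createDataPartition as grouping the outcome by percentiles and sampling within those groups, so that the training and test sets cover a similar range of prices. The base-R sketch below imitates that idea on simulated prices. It is an approximation for illustration, not caret's actual implementation, and `y`, `bins` and `train_idx` are hypothetical names.

```r
set.seed(123)

# Simulated, right-skewed prices standing in for zestimate
y <- rlnorm(1000, meanlog = 13.8, sdlog = 0.5)

# Bin the outcome into 5 groups using its quintiles
breaks <- quantile(y, probs = seq(0, 1, length.out = 6))
bins <- cut(y, breaks = breaks, include.lowest = TRUE)

# Sample 80% of the row indices inside each bin
train_idx <- unlist(lapply(split(seq_along(y), bins), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Both parts should have similar medians because of the stratification
c(train = median(train_y), test = median(test_y))
```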

Linear Regression

lm_model1 <- lm(zestimate ~ zestimate_low+zestimate_high+zest_monthlychange+zest_percentile+compscore+bedrooms+bathrooms+
                  finishedSqFt+lastSoldPrice+lotSizeSqFt , data = train.data)

predictions <- lm_model1 %>% predict(test.data)
data.frame(
  R2 = R2(predictions, test.data$zestimate)
)

##          R2
## 1 0.9996352

As seen from the above result, the value of R2 is approximately 1. Such a near-perfect fit is suspicious: it can arise when some predictors are almost direct proxies for the target, and high correlations among predictor variables also lead to unreliable and unstable estimates of the regression coefficients. So next we check for multicollinearity in order to remove such features from our analysis.

Checking for multicollinearity

car::vif(lm_model1)

##      zestimate_low     zestimate_high zest_monthlychange 
##          64.036310          60.173868           1.043842 
##    zest_percentile          compscore           bedrooms 
##           1.985705           1.056540           2.645083 
##          bathrooms       finishedSqFt      lastSoldPrice 
##           2.953741           1.146685           2.755964 
##        lotSizeSqFt 
##           1.144532

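Looking at the above output, we can see that the features zestimate_low and zestimate_high show high VIF (Variance Inflation Factor) values, about 64 and 60, while every other predictor is below 3. These two features are the lower and upper bounds of the Zestimate, so they are strongly related to each other and to the value we are predicting. Hence, we remove these 2 features and use the remaining features for modeling and prediction.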
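
To make the VIF values more concrete: for predictor j, VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing predictor j on all the other predictors. The sketch below computes this by hand on simulated data with two nearly identical predictors. The variables `x1`, `x2`, `x3` are made up; for a model with only numeric predictors, the numbers should agree with what car::vif reports.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # almost a copy of x1, so highly collinear
x3 <- rnorm(n)                 # unrelated to the other two
X  <- data.frame(x1, x2, x3)

# Regress each predictor on the others and convert the R2 into a VIF
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

In this simulation x1 and x2 get very large VIFs while x3 stays near 1, which mirrors the pattern of zestimate_low and zestimate_high against the other features.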

Re-running the linear regression model with updated feature list:

lm_model1 <- lm(zestimate ~ zest_monthlychange+zest_percentile+compscore+bedrooms+bathrooms+finishedSqFt+lastSoldPrice+lotSizeSqFt , data = train.data)
# Make predictions
predictions <- lm_model1 %>% predict(test.data)
# Model performance
data.frame(
  R2 = R2(predictions, test.data$zestimate)
)

##          R2
## 1 0.7321444

From the above result, we can see that R2 drops from about 0.9996 to about 0.73 after removing the features that caused multicollinearity. We use the same features for the H2O models that follow.
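
One caveat when reading these R2 values: caret's R2() function defaults to formula = "corr", which is the squared correlation between predictions and observations, while formula = "traditional" gives 1 - SSE/SST. The two can differ when predictions are systematically biased. A small made-up example (the vectors `obs` and `pred` are hypothetical):

```r
obs  <- c(100, 200, 300, 400, 500)
pred <- c(150, 250, 350, 450, 550)  # every prediction is 50 too high

# Squared correlation ignores the constant bias
r2_corr <- cor(obs, pred)^2

# 1 - SSE/SST penalises the bias
r2_trad <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

c(corr = r2_corr, traditional = r2_trad)
```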

Data Setup for H2O

Zmodeldf <- modeldf[,c("zest_monthlychange","zest_percentile",
                       "compscore","bedrooms","bathrooms","finishedSqFt",
                       "lastSoldPrice","lotSizeSqFt","zestimate")]

set.seed(123)
training.samples <- Zmodeldf$zestimate %>%createDataPartition(p = 0.8, list = FALSE)
train.data  <- Zmodeldf[training.samples, ]
test.data <- Zmodeldf[-training.samples, ]

Creating H2O dataframe

kd_h2o<-h2o.init(nthreads = -1,max_mem_size = "16g")

## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\RANJAN~1\AppData\Local\Temp\Rtmp0ca7Ct/h2o_Ranjani_Krishna_started_from_r.out
##     C:\Users\RANJAN~1\AppData\Local\Temp\Rtmp0ca7Ct/h2o_Ranjani_Krishna_started_from_r.err
## 
## 
## Starting H2O JVM and connecting:  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         3 seconds 376 milliseconds 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.26.0.2 
##     H2O cluster version age:    4 months and 12 days !!! 
##     H2O cluster name:           H2O_started_from_R_Ranjani_Krishna_hzp306 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   16.00 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.5.3 (2019-03-11)

install.packages("https://h2o-release.s3.amazonaws.com/h2o-ensemble/R/h2oEnsemble_0.1.8.tar.gz", repos = NULL)
library(h2oEnsemble)

data_h2o <- as.h2o(
  train.data,
  destination_frame= "train.hex"
)

new_data_h2o <- as.h2o(
  test.data,
  destination_frame= "test.hex" 
)

splits <- h2o.splitFrame(data = data_h2o,
                         ratios = c(0.7, 0.15), 
                         seed = 1234)
train_h2o <- splits[[1]] 
valid_h2o <- splits[[2]] 
test_h2o <- splits[[3]]
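
The ratios c(0.7, 0.15) request roughly 70% of rows for training and 15% for validation, with the remainder (about 15%) going to the third frame. The H2O documentation notes that splitFrame uses probabilistic splitting rather than an exact split, so the resulting sizes are only approximately in those proportions. The base-R sketch below imitates that behaviour with one random draw per row. It illustrates the idea only and is not H2O's implementation.

```r
set.seed(1234)
n <- 10000

# One uniform draw per row decides which part the row falls into
u <- runif(n)
part <- cut(u,
            breaks = c(0, 0.70, 0.85, 1),
            labels = c("train", "valid", "test"),
            include.lowest = TRUE)

# Observed shares are close to 0.70 / 0.15 / 0.15 but not exact
split_share <- prop.table(table(part))
round(split_share, 3)
```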

y <- "zestimate" 
x <- setdiff(names(train_h2o), y)

GLM : Generalized Linear Model

glm <- h2o.glm(family= "gaussian", x= x, y=y, training_frame=train_h2o, lambda = 0, compute_p_values = TRUE)

Summary of GLM Model:

h2o.performance(glm, newdata = test_h2o)

## H2ORegressionMetrics: glm
## 
## MSE:  110397214797
## RMSE:  332260.8
## MAE:  160639.1
## RMSLE:  0.2483794
## Mean Residual Deviance :  110397214797
## R^2 :  0.7604126
## Null Deviance :1000819984321746
## Null D.o.F. :2171
## Residual Deviance :239782750539539
## Residual D.o.F. :2163
## AIC :61412.07
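
The metrics printed above are related by simple arithmetic, which can be checked from the reported numbers. The sketch assumes that the test frame has 2,172 rows (null degrees of freedom plus one). The R2 recomputed from the deviances matches the reported value only approximately, which suggests that H2O computes it in a slightly different way.

```r
# Values copied from the GLM test-set metrics above
mse       <- 110397214797
resid_dev <- 239782750539539
null_dev  <- 1000819984321746
n_test    <- 2171 + 1  # assumption: rows = null degrees of freedom + 1

# RMSE is the square root of MSE
rmse <- sqrt(mse)

# For a gaussian model, mean residual deviance is residual deviance per row
mean_resid_dev <- resid_dev / n_test

# Share of the null deviance explained by the model
r2_from_dev <- 1 - resid_dev / null_dev

c(rmse = rmse, mean_resid_dev = mean_resid_dev, r2_from_dev = r2_from_dev)
```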

summary(glm)

## Model Details:
## ==============
## 
## H2ORegressionModel: glm
## Model Key:  GLM_model_R_1575930910406_1 
## GLM Model: summary
##     family     link regularization number_of_predictors_total
## 1 gaussian identity           None                          8
##   number_of_active_predictors number_of_iterations  training_frame
## 1                           8                    1 RTMP_sid_9972_3
## 
## H2ORegressionMetrics: glm
## ** Reported on training data. **
## 
## MSE:  76018393414
## RMSE:  275714.3
## MAE:  150023.9
## RMSLE:  NaN
## Mean Residual Deviance :  76018393414
## R^2 :  0.818125
## Null Deviance :4328083861864295
## Null D.o.F. :10354
## Residual Deviance :787170463805174
## Residual D.o.F. :10346
## AIC :288842.9
## 
## 
## 
## 
## 
## Scoring History: 
##             timestamp   duration iterations negative_log_likelihood
## 1 2019-12-09 17:35:23  0.000 sec          0  4328083861864379.00000
##            objective
## 1 417970435718.43402
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
##             variable relative_importance scaled_importance  percentage
## 1      lastSoldPrice          453205.151        1.00000000 0.580899738
## 2    zest_percentile           83059.522        0.18327136 0.106462282
## 3          bathrooms           78925.904        0.17415050 0.101163980
## 4           bedrooms           55847.410        0.12322766 0.071582915
## 5 zest_monthlychange           46955.230        0.10360701 0.060185284
## 6          compscore           38903.608        0.08584105 0.049865045
## 7        lotSizeSqFt           16484.249        0.03637260 0.021128833
## 8       finishedSqFt            6796.849        0.01499729 0.008711922
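
The scaled_importance and percentage columns in the table above appear to be simple rescalings of relative_importance: the first divides by the largest value, the second by the total. The check below reproduces both from the printed numbers.

```r
# relative_importance values copied from the table above
rel_imp <- c(
  lastSoldPrice      = 453205.151,
  zest_percentile    = 83059.522,
  bathrooms          = 78925.904,
  bedrooms           = 55847.410,
  zest_monthlychange = 46955.230,
  compscore          = 38903.608,
  lotSizeSqFt        = 16484.249,
  finishedSqFt       = 6796.849
)

# Scale by the largest importance, and by the sum of all importances
scaled_imp <- rel_imp / max(rel_imp)
pct_imp    <- rel_imp / sum(rel_imp)

round(cbind(scaled_imp, pct_imp), 4)
```

Read this way, lastSoldPrice alone accounts for about 58% of the total importance in this GLM.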

Random forest - 5-Fold Cross-validation

rf <- h2o.randomForest(x = x,
                          y = y,
                          training_frame = train_h2o,
                          ntrees = 50,
                          nfolds = 5,
                          fold_assignment = "Modulo",
                          keep_cross_validation_predictions = TRUE,
                          seed = 1)
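
With fold_assignment = "Modulo", H2O assigns rows to folds deterministically in a round-robin fashion based on the row number, so the folds have nearly equal sizes and do not depend on the seed. The sketch below shows the idea for 12 rows and 5 folds; the zero-based row numbering is an assumption made for illustration.

```r
nfolds <- 5
row_id <- 0:11  # assumption: rows numbered from 0

# Each row goes to the fold given by its row number modulo nfolds
fold <- row_id %% nfolds

data.frame(row_id, fold)
table(fold)  # fold sizes differ by at most one row
```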

Summary of Random Forest Model:

h2o.performance(rf, newdata = test_h2o)

## H2ORegressionMetrics: drf
## 
## MSE:  21480774857
## RMSE:  146563.2
## MAE:  20618.69
## RMSLE:  0.05751715
## Mean Residual Deviance :  21480774857
## R^2 :  0.9533818

summary(rf)

## Model Details:
## ==============
## 
## H2ORegressionModel: drf
## Model Key:  DRF_model_R_1575930910406_2 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1              50                       50              655699        20
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1        20   20.00000        877       1224  1038.24000
## 
## H2ORegressionMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
## 
## MSE:  6321015409
## RMSE:  79504.81
## MAE:  19316.91
## RMSLE:  0.06159496
## Mean Residual Deviance :  6321015409
## R^2 :  0.9848769
## 
## 
## 
## H2ORegressionMetrics: drf
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  6925395930
## RMSE:  83218.96
## MAE:  19948.72
## RMSLE:  0.06362259
## Mean Residual Deviance :  6925395930
## R^2 :  0.9834309
## 
## 
## Cross-Validation Metrics Summary: 
##                              mean           sd  cv_1_valid  cv_2_valid
## mae                      19948.72     783.6792   19904.775   21254.818
## mean_residual_deviance 6.925396E9 1.32036314E9 5.1548047E9 8.6435461E9
## mse                    6.925396E9 1.32036314E9 5.1548047E9 8.6435461E9
## r2                     0.98336756 0.0032172317   0.9874468  0.97951263
## residual_deviance      6.925396E9 1.32036314E9 5.1548047E9 8.6435461E9
## rmse                     82426.24    8102.8013    71796.97    92970.67
## rmsle                  0.06334415 0.0042043244  0.06040833  0.06799258
##                         cv_3_valid  cv_4_valid  cv_5_valid
## mae                      20203.295   17921.762   20458.953
## mean_residual_deviance   9.21263E9 4.4630144E9 7.1529846E9
## mse                      9.21263E9 4.4630144E9 7.1529846E9
## r2                       0.9781843   0.9898994  0.98179466
## residual_deviance        9.21263E9 4.4630144E9 7.1529846E9
## rmse                     95982.445     66805.8    84575.32
## rmsle                  0.053334225  0.06505199  0.06993362
## 
## Scoring History: 
##             timestamp   duration number_of_trees training_rmse
## 1 2019-12-09 17:35:31  4.899 sec               0            NA
## 2 2019-12-09 17:35:31  4.931 sec               1  159178.42101
## 3 2019-12-09 17:35:31  4.958 sec               2  131413.82523
## 4 2019-12-09 17:35:31  4.990 sec               3  114932.15584
## 5 2019-12-09 17:35:31  5.019 sec               4  114006.43832
##   training_mae training_deviance
## 1           NA                NA
## 2  27712.46049 25337769715.05450
## 3  24840.05331 17269593461.08600
## 4  24849.21345 13209400445.23510
## 5  23899.69515 12997467978.91330
## 
## ---
##              timestamp   duration number_of_trees training_rmse
## 46 2019-12-09 17:35:32  5.702 sec              45   79164.89838
## 47 2019-12-09 17:35:32  5.718 sec              46   79208.66212
## 48 2019-12-09 17:35:32  5.734 sec              47   78990.18627
## 49 2019-12-09 17:35:32  5.749 sec              48   78918.56105
## 50 2019-12-09 17:35:32  5.764 sec              49   79298.08070
## 51 2019-12-09 17:35:32  5.779 sec              50   79504.81375
##    training_mae training_deviance
## 46  19215.32893  6267081135.49112
## 47  19193.52363  6274012155.13533
## 48  19162.30714  6239449527.08160
## 49  19086.95632  6228139277.74415
## 50  19374.80080  6288185603.14960
## 51  19316.91135  6321015408.70604
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
## Variable Importances: 
##             variable      relative_importance scaled_importance percentage
## 1      lastSoldPrice 60314599399882752.000000          1.000000   0.364607
## 2    zest_percentile 41767354362757120.000000          0.692492   0.252487
## 3          bathrooms 16725414399442944.000000          0.277303   0.101107
## 4           bedrooms 16646272110821376.000000          0.275991   0.100628
## 5 zest_monthlychange 13637810728730624.000000          0.226111   0.082442
## 6       finishedSqFt 13586902481371136.000000          0.225267   0.082134
## 7        lotSizeSqFt  1419359184486400.000000          0.023533   0.008580
## 8          compscore  1325963778457600.000000          0.021984   0.008016

GBM

gbm <- h2o.gbm(x = x, y = y, training_frame = train_h2o)

Summary of GBM Model:

h2o.performance(gbm, newdata = valid_h2o)

## H2ORegressionMetrics: gbm
## 
## MSE:  7833179156
## RMSE:  88505.25
## MAE:  30298.04
## RMSLE:  0.07396035
## Mean Residual Deviance :  7833179156
## R^2 :  0.9806906

summary(gbm)

## Model Details:
## ==============
## 
## H2ORegressionModel: gbm
## Model Key:  GBM_model_R_1575930910406_3 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1              50                       50               19678         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000          9         32    26.62000
## 
## H2ORegressionMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  4634218128
## RMSE:  68075.09
## MAE:  27885.35
## RMSLE:  0.06843786
## Mean Residual Deviance :  4634218128
## R^2 :  0.9889126
## 
## 
## 
## 
## 
## Scoring History: 
##             timestamp   duration number_of_trees training_rmse
## 1 2019-12-09 17:35:33  0.008 sec               0  646506.33076
## 2 2019-12-09 17:35:33  0.097 sec               1  586284.50554
## 3 2019-12-09 17:35:33  0.116 sec               2  532849.96123
## 4 2019-12-09 17:35:33  0.134 sec               3  485256.66403
## 5 2019-12-09 17:35:33  0.151 sec               4  442673.67807
##   training_mae  training_deviance
## 1 448967.34444 417970435718.43402
## 2 406178.52896 343729521433.18701
## 3 367961.35604 283929081181.36298
## 4 333771.33120 235474029989.16101
## 5 302862.50200 195959985253.84900
## 
## ---
##              timestamp   duration number_of_trees training_rmse
## 46 2019-12-09 17:35:34  0.767 sec              45   70579.90250
## 47 2019-12-09 17:35:34  0.773 sec              46   69816.20037
## 48 2019-12-09 17:35:34  0.778 sec              47   69520.07912
## 49 2019-12-09 17:35:34  0.783 sec              48   69035.93236
## 50 2019-12-09 17:35:34  0.788 sec              49   68696.22288
## 51 2019-12-09 17:35:34  0.793 sec              50   68075.09184
##    training_mae training_deviance
## 46  29181.00324  4981522637.39805
## 47  28843.56035  4874301834.58284
## 48  28634.12138  4833041400.67046
## 49  28308.89244  4765959957.29330
## 50  28173.78544  4719171037.62527
## 51  27885.35004  4634218128.39442
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
## Variable Importances: 
##             variable      relative_importance scaled_importance percentage
## 1      lastSoldPrice 17210850930589696.000000          1.000000   0.764016
## 2    zest_percentile  3733240066080768.000000          0.216912   0.165724
## 3 zest_monthlychange   480851754221568.000000          0.027939   0.021346
## 4          bathrooms   421534866866176.000000          0.024492   0.018713
## 5       finishedSqFt   385460731904000.000000          0.022396   0.017111
## 6           bedrooms   141153043218432.000000          0.008201   0.006266
## 7          compscore    96304801775616.000000          0.005596   0.004275
## 8        lotSizeSqFt    57428733329408.000000          0.003337   0.002549
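
The scaled_importance and percentage columns above are simple transformations of relative_importance. The sketch below rebuilds them from the printed GBM values; in a live session the same table would come from `h2o.varimp(gbm)`, as the output itself notes, so the hand-typed data frame here is only for illustration.

```r
# relative_importance values copied from the GBM variable importance table above
varimp <- data.frame(
  variable = c("lastSoldPrice", "zest_percentile", "zest_monthlychange",
               "bathrooms", "finishedSqFt", "bedrooms", "compscore",
               "lotSizeSqFt"),
  relative_importance = c(17210850930589696, 3733240066080768,
                          480851754221568, 421534866866176,
                          385460731904000, 141153043218432,
                          96304801775616, 57428733329408)
)

# scaled_importance: each value divided by the largest one
varimp$scaled_importance <- varimp$relative_importance /
  max(varimp$relative_importance)

# percentage: each value divided by the total, so the column sums to 1
varimp$percentage <- varimp$relative_importance /
  sum(varimp$relative_importance)

print(varimp[, c("variable", "scaled_importance", "percentage")])
```

Read this way, lastSoldPrice and zest_percentile together account for about 93% of the importance in the default GBM.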

GBM with parameters

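# GBM with a small learning rate, a large tree budget, and early stopping
gbm2 <- h2o.gbm(
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o,   # early stopping is evaluated on this frame
  ntrees = 10000,                 # upper bound; early stopping sets the actual count
  learn_rate = 0.01,
  stopping_rounds = 5,
  stopping_tolerance = 1e-4,
  stopping_metric = "deviance",
  sample_rate = 0.8,              # row sampling per tree
  col_sample_rate = 0.8,          # column sampling per split
  seed = 1234,
  score_tree_interval = 10        # score every 10 trees
)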
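
The stopping_rounds, stopping_tolerance and stopping_metric arguments together define an early-stopping rule. The sketch below is a simplified approximation of such a rule based on moving averages of the metric; it is not H2O's exact implementation, and the function name and the example histories are made up for illustration.

```r
# Simplified early-stopping check (an approximation, not H2O's exact code):
# compare the best recent moving average of the metric with an earlier one.
should_stop <- function(metric, k = 5, tol = 1e-4) {
  n <- length(metric)
  if (n < 2 * k) return(FALSE)            # not enough scoring events yet
  # simple moving averages of length k over the scoring history
  sma <- sapply(k:n, function(i) mean(metric[(i - k + 1):i]))
  m <- length(sma)
  recent_best <- min(sma[(m - k + 1):m])  # best of the last k averages
  reference   <- sma[m - k]               # average just before that window
  # lower is better for deviance: stop if the relative gain is below tol
  recent_best > reference * (1 - tol)
}

# Hypothetical validation-deviance histories (made-up numbers)
improving <- 1000 / (1:12)
plateau   <- c(1000 / (1:4), rep(250, 10))

print(should_stop(improving))
print(should_stop(plateau))
```

With score_tree_interval = 10, each scoring event corresponds to 10 trees, so a model that ends at 7,990 trees instead of the 10,000 allowed is consistent with a rule of this kind having triggered.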

Summary of GBM with Parameters Model:

h2o.performance(gbm2, newdata = valid_h2o)

## H2ORegressionMetrics: gbm
## 
## MSE:  4462703024
## RMSE:  66803.47
## MAE:  13315.93
## RMSLE:  0.04603479
## Mean Residual Deviance :  4462703024
## R^2 :  0.9889991

summary(gbm2)

## Model Details:
## ==============
## 
## H2ORegressionModel: gbm
## Model Key:  GBM_model_R_1575930910406_4 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1            7990                     7990             2191678         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000          6         32    17.00976
## 
## H2ORegressionMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  464115103
## RMSE:  21543.33
## MAE:  8174.98
## RMSLE:  0.03078964
## Mean Residual Deviance :  464115103
## R^2 :  0.9988896
## 
## 
## H2ORegressionMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  4462703024
## RMSE:  66803.47
## MAE:  13315.93
## RMSLE:  0.04603479
## Mean Residual Deviance :  4462703024
## R^2 :  0.9889991
## 
## 
## 
## 
## Scoring History: 
##             timestamp   duration number_of_trees training_rmse
## 1 2019-12-09 17:35:35  0.005 sec               0  646506.33076
## 2 2019-12-09 17:35:35  0.093 sec              10  589353.76774
## 3 2019-12-09 17:35:35  0.176 sec              20  538040.45504
## 4 2019-12-09 17:35:35  0.262 sec              30  491783.48845
## 5 2019-12-09 17:35:36  0.340 sec              40  449778.66465
##   training_mae  training_deviance validation_rmse validation_mae
## 1 448967.34444 417970435718.43402    637050.85923   450053.29183
## 2 408059.45012 347337863546.77899    580371.91810   409206.47595
## 3 371341.22583 289487531257.14899    529490.63470   372597.02335
## 4 338278.54163 241850999509.10001    483573.97607   339700.18971
## 5 308463.40420 202300847174.01501    442087.84020   310076.40503
##   validation_deviance
## 1  405833797241.62299
## 2  336831563324.61298
## 3  280360332230.79102
## 4  233843790330.26300
## 5  195441658448.45801
## 
## ---
##               timestamp          duration number_of_trees training_rmse
## 795 2019-12-09 17:36:35  1 min  0.101 sec            7940   21619.63196
## 796 2019-12-09 17:36:35  1 min  0.168 sec            7950   21604.29902
## 797 2019-12-09 17:36:35  1 min  0.233 sec            7960   21589.35840
## 798 2019-12-09 17:36:35  1 min  0.296 sec            7970   21568.35333
## 799 2019-12-09 17:36:36  1 min  0.365 sec            7980   21558.19991
## 800 2019-12-09 17:36:36  1 min  0.433 sec            7990   21543.33084
##     training_mae training_deviance validation_rmse validation_mae
## 795   8209.12485   467408485.95397     66801.34474    13340.36944
## 796   8200.83489   466745736.20928     66805.07833    13335.90246
## 797   8193.25902   466100396.07704     66812.79875    13329.59628
## 798   8186.15375   465193865.33732     66809.72601    13321.35402
## 799   8181.86634   464755983.38129     66799.23369    13319.46667
## 800   8174.97979   464115103.48381     66803.46565    13315.92922
##     validation_deviance
## 795    4462419658.85476
## 796    4462918490.36703
## 797    4463950076.67758
## 798    4463539489.09355
## 799    4462137621.82529
## 800    4462703023.50955
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
## Variable Importances: 
##             variable       relative_importance scaled_importance percentage
## 1      lastSoldPrice 132569525710225408.000000          1.000000   0.673583
## 2    zest_percentile  45626773320237056.000000          0.344172   0.231829
## 3          bathrooms   7427879113588736.000000          0.056030   0.037741
## 4 zest_monthlychange   4609099593416704.000000          0.034767   0.023419
## 5       finishedSqFt   2975650247868416.000000          0.022446   0.015119
## 6           bedrooms   2120657550704640.000000          0.015997   0.010775
## 7          compscore   1022267580481536.000000          0.007711   0.005194
## 8        lotSizeSqFt    460563469565952.000000          0.003474   0.002340
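
To put the two GBM fits side by side, the sketch below collects the training and validation metrics printed above into one small table. The data frame and its column names are illustrative; the numbers are copied from the outputs of `gbm` and `gbm2`.

```r
# Metrics copied from the summaries of gbm and gbm2 (validation frame: valid_h2o)
metrics <- data.frame(
  model      = c("gbm", "gbm2"),
  train_rmse = c(68075.09, 21543.33),
  valid_rmse = c(88505.25, 66803.47),
  valid_mae  = c(30298.04, 13315.93)
)

# Ratio of validation to training RMSE: larger values mean a wider
# gap between fit on seen and unseen data
metrics$gap_ratio <- metrics$valid_rmse / metrics$train_rmse

# Relative reduction in validation RMSE from gbm to gbm2
improvement <- 1 - metrics$valid_rmse[2] / metrics$valid_rmse[1]

print(metrics)
print(round(improvement, 3))
```

The tuned model lowers validation RMSE by roughly a quarter, but its validation error is about three times its training error, so part of the extra training fit does not carry over to unseen data.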

Deep Learning

DeepLearning: Model 1

m1 <- h2o.deeplearning(
  model_id = "dl_model_first",
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o, 
  epochs = 10
)

Summary of Simple Deep Learning model:

h2o.performance(m1, newdata = valid_h2o)

## H2ORegressionMetrics: deeplearning
## 
## MSE:  161921026841
## RMSE:  402394.1
## MAE:  122740.1
## RMSLE:  0.1789928
## Mean Residual Deviance :  161921026841
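
Unlike the tree models, this output does not print R^2. Since the GBM models were scored on the same valid_h2o frame, the variance of the target can be backed out from their metrics and reused here. This is a sketch that assumes R^2 is defined as 1 - MSE / Var(y) on the scored frame; the variable names are illustrative.

```r
# Default GBM on valid_h2o (values copied from its performance output)
gbm_mse <- 7833179156
gbm_r2  <- 0.9806906

# R^2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R^2)
target_var <- gbm_mse / (1 - gbm_r2)

# Deep learning model 1 on the same frame
m1_mse <- 161921026841
m1_r2  <- 1 - m1_mse / target_var

print(round(m1_r2, 3))
```

The result comes out at about 0.60, in line with the validation_r2 of 0.60085 that the scoring history in `summary(m1)` reports for the same model.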

summary(m1)

## Model Details:
## ==============
## 
## H2ORegressionModel: deeplearning
## Model Key:  dl_model_first 
## Status of Neuron Layers: predicting zestimate, regression, gaussian distribution, Quadratic loss, 42,201 weights/biases, 503.8 KB, 103,550 training samples, mini-batch size 1
##   layer units      type dropout       l1       l2 mean_rate rate_rms
## 1     1     8     Input  0.00 %       NA       NA        NA       NA
## 2     2   200 Rectifier  0.00 % 0.000000 0.000000  0.014919 0.030916
## 3     3   200 Rectifier  0.00 % 0.000000 0.000000  0.074538 0.152987
## 4     4     1    Linear      NA 0.000000 0.000000  0.005602 0.069532
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1       NA          NA         NA        NA       NA
## 2 0.000000    0.005605   0.105590  0.482493 0.018734
## 3 0.000000   -0.011055   0.072601  0.981213 0.012322
## 4 0.000000    0.009243   0.081288 -0.001783 0.000000
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 9960 samples **
## 
## MSE:  58737517116
## RMSE:  242358.2
## MAE:  117928.8
## RMSLE:  0.1654791
## Mean Residual Deviance :  58737517116
## 
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on full validation frame **
## 
## MSE:  161921026841
## RMSE:  402394.1
## MAE:  122740.1
## RMSLE:  0.1789928
## Mean Residual Deviance :  161921026841
## 
## 
## 
## 
## Scoring History: 
##             timestamp   duration training_speed   epochs iterations
## 1 2019-12-09 17:36:46  0.000 sec             NA  0.00000          0
## 2 2019-12-09 17:36:47  1.500 sec   9270 obs/sec  1.00000          1
## 3 2019-12-09 17:36:51  5.179 sec  22379 obs/sec 10.00000         10
## 4 2019-12-09 17:36:51  5.348 sec  22355 obs/sec 10.00000         10
##         samples training_rmse training_deviance training_mae training_r2
## 1      0.000000            NA                NA           NA          NA
## 2  10355.000000  242358.24128 58737517116.42230 117928.77556     0.86023
## 3 103550.000000  156107.63355 24369593252.66580  66020.08124     0.94201
## 4 103550.000000  242358.24128 58737517116.42230 117928.77556     0.86023
##   validation_rmse validation_deviance validation_mae validation_r2
## 1              NA                  NA             NA            NA
## 2    402394.11880  161921026841.07401   122740.09528       0.60085
## 3    518178.87052  268509341855.56799    76498.22400       0.33810
## 4    402394.11880  161921026841.07401   122740.09528       0.60085
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
## Variable Importances: 
##             variable relative_importance scaled_importance percentage
## 1      lastSoldPrice            1.000000          1.000000   0.138576
## 2           bedrooms            0.936130          0.936130   0.129725
## 3    zest_percentile            0.927775          0.927775   0.128568
## 4       finishedSqFt            0.908388          0.908388   0.125881
## 5 zest_monthlychange            0.871765          0.871765   0.120806
## 6        lotSizeSqFt            0.865988          0.865988   0.120005
## 7          compscore            0.860029          0.860029   0.119180
## 8          bathrooms            0.846168          0.846168   0.117259

DeepLearning: Model 2

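m2 <- h2o.deeplearning(
  model_id = "dl_model_faster",
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  hidden = c(32, 32, 32),            # three hidden layers of 32 units each
  epochs = 1000000,                  # effectively unbounded; early stopping ends training
  score_validation_samples = 10000,  # validation rows used for scoring
  stopping_metric = "deviance",
  stopping_rounds = 2,
  stopping_tolerance = 0.01          # require a 1% relative improvement
)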

Summary of Deep Learning model with parameters:

h2o.performance(m2, newdata = valid_h2o)

## H2ORegressionMetrics: deeplearning
## 
## MSE:  5566387605
## RMSE:  74608.23
## MAE:  34382.73
## RMSLE:  0.06439333
## Mean Residual Deviance :  5566387605

summary(m2)

## Model Details:
## ==============
## 
## H2ORegressionModel: deeplearning
## Model Key:  dl_model_faster 
## Status of Neuron Layers: predicting zestimate, regression, gaussian distribution, Quadratic loss, 2,433 weights/biases, 34.8 KB, 5,700,307 training samples, mini-batch size 1
##   layer units      type dropout       l1       l2 mean_rate rate_rms
## 1     1     8     Input  0.00 %       NA       NA        NA       NA
## 2     2    32 Rectifier  0.00 % 0.000000 0.000000  0.014140 0.040262
## 3     3    32 Rectifier  0.00 % 0.000000 0.000000  0.008040 0.009206
## 4     4    32 Rectifier  0.00 % 0.000000 0.000000  0.011672 0.031162
## 5     5     1    Linear      NA 0.000000 0.000000  0.002173 0.002203
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1       NA          NA         NA        NA       NA
## 2 0.000000   -0.054549   0.572098  0.429086 0.260124
## 3 0.000000   -0.028055   0.338395  1.065636 0.176645
## 4 0.000000   -0.061278   0.389261  0.979380 0.184730
## 5 0.000000    0.100799   0.508021 -0.115427 0.000000
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 10039 samples **
## 
## MSE:  4893882614
## RMSE:  69956.29
## MAE:  31373.65
## RMSLE:  0.05667772
## Mean Residual Deviance :  4893882614
## 
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on full validation frame **
## 
## MSE:  5566387605
## RMSE:  74608.23
## MAE:  34382.73
## RMSLE:  0.06439333
## Mean Residual Deviance :  5566387605
## 
## 
## 
## 
## Scoring History: 
##              timestamp   duration training_speed    epochs iterations
## 1  2019-12-09 17:36:52  0.000 sec             NA   0.00000          0
## 2  2019-12-09 17:36:53  1.186 sec  90038 obs/sec   9.65167          1
## 3  2019-12-09 17:36:58  6.362 sec 111730 obs/sec  67.57779          7
## 4  2019-12-09 17:37:04 11.688 sec 103754 obs/sec 115.86818         12
## 5  2019-12-09 17:37:09 16.740 sec 102456 obs/sec 164.15838         17
## 6  2019-12-09 17:37:14 22.623 sec 102464 obs/sec 222.12738         23
## 7  2019-12-09 17:37:20 28.207 sec 103546 obs/sec 280.08054         29
## 8  2019-12-09 17:37:26 33.722 sec 104483 obs/sec 338.01989         35
## 9  2019-12-09 17:37:31 39.298 sec 104994 obs/sec 395.96736         41
## 10 2019-12-09 17:37:37 45.247 sec 104508 obs/sec 453.90497         47
## 11 2019-12-09 17:37:43 51.131 sec 102317 obs/sec 502.22231         52
## 12 2019-12-09 17:37:49 56.834 sec 100893 obs/sec 550.48836         57
## 13 2019-12-09 17:37:49 56.873 sec 100883 obs/sec 550.48836         57
##           samples training_rmse training_deviance training_mae training_r2
## 1        0.000000            NA                NA           NA          NA
## 2    99943.000000  185013.12360 34229855902.43820  75405.92330     0.91863
## 3   699768.000000  143728.47792 20657875364.59090  52210.24068     0.95089
## 4  1199815.000000   89761.45262  8057118375.63268  36729.50837     0.98085
## 5  1699860.000000   85179.68455  7255578660.61776  34267.42185     0.98275
## 6  2300129.000000   67846.06521  4603088564.02884  33040.14791     0.98906
## 7  2900234.000000   86428.87419  7469950293.73827  36664.20307     0.98224
## 8  3500196.000000   61222.23228  3748161725.10958  26510.60201     0.99109
## 9  4100242.000000   69956.29074  4893882613.93122  31373.64533     0.98837
## 10 4700186.000000   69353.93636  4809968488.58036  26940.31179     0.98857
## 11 5200512.000000   56987.28425  3247550566.56815  24026.14405     0.99228
## 12 5700307.000000   61551.59809  3788599226.88589  23699.54617     0.99099
## 13 5700307.000000   69956.29074  4893882613.93122  31373.64533     0.98837
##    validation_rmse validation_deviance validation_mae validation_r2
## 1               NA                  NA             NA            NA
## 2     511520.83126  261653560817.42599    83725.78250       0.35500
## 3     451171.70759  203555909732.73499    58419.87902       0.49822
## 4     343676.32588  118113416970.67000    45551.34609       0.70884
## 5     168364.30640   28346539670.51930    37317.98800       0.93012
## 6     222894.87990   49682127485.49180    38898.44966       0.87753
## 7     232436.16148   54026569165.57670    44597.31003       0.86682
## 8     132835.74274   17645334548.68890    31750.76144       0.95650
## 9      74608.22746    5566387604.86988    34382.72921       0.98628
## 10    129563.94402   16786815590.53130    30931.32585       0.95862
## 11    245590.68768   60314785875.98910    32127.53658       0.85132
## 12    158810.17267   25220670945.03330    29481.44544       0.93783
## 13     74608.22746    5566387604.86988    34382.72921       0.98628
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
## Variable Importances: 
##             variable relative_importance scaled_importance percentage
## 1       finishedSqFt            1.000000          1.000000   0.302375
## 2 zest_monthlychange            0.487788          0.487788   0.147495
## 3      lastSoldPrice            0.382055          0.382055   0.115524
## 4           bedrooms            0.377211          0.377211   0.114059
## 5    zest_percentile            0.367212          0.367212   0.111036
## 6          bathrooms            0.332577          0.332577   0.100563
## 7        lotSizeSqFt            0.312363          0.312363   0.094451
## 8          compscore            0.047940          0.047940   0.014496
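
The scoring history above is noisy: validation RMSE at the last real scoring event (row 12) is 158810.17, yet the reported validation RMSE is 74608.23. The final row repeats the metrics of row 9, which is what one would expect if H2O restored the best-scoring model at the end of training. That is believed to be the default behaviour (`overwrite_with_best_model = TRUE`), but since the call does not set it explicitly, treat it as an assumption. A small sketch that locates the best row from the printed values:

```r
# Rows 2 to 12 of the m2 scoring history (row 13 repeats the restored model)
history <- data.frame(
  row = 2:12,
  epochs = c(9.65167, 67.57779, 115.86818, 164.15838, 222.12738, 280.08054,
             338.01989, 395.96736, 453.90497, 502.22231, 550.48836),
  validation_rmse = c(511520.83126, 451171.70759, 343676.32588, 168364.30640,
                      222894.87990, 232436.16148, 132835.74274, 74608.22746,
                      129563.94402, 245590.68768, 158810.17267)
)

# Scoring event with the lowest validation RMSE
best <- history[which.min(history$validation_rmse), ]
print(best)
```

The large swings between consecutive scoring events also suggest that the stopping decision for this network depends heavily on when scoring happens to occur.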

DeepLearning: Model 3

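m3 <- h2o.deeplearning(
  model_id = "dl_model_tuned",
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  overwrite_with_best_model = FALSE,  # keep the final model, not the best-scoring one
  hidden = c(50, 50, 50),
  epochs = 10,
  score_validation_samples = 100,     # score on a small validation sample
  score_duty_cycle = 0.025,           # spend at most 2.5% of the time scoring
  adaptive_rate = FALSE,              # manual learning rate instead of the adaptive one
  rate = 0.01,
  rate_annealing = 2e-6,
  momentum_start = 0.2,
  momentum_stable = 0.4,
  momentum_ramp = 1e7,
  l1 = 1e-5,                          # L1 and L2 regularization
  l2 = 1e-5,
  max_w2 = 10                         # cap on the squared sum of incoming weights per unit
)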
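
With adaptive_rate switched off, the learning rate and momentum follow fixed schedules. The sketch below assumes the usual forms, rate / (1 + rate_annealing * samples) for the annealed rate and a linear ramp from momentum_start to momentum_stable over momentum_ramp samples, and evaluates them at the 103,550 training samples that `summary(m3)` reports. The variable names are illustrative.

```r
# Hyperparameters from the m3 call
rate            <- 0.01
rate_annealing  <- 2e-6
momentum_start  <- 0.2
momentum_stable <- 0.4
momentum_ramp   <- 1e7

# Training samples processed, as reported by summary(m3)
samples <- 103550

# Annealed learning rate after `samples` training samples
effective_rate <- rate / (1 + rate_annealing * samples)

# Momentum ramps linearly and is capped once the ramp is complete
ramp_fraction      <- min(samples / momentum_ramp, 1)
effective_momentum <- momentum_start +
  (momentum_stable - momentum_start) * ramp_fraction

print(round(c(rate = effective_rate, momentum = effective_momentum), 6))
```

Both values agree with the mean_rate (0.008284) and momentum (0.202071) columns in the layer table of `summary(m3)`. With only 10 epochs, training ends about 1% of the way through the momentum ramp, so momentum_stable is never close to being reached.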

Summary of Deep Learning model with different parameters:

h2o.performance(m3, newdata = valid_h2o)

## H2ORegressionMetrics: deeplearning
## 
## MSE:  77240460743
## RMSE:  277921.7
## MAE:  120818
## RMSLE:  0.1825201
## Mean Residual Deviance :  77240460743
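
As a quick sanity check on the metrics above, RMSE is the square root of MSE, so the two reported figures should agree. The sketch below hard-codes the MSE printed by h2o.performance; the variable names are illustrative and not part of the original analysis.

```r
# MSE copied from the h2o.performance(m3, newdata = valid_h2o) output above
mse_valid <- 77240460743

# RMSE is the square root of MSE, so it is in the same units as the predicted zestimate
rmse_valid <- sqrt(mse_valid)

print(round(rmse_valid, 1))
```

Because RMSE is on the same scale as the target, it is the easier of the two figures to compare against typical property prices.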

summary(m3)

## Model Details:
## ==============
## 
## H2ORegressionModel: deeplearning
## Model Key:  dl_model_tuned 
## Status of Neuron Layers: predicting zestimate, regression, gaussian distribution, Quadratic loss, 5,601 weights/biases, 50.1 KB, 103,550 training samples, mini-batch size 1
##   layer units      type dropout       l1       l2 mean_rate rate_rms
## 1     1     8     Input  0.00 %       NA       NA        NA       NA
## 2     2    50 Rectifier  0.00 % 0.000010 0.000010  0.008284 0.000000
## 3     3    50 Rectifier  0.00 % 0.000010 0.000010  0.008284 0.000000
## 4     4    50 Rectifier  0.00 % 0.000010 0.000010  0.008284 0.000000
## 5     5     1    Linear      NA 0.000010 0.000010  0.008284 0.000000
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1       NA          NA         NA        NA       NA
## 2 0.202071    0.269970   0.870794 -9.678688 4.102140
## 3 0.202071   -0.296590   0.255310 -0.502205 1.739980
## 4 0.202071   -0.187645   0.245874  0.509115 0.996306
## 5 0.202071   -0.124947   0.413147  3.684738 0.000000
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 10001 samples **
## 
## MSE:  76210131771
## RMSE:  276061.8
## MAE:  119784.4
## RMSLE:  0.1791665
## Mean Residual Deviance :  76210131771
## 
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on temporary validation frame with 109 samples **
## 
## MSE:  140049267268
## RMSE:  374231.6
## MAE:  170786.5
## RMSLE:  0.2173632
## Mean Residual Deviance :  140049267268
## 
## 
## 
## 
## Scoring History: 
##             timestamp   duration training_speed   epochs iterations
## 1 2019-12-09 17:37:50  0.000 sec             NA  0.00000          0
## 2 2019-12-09 17:37:51  0.959 sec  12959 obs/sec  1.00000          1
## 3 2019-12-09 17:37:52  1.887 sec  62266 obs/sec 10.00000         10
##         samples training_rmse  training_deviance training_mae training_r2
## 1      0.000000            NA                 NA           NA          NA
## 2  10355.000000  340140.80866 115695769713.66499 221958.34388     0.72388
## 3 103550.000000  276061.82599  76210131771.44321 119784.40116     0.81812
##   validation_rmse validation_deviance validation_mae validation_r2
## 1              NA                  NA             NA            NA
## 2    324890.84750  105554062792.26601   235233.00513       0.73595
## 3    374231.56904  140049267267.82001   170786.54067       0.64965
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
## Variable Importances: 
##             variable relative_importance scaled_importance percentage
## 1           bedrooms            1.000000          1.000000   0.295851
## 2          bathrooms            0.889465          0.889465   0.263149
## 3    zest_percentile            0.644289          0.644289   0.190614
## 4          compscore            0.322556          0.322556   0.095429
## 5        lotSizeSqFt            0.177922          0.177922   0.052638
## 6      lastSoldPrice            0.154741          0.154741   0.045780
## 7 zest_monthlychange            0.117374          0.117374   0.034725
## 8       finishedSqFt            0.073730          0.073730   0.021813
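
The mean_rate and momentum columns in the layer table above can be reproduced from the settings passed to m3. The sketch below assumes the annealing rule from the H2O documentation, rate / (1 + rate_annealing * samples), and a linear momentum ramp from momentum_start to momentum_stable over momentum_ramp samples. The variable names are illustrative.

```r
# Settings copied from the m3 call; sample count taken from the model summary
rate            <- 0.01
rate_annealing  <- 2e-6
momentum_start  <- 0.2
momentum_stable <- 0.4
momentum_ramp   <- 1e7
samples_seen    <- 103550   # "103,550 training samples" in the summary

# Annealed learning rate: rate / (1 + rate_annealing * samples)
effective_rate <- rate / (1 + rate_annealing * samples_seen)

# Momentum ramps linearly from momentum_start to momentum_stable over
# momentum_ramp samples, then stays at momentum_stable (assumed linear ramp)
ramp_fraction      <- min(samples_seen / momentum_ramp, 1)
effective_momentum <- momentum_start + (momentum_stable - momentum_start) * ramp_fraction

print(round(effective_rate, 6))
print(round(effective_momentum, 6))
```

With only 103,550 samples seen, about 1% of the 1e7-sample ramp, momentum has barely moved from its starting value of 0.2, which is consistent with the 0.202071 reported for every layer.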

Hyper-Parameter Grid Search- Deep Learning

hyper_params <- list(
  hidden=list(c(20,20),c(50,50),c(30,30,30),c(25,25,25,25)),
  input_dropout_ratio=c(0,0.05),
  l1=seq(0,1e-4,1e-6),
  l2=seq(0,1e-4,1e-6)
)

search_criteria <- list(
  strategy = "RandomDiscrete",   # sample random combinations rather than the full grid
  max_runtime_secs = 360,
  max_models = 100,
  seed = 1234567,
  stopping_rounds = 5,           # grid-level early stopping
  stopping_tolerance = 1e-2
)

dl_random_grid <- h2o.grid(
  algorithm = "deeplearning",
  grid_id = "dl_grid_random",
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  x = x,
  y = y,
  epochs = 1,
  stopping_metric = "deviance",
  stopping_tolerance = 1e-2,     # per-model early stopping
  stopping_rounds = 2,
  score_validation_samples = 10000,
  score_duty_cycle = 0.025,
  max_w2 = 10,
  hyper_params = hyper_params,
  search_criteria = search_criteria
)
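
To see why a random search is used here instead of an exhaustive one, it helps to count the combinations the hyper-parameter lists define. The sketch below repeats the same lists in plain R (no H2O session needed) and compares the full grid size with the max_models limit; n_values, n_combinations and coverage are illustrative names.

```r
# Same hyper-parameter lists as in the grid definition above
hyper_params <- list(
  hidden = list(c(20, 20), c(50, 50), c(30, 30, 30), c(25, 25, 25, 25)),
  input_dropout_ratio = c(0, 0.05),
  l1 = seq(0, 1e-4, 1e-6),
  l2 = seq(0, 1e-4, 1e-6)
)

# Number of candidate values per hyper-parameter
n_values <- lengths(hyper_params)

# Size of the full Cartesian grid
n_combinations <- prod(n_values)

# Fraction of the grid that at most 100 randomly drawn models can cover
max_models <- 100
coverage   <- max_models / n_combinations

print(n_values)
print(n_combinations)
print(coverage)
```

The random search therefore explores only a small fraction of the space, so the "best" model reported is the best of the sampled candidates rather than of every possible combination.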

grid <- h2o.getGrid("dl_grid_random",sort_by="rmsle",decreasing=FALSE)

Summary of Hyper-Parameter Grid Search- Deep Learning

grid@summary_table[1,]

## Hyper-Parameter Search Summary: ordered by increasing rmsle
##             hidden input_dropout_ratio     l1     l2
## 1 [25, 25, 25, 25]                 0.0 2.9E-5 8.8E-5
##                 model_ids               rmsle
## 1 dl_grid_random_model_35 0.15415964403170093

best_model <- h2o.getModel(grid@model_ids[[1]]) 
best_model

## Model Details:
## ==============
## 
## H2ORegressionModel: deeplearning
## Model ID:  dl_grid_random_model_35 
## Status of Neuron Layers: predicting zestimate, regression, gaussian distribution, Quadratic loss, 2,201 weights/biases, 32.8 KB, 11,425 training samples, mini-batch size 1
##   layer units      type dropout       l1       l2 mean_rate rate_rms
## 1     1     8     Input  0.00 %       NA       NA        NA       NA
## 2     2    25 Rectifier  0.00 % 0.000029 0.000088  0.004732 0.011194
## 3     3    25 Rectifier  0.00 % 0.000029 0.000088  0.002700 0.002512
## 4     4    25 Rectifier  0.00 % 0.000029 0.000088  0.042374 0.198491
## 5     5    25 Rectifier  0.00 % 0.000029 0.000088  0.106371 0.299764
## 6     6     1    Linear      NA 0.000029 0.000088  0.040987 0.198701
##   momentum mean_weight weight_rms mean_bias bias_rms
## 1       NA          NA         NA        NA       NA
## 2 0.000000    0.006270   0.248785  0.493907 0.040107
## 3 0.000000    0.002600   0.192155  1.008167 0.023445
## 4 0.000000   -0.002637   0.203919  0.997518 0.016124
## 5 0.000000    0.004498   0.204662  0.998672 0.005301
## 6 0.000000    0.087561   0.297753 -0.001093 0.000000
## 
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 9986 samples **
## 
## MSE:  43213892971
## RMSE:  207879.5
## MAE:  94494.28
## RMSLE:  0.1419171
## Mean Residual Deviance :  43213892971
## 
## 
## H2ORegressionMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on full validation frame **
## 
## MSE:  134423633006
## RMSE:  366638.3
## MAE:  100310.1
## RMSLE:  0.1541596
## Mean Residual Deviance :  134423633006
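
The "weights/biases" figure in the model header follows directly from the layer sizes. As a minimal sketch, assuming a fully connected network with one bias per unit, the helper below (count_parameters is a hypothetical name, not an H2O function) reproduces the counts for the grid's best model and for Deep Learning Model 3.

```r
# Count weights and biases in a fully connected network.
# `layer_sizes` lists the units per layer, input first and output last.
count_parameters <- function(layer_sizes) {
  n_in  <- head(layer_sizes, -1)   # units feeding each layer
  n_out <- tail(layer_sizes, -1)   # units in each layer
  sum(n_in * n_out + n_out)        # weights plus one bias per unit
}

# 8 inputs and 1 output, as shown in the layer tables above
grid_best <- count_parameters(c(8, 25, 25, 25, 25, 1))
model_3   <- count_parameters(c(8, 50, 50, 50, 1))

print(grid_best)
print(model_3)
```

The results match the 2,201 and 5,601 weights/biases reported in the two model summaries, so the grid's best model is also the smaller of the two networks.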

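Inference: Of the models sampled in the random grid search, the deep learning model with four hidden layers of 25 units each ([25, 25, 25, 25]) performs best, with the lowest validation RMSLE (0.154). Its validation RMSE (366638.3) is well above its training RMSE (207879.5), however, so the low RMSLE should not be read as evidence that the model generalizes equally well on the original price scale.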
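
Since the grid is ranked by RMSLE, it is worth spelling out what that metric measures. The function below is a minimal sketch of the usual definition, the root mean squared difference of log(1 + value); the prices are hypothetical and are not taken from the dataset.

```r
# Root mean squared logarithmic error (standard definition)
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

# Hypothetical prices, for illustration only (not taken from the dataset)
actual    <- c(500000, 750000, 1200000)
predicted <- c(550000, 700000, 1000000)

print(rmsle(actual, predicted))
print(rmsle(actual, actual))   # identical vectors give zero error
```

Because it works on the log scale, RMSLE penalizes relative rather than absolute errors, which is why a model can rank well on RMSLE while still showing a large RMSE on expensive properties.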

Result Summary & Discussion

results <- data.frame(
  Model_Name = c("GLM", "Random Forest", "GBM",
                 "GBM with Parameters", "Deep Learning 1",
                 "Deep Learning 2", "Deep Learning 3"),
  RSquare = c(0.760, 0.953, 0.980, 0.988, 0.457, 0.975, -0.013),
  RMSE = c(332260.8, 146563.2, 88505.25, 66803.47, 468974.5, 100355, 636930.5),
  RMSLE = c(0.248, 0.057, 0.073, 0.046, 0.207, 0.060, 0.489))

library(kableExtra)
# Highlight row 4 ("GBM with Parameters"), the best-performing model
kable(results) %>%
  kable_styling(bootstrap_options = "striped") %>%
  row_spec(4, bold = TRUE, background = "#baeeb9")

Model_Name	RSquare	RMSE	RMSLE
GLM	0.760	332260.80	0.248
Random Forest	0.953	146563.20	0.057
GBM	0.980	88505.25	0.073
GBM with Parameters	0.988	66803.47	0.046
Deep Learning 1	0.457	468974.50	0.207
Deep Learning 2	0.975	100355.00	0.060
Deep Learning 3	-0.013	636930.50	0.489

Comparing the results of all the models above, the best predictions come from the model “GBM with Parameters”. As the table shows, it has the highest R2 (0.988) and the lowest RMSE (66803.47) and RMSLE (0.046), which is better than any of the other models used for predicting the property prices.
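
The same comparison can be made programmatically rather than by reading the table. The sketch below rebuilds the results as a standalone data frame (values copied from the table above, with the error column named RMSLE) and picks the winner on each metric; the best_by_* names are illustrative.

```r
# Results copied from the summary table above
results <- data.frame(
  Model_Name = c("GLM", "Random Forest", "GBM", "GBM with Parameters",
                 "Deep Learning 1", "Deep Learning 2", "Deep Learning 3"),
  RSquare = c(0.760, 0.953, 0.980, 0.988, 0.457, 0.975, -0.013),
  RMSE    = c(332260.8, 146563.2, 88505.25, 66803.47, 468974.5, 100355, 636930.5),
  RMSLE   = c(0.248, 0.057, 0.073, 0.046, 0.207, 0.060, 0.489),
  stringsAsFactors = FALSE
)

# Best model on each metric: highest R-squared, lowest RMSE and RMSLE
best_by_r2    <- results$Model_Name[which.max(results$RSquare)]
best_by_rmse  <- results$Model_Name[which.min(results$RMSE)]
best_by_rmsle <- results$Model_Name[which.min(results$RMSLE)]

print(c(R2 = best_by_r2, RMSE = best_by_rmse, RMSLE = best_by_rmsle))

# Full ranking by RMSE, best first
print(results[order(results$RMSE), ])
```

All three metrics select the same model, which makes the choice of "GBM with Parameters" independent of which metric is preferred.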

Shutting Down H2O session

h2o.shutdown(prompt = FALSE)

## [1] TRUE

ZillowAnalysis

Akansha Jain, Ranjani Anjur Venkatraman, Swapna Bandaru

12/6/2019
