Title: Zillow Analysis
Team Members: Akansha Jain, Ranjani Anjur Venkatraman, Swapna Bandaru
Business Context: Zillow is an online American real estate database that aims to empower consumers with information to help them make informed decisions about buying, selling and renting properties. Our aim is to predict the selling price of properties using the listings available on Zillow for multiple neighborhoods.
Problem Description: When property owners want to sell a property, they commonly face the following issues: they may not be aware of the current price of the property, and they may not know the prices of comparable properties in the area.
URL link : https://github.com/RanjaniAnjurVenkatraman/Zillow_Data/blob/master/zillow_final.zip
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
The dataset used in this assignment was assembled from external data sources. To collect valid New York addresses, we used the New York City PLUTO database. The addresses were then used to fetch ZPIDs (Zillow Property IDs) using the GetDeepSearchResults API. Next, the collected ZPIDs were passed to the GetDeepComps API to fetch the final dataset. The final dataset contains 24,482 observations about multiple Zillow properties. There are 28 features describing these properties, including address, zipcode, city, state, latitude, longitude, region name, region id, type, Zestimate, Zest_lastupdated, zest_monthlyChange, Zest_percentile, Zestimate_low, Zestimate_high, compsScore, bathrooms, bedrooms, finishedSqFt, lastSoldPrice, lotSizeSqFt, taxAssessment, taxAssessmentYear, totalRooms, YearBuilt. The final dataset was converted to CSV and saved in our Git repository, available at https://github.com/RanjaniAnjurVenkatraman/Zillow_Data/blob/master/zillow_final.zip.
The analysis begins with loading the dataset and the required libraries in R.
library(dplyr)
library(tidyverse)
library(knitr)
library(ggthemes)
library(caret)
library(h2o)
library(ggplot2)
library(geosphere)
library(kableExtra)
We use the data uploaded to our Git repository. The code below downloads the zip file from the repository link into a temporary file, reads the CSV inside it, and then deletes the temporary file.
temp <- tempfile()
download.file("https://raw.githubusercontent.com/RanjaniAnjurVenkatraman/Zillow_Data/master/zillow_final.zip",temp)
zdata <- read.csv(unz(temp, "zillow.csv"), encoding="UTF-8", na.strings=c("","NA"), stringsAsFactors = F)
unlink(temp)
str(zdata)
## 'data.frame': 24482 obs. of 28 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ zpid : int 2096383529 121711481 245067526 245014557 245012575 30566078 30566079 68312584 300077093 68312616 ...
## $ address : chr NA "183 Plymouth St PENTHOUSE N" "130 Furman St APT 408" "51 Jay St APT 1M" ...
## $ zipcode : int NA 11201 11201 11201 11201 11201 11201 11201 11201 11201 ...
## $ city : chr NA "Brooklyn" "Brooklyn" "Brooklyn" ...
## $ state : chr NA "NY" "NY" "NY" ...
## $ lat : num NA 40.7 40.7 40.7 40.7 ...
## $ long : num NA -74 -74 -74 -74 ...
## $ region_name : chr NA "DUMBO" "Brooklyn Heights" "DUMBO" ...
## $ region_id : int NA 270841 403122 270841 403122 270841 270841 270841 403122 270841 ...
## $ type : chr NA "neighborhood" "neighborhood" "neighborhood" ...
## $ zestimate : int NA 4108248 4903925 2062475 2376619 3321319 4405368 2351179 2879316 2168759 ...
## $ zest_lastupdated : chr NA "12/1/2019" "12/1/2019" "12/1/2019" ...
## $ zest_monthlychange: int NA 19592 52400 -6919 -8769 -13587 -18049 -9633 -54455 -7262 ...
## $ zest_percentile : int NA 97 97 83 88 95 97 88 0 85 ...
## $ zestimate_low : int NA 3738506 4413533 1918102 2186489 3055613 4052939 2163085 2591384 2016946 ...
## $ zestimate_high : int NA 4313660 5443357 2206848 2566749 3620238 4801851 2562785 3080868 2320572 ...
## $ compscore : int NA NA 5 11 11 4 3 10 8 11 ...
## $ bathrooms : num NA NA 4 3 1 2.5 3.5 3 NA 3 ...
## $ bedrooms : int NA NA 3 2 1 2 3 3 0 3 ...
## $ finishedSqFt : int NA 2156 4163 1451 1633 2198 2592 1700 1746 1656 ...
## $ lastSoldDate : chr NA "5/22/2014" "7/2/2019" "7/9/2019" ...
## $ lastSoldPrice : num NA 3082039 4291240 2078000 2400000 ...
## $ lotSizeSqFt : int NA NA NA NA NA NA NA NA 12876 NA ...
## $ taxAssessment : num NA 752407 1078954 248039 425741 ...
## $ taxAssessmentYear : int NA 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
## $ totalRooms : int NA NA NA 4 NA NA NA NA NA NA ...
## $ yearBuilt : int NA 1920 2015 2016 2015 1913 1916 1916 2012 1916 ...
rmarkdown::paged_table(zdata)
Checking the data types of the columns
sapply(zdata,class)
## X zpid address
## "integer" "integer" "character"
## zipcode city state
## "integer" "character" "character"
## lat long region_name
## "numeric" "numeric" "character"
## region_id type zestimate
## "integer" "character" "integer"
## zest_lastupdated zest_monthlychange zest_percentile
## "character" "integer" "integer"
## zestimate_low zestimate_high compscore
## "integer" "integer" "integer"
## bathrooms bedrooms finishedSqFt
## "numeric" "integer" "integer"
## lastSoldDate lastSoldPrice lotSizeSqFt
## "character" "numeric" "integer"
## taxAssessment taxAssessmentYear totalRooms
## "numeric" "integer" "integer"
## yearBuilt
## "integer"
Converting the required columns to numeric
zdata$zpid<-as.numeric(zdata$zpid)
zdata$zestimate<-as.numeric(zdata$zestimate)
zdata$zestimate_low<-as.numeric(zdata$zestimate_low)
zdata$zestimate_high<-as.numeric(zdata$zestimate_high)
zdata$lotSizeSqFt<-as.numeric(zdata$lotSizeSqFt)
summary(zdata)
## X zpid address zipcode
## Min. : 1 Min. :3.000e+07 Length:24482 Min. : 7030
## 1st Qu.: 6121 1st Qu.:6.491e+07 Class :character 1st Qu.:11201
## Median :12242 Median :8.907e+07 Mode :character Median :11201
## Mean :12242 Mean :2.918e+08 Mean :11179
## 3rd Qu.:18362 3rd Qu.:2.161e+08 3rd Qu.:11216
## Max. :24482 Max. :2.147e+09 Max. :11385
## NA's :651
## city state lat long
## Length:24482 Length:24482 Min. :40.59 Min. :-74.09
## Class :character Class :character 1st Qu.:40.68 1st Qu.:-73.99
## Mode :character Mode :character Median :40.69 Median :-73.98
## Mean :40.69 Mean :-73.98
## 3rd Qu.:40.70 3rd Qu.:-73.97
## Max. :40.80 Max. :-73.86
## NA's :651 NA's :651
## region_name region_id type zestimate
## Length:24482 Min. : 5837 Length:24482 Min. : 173321
## Class :character 1st Qu.:270816 Class :character 1st Qu.: 748146
## Mode :character Median :272902 Mode :character Median : 992678
## Mean :290966 Mean :1187194
## 3rd Qu.:273903 3rd Qu.:1411105
## Max. :403223 Max. :9217860
## NA's :651 NA's :1782
## zest_lastupdated zest_monthlychange zest_percentile zestimate_low
## Length:24482 Min. :-1249670 Min. : 0.00 Min. : 159455
## Class :character 1st Qu.: -4801 1st Qu.:10.00 1st Qu.: 695776
## Mode :character Median : -3022 Median :34.00 Median : 915005
## Mean : -3700 Mean :38.23 Mean :1088827
## 3rd Qu.: -2015 3rd Qu.:62.00 3rd Qu.:1282860
## Max. : 953637 Max. :99.00 Max. :8664788
## NA's :2116 NA's :651 NA's :1782
## zestimate_high compscore bathrooms bedrooms
## Min. : 186192 Min. : 0.000 Min. : 0.500 Min. : 0.000
## 1st Qu.: 804710 1st Qu.: 5.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median :1074137 Median : 7.000 Median : 1.000 Median : 2.000
## Mean :1294075 Mean : 7.629 Mean : 1.596 Mean : 1.867
## 3rd Qu.:1539740 3rd Qu.:10.000 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :9770932 Max. :18.000 Max. :12.000 Max. :18.000
## NA's :1782 NA's :1786 NA's :4311 NA's :3772
## finishedSqFt lastSoldDate lastSoldPrice lotSizeSqFt
## Min. : 1 Length:24482 Min. : 10 Min. : 1
## 1st Qu.: 750 Class :character 1st Qu.: 743000 1st Qu.: 1800
## Median : 1019 Mode :character Median : 960000 Median : 2260
## Mean : 2613 Mean : 1135915 Mean : 10488
## 3rd Qu.: 1455 3rd Qu.: 1369500 3rd Qu.: 4257
## Max. :883265 Max. :62742225 Max. :980100
## NA's :5670 NA's :1362 NA's :17439
## taxAssessment taxAssessmentYear totalRooms yearBuilt
## Min. : 1000 Min. :2007 Min. : 1.000 Min. :1833
## 1st Qu.: 150802 1st Qu.:2018 1st Qu.: 3.000 1st Qu.:1910
## Median : 223129 Median :2018 Median : 4.000 Median :1931
## Mean : 1192765 Mean :2017 Mean : 4.407 Mean :1952
## 3rd Qu.: 586000 3rd Qu.:2018 3rd Qu.: 5.000 3rd Qu.:2005
## Max. :81260000 Max. :2019 Max. :24.000 Max. :2019
## NA's :8516 NA's :5376 NA's :17244 NA's :3660
To arrive at a final dataset, we perform the following data cleaning steps:
1) Select data to ensure unique observations
2) Select data where valuation and bedrooms are non-missing
3) Remove outliers:
- Remove homes with more than 5 bedrooms
- Remove homes valued at more than $10 million
- Remove neighborhoods with fewer than 10 observations
# Keep one row per property: duplicated() flags repeated zpid values
zdata<-zdata[!duplicated(zdata$zpid),]
zclean<-zdata[!is.na(zdata$bedrooms)&!is.na(zdata$zestimate),]
zclean<-filter(zclean,bedrooms<=5)
zclean<-filter(zclean,zestimate<=10000000)
neighborhoods<-c("Canarsie","Williamsburg","Ocean Hill","Brownsville","Chelsea","Vinegar Hill","Garment District","Greenwood","Red Hook","Wingate")
zclean<-zclean[!(zclean$region_name %in% neighborhoods),]
rmarkdown::paged_table(zclean)
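The cleaning steps above can be sketched as a single function. This is a minimal illustration in base R on a made-up data frame, not the code used for the analysis: `clean_listings`, its arguments and the toy data are hypothetical. It derives the sparse neighborhoods from a count rather than listing them by name, which is presumably how the hard-coded `neighborhoods` vector was obtained.

```r
# Hypothetical helper: applies the cleaning rules to a listings data frame
clean_listings <- function(df, max_bedrooms = 5, max_price = 1e7, min_per_region = 10) {
  # 1) keep the first row for each property ID
  df <- df[!duplicated(df$zpid), ]
  # 2) keep rows where bedroom count and valuation are present
  df <- df[!is.na(df$bedrooms) & !is.na(df$zestimate), ]
  # 3) remove outliers by bedroom count and price
  df <- df[df$bedrooms <= max_bedrooms & df$zestimate <= max_price, ]
  # count rows per neighborhood and keep only well-populated ones
  region_counts <- table(df$region_name)
  keep <- names(region_counts)[region_counts >= min_per_region]
  df[df$region_name %in% keep, ]
}

# Toy data (invented values) with a duplicate, a missing value and an outlier
toy <- data.frame(
  zpid = c(1, 1, 2, 3, 4, 5),
  bedrooms = c(2, 2, 7, 1, NA, 3),
  zestimate = c(5e5, 5e5, 9e5, 7e5, 6e5, 8e5),
  region_name = c("A", "A", "A", "A", "B", "B"),
  stringsAsFactors = FALSE
)
cleaned <- clean_listings(toy, min_per_region = 2)
print(cleaned)
```

Computing the threshold from the data keeps the filter valid if the dataset is refreshed, whereas a fixed list of names has to be updated by hand.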
Checking price variability based on distance
Since most of our addresses are in New York and the surrounding regions, we use Midtown Manhattan as the reference point of our analysis. We calculate the distance of every property from Midtown Manhattan to see how prices vary as the distance changes. The distance is computed from the latitude and longitude in the dataset using the "as the crow flies" (shortest-path) distance provided by the geosphere package in R (https://cran.r-project.org/web/packages/geosphere/geosphere.pdf). Midtown Manhattan coordinates used as reference: (ref_lat=40.75047, ref_long=-73.98961). After the calculation, we convert the distance from metres to miles.
ref_lat=40.75047
ref_long=-73.98961
ref_loc<-c(ref_long,ref_lat)
zclean$ref_long<-ref_long
zclean$ref_lat<-ref_lat
zclean$distance <- distGeo(zclean[,c('long','lat')], zclean[,c('ref_long','ref_lat')],a=6378137,f=1/298.257223563)
zclean$distance<-zclean$distance*0.000621371
rmarkdown::paged_table(zclean)
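To make the distance step concrete, here is a minimal base R sketch of a great-circle distance using the haversine formula. It is a swapped-in simplification: the analysis uses geosphere::distGeo, which works on the WGS84 ellipsoid, while this version assumes a sphere with a mean Earth radius of 6371008.8 m. `haversine_miles` is a hypothetical helper name. For the first cleaned row (130 Furman St), the analysis reports about 3.47 miles, and the spherical approximation should land close to that.

```r
# Haversine great-circle distance in miles (spherical approximation, assumed radius)
haversine_miles <- function(lat, long, ref_lat, ref_long, radius_m = 6371008.8) {
  to_rad <- pi / 180
  dlat <- (ref_lat - lat) * to_rad
  dlong <- (ref_long - long) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat * to_rad) * cos(ref_lat * to_rad) * sin(dlong / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  meters <- 2 * radius_m * asin(pmin(1, sqrt(a)))
  meters / 1609.344  # metres per mile
}

# Coordinates of 130 Furman St (from the data) against the Midtown reference point
d <- haversine_miles(40.70043, -73.99640, ref_lat = 40.75047, ref_long = -73.98961)
print(d)
```

The function is vectorised, so it can take whole `lat` and `long` columns with a single reference point, which avoids storing the reference coordinates in every row.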
We now look at the variation in price by region name, bedroom count and bathroom count.
zclean %>% count(region_name) %>% arrange(desc(n))
## # A tibble: 51 x 2
## region_name n
## <chr> <int>
## 1 Fort Greene 3480
## 2 Brooklyn Heights 3431
## 3 Park Slope 2499
## 4 DUMBO 2058
## 5 Downtown 1961
## 6 Clinton Hill 1234
## 7 Bedford Stuyvesant 697
## 8 Prospect Heights 666
## 9 Gowanus 572
## 10 Boerum Hill 359
## # ... with 41 more rows
zclean %>% count(bedrooms)
## # A tibble: 6 x 2
## bedrooms n
## <int> <int>
## 1 0 1008
## 2 1 8398
## 3 2 5881
## 4 3 1929
## 5 4 540
## 6 5 672
zclean %>% count(bathrooms)%>% arrange(desc(n))
## # A tibble: 12 x 2
## bathrooms n
## <dbl> <int>
## 1 1 10540
## 2 2 5210
## 3 1.5 763
## 4 3 752
## 5 4 486
## 6 NA 376
## 7 2.5 154
## 8 5 123
## 9 3.5 17
## 10 7 4
## 11 6 2
## 12 0.5 1
summarize(group_by(zclean,region_name),med=median(zestimate),avg=mean(zestimate))%>% arrange(desc(med))
## # A tibble: 51 x 3
## region_name med avg
## <chr> <dbl> <dbl>
## 1 Flatlands 4569779 4569779
## 2 Midtown 2930558 2930733
## 3 Tribeca 2850148 2850148
## 4 Carroll Gardens 2647610 2450236.
## 5 Inwood 2038615 2038615
## 6 Windsor Terrace 1606190. 1540155.
## 7 Carnegie Hill 1575946 1575946
## 8 Sheepshead Bay 1483645 1322685.
## 9 Columbia Street Waterfront District 1391213 1876804.
## 10 Boerum Hill 1362111 1661109.
## # ... with 41 more rows
summarize(group_by(zclean,bedrooms),med=median(zestimate),avg=mean(zestimate))%>% arrange(desc(med))
## # A tibble: 6 x 3
## bedrooms med avg
## <int> <dbl> <dbl>
## 1 4 2038330 2338327.
## 2 5 1694277 2088816.
## 3 3 1569250 1748148.
## 4 2 1194324 1231649.
## 5 1 777033 836998.
## 6 0 579750 959679.
The plot below shows the distribution of zestimate values across the properties.
options(scipen = 999)
plot<-ggplot(data=zclean,aes(x=zestimate))
plot+geom_histogram(color = 'blue', binwidth = 100000)+ggtitle("Zillow House Prices")
Plot importance
From the above distribution, we can infer that most of the properties are priced between $100,000 and $2,500,000.
The plot below shows the relationship between lot size in square feet (lotSizeSqFt, restricted to lots under 10,000 sq ft) and the zestimate value of the properties.
zclean$bedrooms<-as.factor(zclean$bedrooms)
zclean$bathrooms<-as.factor(zclean$bathrooms)
plot<-ggplot(data=zclean[zclean$lotSizeSqFt<10000&!is.na(zclean$lotSizeSqFt),],aes(x=lotSizeSqFt, y=zestimate))
plot+geom_point(aes(color=bedrooms))+ggtitle("Zestimate vs Sq.Ft")+ylab("Property Value in $")+scale_y_continuous(breaks=seq(0,10000000,2000000))
Plot importance
From the above plot, we can infer that most of the properties have a lot size between 1,250 and 2,500 sq ft; the same range also contains the most varied houses with respect to the number of bedrooms.
The plot below shows the relationship between distance and zestimate value.
plot<-ggplot(data=zclean,aes(x=distance, y=zestimate))
plot+geom_point(aes(color=bedrooms))+ggtitle("Zestimate vs Distance")+ylab("Property Value in $")+xlab("Distance in miles")+scale_y_continuous(breaks=seq(0,10000000,2000000))
Plot importance
From the above distribution, we can infer that most of the properties lie 3 to 6 miles from Midtown Manhattan, and most of these are priced below $2 million. The lower-priced properties mostly have 0 to 3 bedrooms. We can also see that, as the number of bedrooms increases, the property value tends to increase.
In the plot below, we represent the property price distribution with respect to the number of bedrooms via box plots.
ggplot(zclean, aes(y=zestimate, x=bedrooms)) + geom_boxplot(aes(color=bedrooms))+ylab("Property Value in $")+xlab("Bedrooms")+ggtitle("Box Plots of Property Value")
Plot importance
From the above distribution, we can infer that the property price generally increases with the number of bedrooms. One exception stands out: the median price of a 4-bedroom property is higher than that of a 5-bedroom property, which is surprising. This could be because many 4-bedroom properties are located in more expensive areas, or because some other factor influences the prices.
In the plot below, we represent the property size distribution with respect to the number of bedrooms via box plots.
ggplot(zclean[zclean$lotSizeSqFt<10000&!is.na(zclean$lotSizeSqFt),], aes(y=lotSizeSqFt, x=bedrooms)) + geom_boxplot(aes(color=bedrooms))+ylab("Size in Square Feet")+xlab("Bedrooms")+ggtitle("Box Plots of Property Size")
Plot importance
From the above distribution, we can see that 75% to 100% of the properties with 0, 3, 4 and 5 bedrooms have a size of less than 2,500 square feet. However, more than 75% of the properties with 1 and 2 bedrooms have a size of more than 2,500 square feet. Here, we can clearly see that our data is skewed, which may affect the predictions of our models.
In the plot below, we compare property prices across different areas of central Brooklyn by filtering on the areas' zip codes.
zclean1 <- zclean
head(zclean1)
## X zpid address zipcode city state lat
## 1 3 245067526 130 Furman St APT 408 11201 Brooklyn NY 40.70043
## 2 4 245014557 51 Jay St APT 1M 11201 Brooklyn NY 40.70341
## 3 5 245012575 90 Furman St APT 801 11201 Brooklyn NY 40.70154
## 4 6 30566078 1 Main St APT 3A 11201 Brooklyn NY 40.70354
## 5 7 30566079 1 Main St APT 3B 11201 Brooklyn NY 40.70354
## 6 8 68312584 70 Washington St APT 8D 11201 Brooklyn NY 40.70208
## long region_name region_id type zestimate
## 1 -73.99640 Brooklyn Heights 403122 neighborhood 4903925
## 2 -73.98628 DUMBO 270841 neighborhood 2062475
## 3 -73.99587 Brooklyn Heights 403122 neighborhood 2376619
## 4 -73.99023 DUMBO 270841 neighborhood 3321319
## 5 -73.99023 DUMBO 270841 neighborhood 4405368
## 6 -73.98994 DUMBO 270841 neighborhood 2351179
## zest_lastupdated zest_monthlychange zest_percentile zestimate_low
## 1 12/1/2019 52400 97 4413533
## 2 12/1/2019 -6919 83 1918102
## 3 12/1/2019 -8769 88 2186489
## 4 12/1/2019 -13587 95 3055613
## 5 12/1/2019 -18049 97 4052939
## 6 12/1/2019 -9633 88 2163085
## zestimate_high compscore bathrooms bedrooms finishedSqFt lastSoldDate
## 1 5443357 5 4 3 4163 7/2/2019
## 2 2206848 11 3 2 1451 7/9/2019
## 3 2566749 11 1 1 1633 4/16/2019
## 4 3620238 4 2.5 2 2198 1/11/2019
## 5 4801851 3 3.5 3 2592 1/10/2019
## 6 2562785 10 3 3 1700 1/10/2019
## lastSoldPrice lotSizeSqFt taxAssessment taxAssessmentYear totalRooms
## 1 4291240 NA 1078954 2018 NA
## 2 2078000 NA 248039 2018 4
## 3 2400000 NA 425741 2018 NA
## 4 3355000 NA 302233 2018 NA
## 5 4450000 NA 466341 2018 NA
## 6 2375000 NA 372245 2018 NA
## yearBuilt ref_long ref_lat distance
## 1 2015 -73.98961 40.75047 3.470907
## 2 2016 -73.98961 40.75047 3.252032
## 3 2015 -73.98961 40.75047 3.391957
## 4 1913 -73.98961 40.75047 3.238517
## 5 1916 -73.98961 40.75047 3.238517
## 6 1916 -73.98961 40.75047 3.338864
zclean1<-filter(zclean1,zipcode %in% c("11212", "11216", "11233", "11238"))
zclean1$zipcode<-as.factor(zclean1$zipcode)
plot5<-ggplot(data = zclean1,aes(x = zipcode, y = zestimate)) +
geom_point() +
labs(x = "zip", y = "price")
plot5
Plot importance
From the above plot, we can see that zip code 11238 has the highest property prices among the four areas. It also has more properties than the other zip codes; one possible reason is that it is a preferred and safe residential area, which pushes prices up.
modeldf <- zclean
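# bathrooms and bedrooms are factors at this point; go through as.character() to recover the values rather than the level codes
modeldf$bathrooms<-as.numeric(as.character(modeldf$bathrooms))
modeldf$bedrooms<-as.numeric(as.character(modeldf$bedrooms))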
modeldf <- modeldf %>% mutate_if(., is.numeric, ~replace(., is.na(.), 0))
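The bedrooms and bathrooms columns were turned into factors for plotting, and converting a factor back to a number needs some care: as.numeric() on a factor returns the internal level codes, not the original values. The small illustration below uses invented bathroom counts to show the difference.

```r
# Invented example values; levels are sorted as "0.5", "1", "2.5"
bathrooms <- factor(c(1, 2.5, 0.5, 2.5))

# Level codes: positions of each value within the sorted levels
codes <- as.numeric(bathrooms)

# Original values: convert the labels to character first, then to numeric
values <- as.numeric(as.character(bathrooms))

print(codes)   # 2 3 1 3
print(values)  # 1.0 2.5 0.5 2.5
```

For a model, the level codes would silently change the scale of the feature (for example, 0 bedrooms would become 1), so the character round trip is the safer conversion.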
Splitting data into training and testing sets
set.seed(123)
training.samples <- modeldf$zestimate %>%createDataPartition(p = 0.8, list = FALSE)
train.data <- modeldf[training.samples, ]
test.data <- modeldf[-training.samples, ]
Linear Regression
lm_model1 <- lm(zestimate ~ zestimate_low+zestimate_high+zest_monthlychange+zest_percentile+compscore+bedrooms+bathrooms+
finishedSqFt+lastSoldPrice+lotSizeSqFt , data = train.data)
predictions <- lm_model1 %>% predict(test.data)
data.frame(
R2 = R2(predictions, test.data$zestimate)
)
## R2
## 1 0.9996352
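As seen from the above result, the value of R2 is approximately 1. A value this high usually signals a problem rather than a good model: zestimate_low and zestimate_high are the lower and upper bounds of the valuation range around the Zestimate, so they nearly determine the target and are highly correlated with each other. Such correlated predictors lead to unreliable and unstable estimates of the regression coefficients, so next we check for multicollinearity and remove the offending features from our analysis.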
car::vif(lm_model1)
## zestimate_low zestimate_high zest_monthlychange
## 64.036310 60.173868 1.043842
## zest_percentile compscore bedrooms
## 1.985705 1.056540 2.645083
## bathrooms finishedSqFt lastSoldPrice
## 2.953741 1.146685 2.755964
## lotSizeSqFt
## 1.144532
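Looking at the above output, we can see that zestimate_low and zestimate_high show very high VIF (Variance Inflation Factor) values, about 64 and 60, while every other feature stays below 3. A common rule of thumb treats VIF values above 5 to 10 as a sign of problematic multicollinearity. Both features are also derived from the value we are trying to predict. Hence, we will remove these 2 features and use the remaining features for modeling and prediction.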
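For intuition, the VIF of a predictor j is 1 / (1 - R2_j), where R2_j is the R2 obtained by regressing predictor j on all the other predictors. The sketch below computes this by hand in base R on simulated data; `manual_vif` and the simulated columns are hypothetical, and for purely numeric predictors the result should agree with car::vif, which the analysis uses.

```r
# Hypothetical helper: regress each predictor on the others, then apply 1 / (1 - R^2)
manual_vif <- function(predictors) {
  sapply(names(predictors), function(target) {
    others <- setdiff(names(predictors), target)
    fit <- lm(reformulate(others, response = target), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated data: `low` and `high` share the same signal, `other` is independent
set.seed(123)
n <- 200
signal <- rnorm(n)
sim <- data.frame(
  low = signal + rnorm(n, sd = 0.1),
  high = signal + rnorm(n, sd = 0.1),
  other = rnorm(n)
)
vifs <- manual_vif(sim)
print(round(vifs, 2))
```

The two columns that share a signal get large VIF values while the independent column stays near 1, which mirrors the pattern in the output above.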
Re-running the linear regression model with updated feature list:
lm_model1 <- lm(zestimate ~ zest_monthlychange+zest_percentile+compscore+bedrooms+bathrooms+finishedSqFt+lastSoldPrice+lotSizeSqFt , data = train.data)
# Make predictions
predictions <- lm_model1 %>% predict(test.data)
# Model performance
data.frame(
R2 = R2(predictions, test.data$zestimate)
)
## R2
## 1 0.7321444
From the above result, we can see that R2 drops from about 0.9996 to about 0.73 once the features causing multicollinearity are removed. We will use the same features for the H2O models that follow.
Zmodeldf <- modeldf[,c("zest_monthlychange","zest_percentile",
"compscore","bedrooms","bathrooms","finishedSqFt",
"lastSoldPrice","lotSizeSqFt","zestimate")]
set.seed(123)
training.samples <- Zmodeldf$zestimate %>%createDataPartition(p = 0.8, list = FALSE)
train.data <- Zmodeldf[training.samples, ]
test.data <- Zmodeldf[-training.samples, ]
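A note on the R2 values reported for the linear models: they were computed with caret's R2(), whose default (formula = "corr") is, as far as we are aware, the squared correlation between predictions and observations rather than 1 - SSE/SST. The two definitions can differ a lot when predictions are correlated with the truth but badly calibrated. The base R sketch below, with hypothetical helper names and invented numbers, shows both.

```r
# Squared-correlation R2: insensitive to scale and offset of the predictions
r2_corr <- function(pred, obs) cor(pred, obs)^2

# Traditional R2: 1 - residual sum of squares / total sum of squares
r2_traditional <- function(pred, obs) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

obs <- c(1, 2, 3, 4, 5)
pred <- 2 * obs + 10  # perfectly correlated with obs, but far off in level

print(r2_corr(pred, obs))         # 1
print(r2_traditional(pred, obs))  # -84.5
```

When comparing against the R^2 printed by H2O, it is worth checking that both numbers use the same definition.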
Creating H2O dataframe
kd_h2o<-h2o.init(nthreads = -1,max_mem_size = "16g")
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\RANJAN~1\AppData\Local\Temp\Rtmp0ca7Ct/h2o_Ranjani_Krishna_started_from_r.out
## C:\Users\RANJAN~1\AppData\Local\Temp\Rtmp0ca7Ct/h2o_Ranjani_Krishna_started_from_r.err
##
##
## Starting H2O JVM and connecting: Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 3 seconds 376 milliseconds
## H2O cluster timezone: America/New_York
## H2O data parsing timezone: UTC
## H2O cluster version: 3.26.0.2
## H2O cluster version age: 4 months and 12 days !!!
## H2O cluster name: H2O_started_from_R_Ranjani_Krishna_hzp306
## H2O cluster total nodes: 1
## H2O cluster total memory: 16.00 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.5.3 (2019-03-11)
install.packages("https://h2o-release.s3.amazonaws.com/h2o-ensemble/R/h2oEnsemble_0.1.8.tar.gz", repos = NULL)
library(h2oEnsemble)
data_h2o <- as.h2o(
train.data,
destination_frame= "train.hex"
)
new_data_h2o <- as.h2o(
test.data,
destination_frame= "test.hex"
)
splits <- h2o.splitFrame(data = data_h2o,
ratios = c(0.7, 0.15),
seed = 1234)
train_h2o <- splits[[1]]
valid_h2o <- splits[[2]]
test_h2o <- splits[[3]]
y <- "zestimate"
x <- setdiff(names(train_h2o), y)
glm <- h2o.glm(family= "gaussian", x= x, y=y, training_frame=train_h2o, lambda = 0, compute_p_values = TRUE)
Summary of GLM Model:
h2o.performance(glm, newdata = test_h2o)
## H2ORegressionMetrics: glm
##
## MSE: 110397214797
## RMSE: 332260.8
## MAE: 160639.1
## RMSLE: 0.2483794
## Mean Residual Deviance : 110397214797
## R^2 : 0.7604126
## Null Deviance :1000819984321746
## Null D.o.F. :2171
## Residual Deviance :239782750539539
## Residual D.o.F. :2163
## AIC :61412.07
summary(glm)
## Model Details:
## ==============
##
## H2ORegressionModel: glm
## Model Key: GLM_model_R_1575930910406_1
## GLM Model: summary
## family link regularization number_of_predictors_total
## 1 gaussian identity None 8
## number_of_active_predictors number_of_iterations training_frame
## 1 8 1 RTMP_sid_9972_3
##
## H2ORegressionMetrics: glm
## ** Reported on training data. **
##
## MSE: 76018393414
## RMSE: 275714.3
## MAE: 150023.9
## RMSLE: NaN
## Mean Residual Deviance : 76018393414
## R^2 : 0.818125
## Null Deviance :4328083861864295
## Null D.o.F. :10354
## Residual Deviance :787170463805174
## Residual D.o.F. :10346
## AIC :288842.9
##
##
##
##
##
## Scoring History:
## timestamp duration iterations negative_log_likelihood
## 1 2019-12-09 17:35:23 0.000 sec 0 4328083861864379.00000
## objective
## 1 417970435718.43402
##
## Variable Importances: (Extract with `h2o.varimp`)
## =================================================
##
## variable relative_importance scaled_importance percentage
## 1 lastSoldPrice 453205.151 1.00000000 0.580899738
## 2 zest_percentile 83059.522 0.18327136 0.106462282
## 3 bathrooms 78925.904 0.17415050 0.101163980
## 4 bedrooms 55847.410 0.12322766 0.071582915
## 5 zest_monthlychange 46955.230 0.10360701 0.060185284
## 6 compscore 38903.608 0.08584105 0.049865045
## 7 lotSizeSqFt 16484.249 0.03637260 0.021128833
## 8 finishedSqFt 6796.849 0.01499729 0.008711922
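The RMSE, MAE and RMSLE figures in the H2O output can be reproduced from predictions and observed values with a few lines of base R. The helper names below are ours, the numbers are invented, and the RMSLE form, which compares log(1 + value), is the usual definition and is assumed to match H2O's. It also suggests one plausible reason for the `RMSLE: NaN` in the training metrics: a linear model can predict values below -1, for which log(1 + value) is undefined.

```r
# Hypothetical helpers for the three error metrics
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
mae <- function(pred, obs) mean(abs(pred - obs))
rmsle <- function(pred, obs) sqrt(mean((log1p(pred) - log1p(obs))^2))

obs <- c(100, 200, 300)
pred <- c(110, 190, 330)

print(rmse(pred, obs))
print(mae(pred, obs))
print(rmsle(pred, obs))

# A single prediction below -1 makes the whole RMSLE undefined
bad_pred <- c(-5, 190, 330)
print(suppressWarnings(rmsle(bad_pred, obs)))  # NaN
```

RMSE penalises large errors more heavily than MAE, which is why the two can rank models differently when a few properties are badly mispriced.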
rf <- h2o.randomForest(x = x,
y = y,
training_frame = train_h2o,
ntrees = 50,
nfolds = 5,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE,
seed = 1)
Summary of Random Forest Model:
h2o.performance(rf, newdata = test_h2o)
## H2ORegressionMetrics: drf
##
## MSE: 21480774857
## RMSE: 146563.2
## MAE: 20618.69
## RMSLE: 0.05751715
## Mean Residual Deviance : 21480774857
## R^2 : 0.9533818
summary(rf)
## Model Details:
## ==============
##
## H2ORegressionModel: drf
## Model Key: DRF_model_R_1575930910406_2
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 50 50 655699 20
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 20 20.00000 877 1224 1038.24000
##
## H2ORegressionMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
##
## MSE: 6321015409
## RMSE: 79504.81
## MAE: 19316.91
## RMSLE: 0.06159496
## Mean Residual Deviance : 6321015409
## R^2 : 0.9848769
##
##
##
## H2ORegressionMetrics: drf
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 6925395930
## RMSE: 83218.96
## MAE: 19948.72
## RMSLE: 0.06362259
## Mean Residual Deviance : 6925395930
## R^2 : 0.9834309
##
##
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid
## mae 19948.72 783.6792 19904.775 21254.818
## mean_residual_deviance 6.925396E9 1.32036314E9 5.1548047E9 8.6435461E9
## mse 6.925396E9 1.32036314E9 5.1548047E9 8.6435461E9
## r2 0.98336756 0.0032172317 0.9874468 0.97951263
## residual_deviance 6.925396E9 1.32036314E9 5.1548047E9 8.6435461E9
## rmse 82426.24 8102.8013 71796.97 92970.67
## rmsle 0.06334415 0.0042043244 0.06040833 0.06799258
## cv_3_valid cv_4_valid cv_5_valid
## mae 20203.295 17921.762 20458.953
## mean_residual_deviance 9.21263E9 4.4630144E9 7.1529846E9
## mse 9.21263E9 4.4630144E9 7.1529846E9
## r2 0.9781843 0.9898994 0.98179466
## residual_deviance 9.21263E9 4.4630144E9 7.1529846E9
## rmse 95982.445 66805.8 84575.32
## rmsle 0.053334225 0.06505199 0.06993362
##
## Scoring History:
## timestamp duration number_of_trees training_rmse
## 1 2019-12-09 17:35:31 4.899 sec 0 NA
## 2 2019-12-09 17:35:31 4.931 sec 1 159178.42101
## 3 2019-12-09 17:35:31 4.958 sec 2 131413.82523
## 4 2019-12-09 17:35:31 4.990 sec 3 114932.15584
## 5 2019-12-09 17:35:31 5.019 sec 4 114006.43832
## training_mae training_deviance
## 1 NA NA
## 2 27712.46049 25337769715.05450
## 3 24840.05331 17269593461.08600
## 4 24849.21345 13209400445.23510
## 5 23899.69515 12997467978.91330
##
## ---
## timestamp duration number_of_trees training_rmse
## 46 2019-12-09 17:35:32 5.702 sec 45 79164.89838
## 47 2019-12-09 17:35:32 5.718 sec 46 79208.66212
## 48 2019-12-09 17:35:32 5.734 sec 47 78990.18627
## 49 2019-12-09 17:35:32 5.749 sec 48 78918.56105
## 50 2019-12-09 17:35:32 5.764 sec 49 79298.08070
## 51 2019-12-09 17:35:32 5.779 sec 50 79504.81375
## training_mae training_deviance
## 46 19215.32893 6267081135.49112
## 47 19193.52363 6274012155.13533
## 48 19162.30714 6239449527.08160
## 49 19086.95632 6228139277.74415
## 50 19374.80080 6288185603.14960
## 51 19316.91135 6321015408.70604
##
## Variable Importances: (Extract with `h2o.varimp`)
## =================================================
##
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 lastSoldPrice 60314599399882752.000000 1.000000 0.364607
## 2 zest_percentile 41767354362757120.000000 0.692492 0.252487
## 3 bathrooms 16725414399442944.000000 0.277303 0.101107
## 4 bedrooms 16646272110821376.000000 0.275991 0.100628
## 5 zest_monthlychange 13637810728730624.000000 0.226111 0.082442
## 6 finishedSqFt 13586902481371136.000000 0.225267 0.082134
## 7 lotSizeSqFt 1419359184486400.000000 0.023533 0.008580
## 8 compscore 1325963778457600.000000 0.021984 0.008016
gbm <- h2o.gbm(x = x, y = y, training_frame = train_h2o)
Summary of GBM Model:
h2o.performance(gbm, newdata = valid_h2o)
## H2ORegressionMetrics: gbm
##
## MSE: 7833179156
## RMSE: 88505.25
## MAE: 30298.04
## RMSLE: 0.07396035
## Mean Residual Deviance : 7833179156
## R^2 : 0.9806906
summary(gbm)
## Model Details:
## ==============
##
## H2ORegressionModel: gbm
## Model Key: GBM_model_R_1575930910406_3
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 50 50 19678 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 9 32 26.62000
##
## H2ORegressionMetrics: gbm
## ** Reported on training data. **
##
## MSE: 4634218128
## RMSE: 68075.09
## MAE: 27885.35
## RMSLE: 0.06843786
## Mean Residual Deviance : 4634218128
## R^2 : 0.9889126
##
##
##
##
##
## Scoring History:
## timestamp duration number_of_trees training_rmse
## 1 2019-12-09 17:35:33 0.008 sec 0 646506.33076
## 2 2019-12-09 17:35:33 0.097 sec 1 586284.50554
## 3 2019-12-09 17:35:33 0.116 sec 2 532849.96123
## 4 2019-12-09 17:35:33 0.134 sec 3 485256.66403
## 5 2019-12-09 17:35:33 0.151 sec 4 442673.67807
## training_mae training_deviance
## 1 448967.34444 417970435718.43402
## 2 406178.52896 343729521433.18701
## 3 367961.35604 283929081181.36298
## 4 333771.33120 235474029989.16101
## 5 302862.50200 195959985253.84900
##
## ---
## timestamp duration number_of_trees training_rmse
## 46 2019-12-09 17:35:34 0.767 sec 45 70579.90250
## 47 2019-12-09 17:35:34 0.773 sec 46 69816.20037
## 48 2019-12-09 17:35:34 0.778 sec 47 69520.07912
## 49 2019-12-09 17:35:34 0.783 sec 48 69035.93236
## 50 2019-12-09 17:35:34 0.788 sec 49 68696.22288
## 51 2019-12-09 17:35:34 0.793 sec 50 68075.09184
## training_mae training_deviance
## 46 29181.00324 4981522637.39805
## 47 28843.56035 4874301834.58284
## 48 28634.12138 4833041400.67046
## 49 28308.89244 4765959957.29330
## 50 28173.78544 4719171037.62527
## 51 27885.35004 4634218128.39442
##
## Variable Importances: (Extract with `h2o.varimp`)
## =================================================
##
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 lastSoldPrice 17210850930589696.000000 1.000000 0.764016
## 2 zest_percentile 3733240066080768.000000 0.216912 0.165724
## 3 zest_monthlychange 480851754221568.000000 0.027939 0.021346
## 4 bathrooms 421534866866176.000000 0.024492 0.018713
## 5 finishedSqFt 385460731904000.000000 0.022396 0.017111
## 6 bedrooms 141153043218432.000000 0.008201 0.006266
## 7 compscore 96304801775616.000000 0.005596 0.004275
## 8 lotSizeSqFt 57428733329408.000000 0.003337 0.002549
gbm2 <- h2o.gbm(
x = x,
y = y,
training_frame = train_h2o,
validation_frame = valid_h2o,
ntrees = 10000,
learn_rate=0.01,
stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "deviance",
sample_rate = 0.8,
col_sample_rate = 0.8,
seed = 1234,
score_tree_interval = 10
)
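The early-stopping arguments passed to h2o.gbm above (stopping_rounds = 5, stopping_tolerance = 1e-4, with scoring every 10 trees) ask H2O to stop adding trees once the validation deviance stops improving. The sketch below is a simplified stand-in for that idea: it compares the average of the last k scores with the average of the k scores before them. It is an illustration under that assumption, not H2O's exact rule, and `should_stop` is a hypothetical helper.

```r
# Hypothetical helper: TRUE when the relative improvement of the recent
# moving average over the previous one falls below `tol` (lower metric is better)
should_stop <- function(metric, k = 5, tol = 1e-4) {
  n <- length(metric)
  if (n < 2 * k) return(FALSE)  # not enough scoring events yet
  recent <- mean(metric[(n - k + 1):n])
  previous <- mean(metric[(n - 2 * k + 1):(n - k)])
  (previous - recent) / abs(previous) < tol
}

improving <- 100:91                          # still dropping at every scoring event
plateau <- c(100, 90, 80, 70, 60, rep(50, 10))  # flat for the last ten events

print(should_stop(improving))  # FALSE
print(should_stop(plateau))    # TRUE
```

With a low learning rate such as 0.01, a rule of this kind lets ntrees be set generously and leaves the effective number of trees to the validation data.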
Summary of GBM with Parameters Model:
h2o.performance(gbm2, newdata = valid_h2o)
## H2ORegressionMetrics: gbm
##
## MSE: 4462703024
## RMSE: 66803.47
## MAE: 13315.93
## RMSLE: 0.04603479
## Mean Residual Deviance : 4462703024
## R^2 : 0.9889991
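For readers who want to see how the reported metrics relate to one another, the following base R sketch computes MSE, RMSE, MAE, RMSLE and R² from vectors of actual and predicted prices. It uses the standard textbook definitions, which are assumed here to match what H2O reports; the helper `regression_metrics` and the four example prices are hypothetical.

```r
# Standard regression metrics from actual and predicted values (illustration only).
regression_metrics <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted),
            all(actual >= 0), all(predicted >= 0))  # RMSLE needs non-negative values
  err <- actual - predicted
  mse <- mean(err^2)
  c(
    MSE   = mse,
    RMSE  = sqrt(mse),
    MAE   = mean(abs(err)),
    RMSLE = sqrt(mean((log1p(predicted) - log1p(actual))^2)),
    R2    = 1 - sum(err^2) / sum((actual - mean(actual))^2)
  )
}

# Hypothetical prices, not taken from the Zillow data.
actual    <- c(500000, 750000, 1200000, 300000)
predicted <- c(520000, 700000, 1250000, 310000)
m <- regression_metrics(actual, predicted)
print(m)

# The RMSE reported above is the square root of the reported MSE.
sqrt(4462703024)
```

Because RMSLE works on log-transformed prices, it penalises relative rather than absolute errors, so a 50,000 miss on a 300,000 property counts for more than the same miss on a 1,200,000 property.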
summary(gbm2)
## Model Details:
## ==============
##
## H2ORegressionModel: gbm
## Model Key: GBM_model_R_1575930910406_4
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 7990 7990 2191678 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 6 32 17.00976
##
## H2ORegressionMetrics: gbm
## ** Reported on training data. **
##
## MSE: 464115103
## RMSE: 21543.33
## MAE: 8174.98
## RMSLE: 0.03078964
## Mean Residual Deviance : 464115103
## R^2 : 0.9988896
##
##
## H2ORegressionMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 4462703024
## RMSE: 66803.47
## MAE: 13315.93
## RMSLE: 0.04603479
## Mean Residual Deviance : 4462703024
## R^2 : 0.9889991
##
##
##
##
## Scoring History:
## timestamp duration number_of_trees training_rmse
## 1 2019-12-09 17:35:35 0.005 sec 0 646506.33076
## 2 2019-12-09 17:35:35 0.093 sec 10 589353.76774
## 3 2019-12-09 17:35:35 0.176 sec 20 538040.45504
## 4 2019-12-09 17:35:35 0.262 sec 30 491783.48845
## 5 2019-12-09 17:35:36 0.340 sec 40 449778.66465
## training_mae training_deviance validation_rmse validation_mae
## 1 448967.34444 417970435718.43402 637050.85923 450053.29183
## 2 408059.45012 347337863546.77899 580371.91810 409206.47595
## 3 371341.22583 289487531257.14899 529490.63470 372597.02335
## 4 338278.54163 241850999509.10001 483573.97607 339700.18971
## 5 308463.40420 202300847174.01501 442087.84020 310076.40503
## validation_deviance
## 1 405833797241.62299
## 2 336831563324.61298
## 3 280360332230.79102
## 4 233843790330.26300
## 5 195441658448.45801
##
## ---
## timestamp duration number_of_trees training_rmse
## 795 2019-12-09 17:36:35 1 min 0.101 sec 7940 21619.63196
## 796 2019-12-09 17:36:35 1 min 0.168 sec 7950 21604.29902
## 797 2019-12-09 17:36:35 1 min 0.233 sec 7960 21589.35840
## 798 2019-12-09 17:36:35 1 min 0.296 sec 7970 21568.35333
## 799 2019-12-09 17:36:36 1 min 0.365 sec 7980 21558.19991
## 800 2019-12-09 17:36:36 1 min 0.433 sec 7990 21543.33084
## training_mae training_deviance validation_rmse validation_mae
## 795 8209.12485 467408485.95397 66801.34474 13340.36944
## 796 8200.83489 466745736.20928 66805.07833 13335.90246
## 797 8193.25902 466100396.07704 66812.79875 13329.59628
## 798 8186.15375 465193865.33732 66809.72601 13321.35402
## 799 8181.86634 464755983.38129 66799.23369 13319.46667
## 800 8174.97979 464115103.48381 66803.46565 13315.92922
## validation_deviance
## 795 4462419658.85476
## 796 4462918490.36703
## 797 4463950076.67758
## 798 4463539489.09355
## 799 4462137621.82529
## 800 4462703023.50955
##
## Variable Importances: (Extract with `h2o.varimp`)
## =================================================
##
## Variable Importances:
## variable relative_importance scaled_importance
## 1 lastSoldPrice 132569525710225408.000000 1.000000
## 2 zest_percentile 45626773320237056.000000 0.344172
## 3 bathrooms 7427879113588736.000000 0.056030
## 4 zest_monthlychange 4609099593416704.000000 0.034767
## 5 finishedSqFt 2975650247868416.000000 0.022446
## 6 bedrooms 2120657550704640.000000 0.015997
## 7 compscore 1022267580481536.000000 0.007711
## 8 lotSizeSqFt 460563469565952.000000 0.003474
## percentage
## 1 0.673583
## 2 0.231829
## 3 0.037741
## 4 0.023419
## 5 0.015119
## 6 0.010775
## 7 0.005194
## 8 0.002340
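The `scaled_importance` and `percentage` columns in the table above can be reproduced from `relative_importance` alone: the first divides by the largest value, the second by the column total. The sketch below does this in base R with the values copied from the `summary(gbm2)` output; the data frame name `varimp` is only illustrative.

```r
# Relative importances copied from the summary(gbm2) output above.
varimp <- data.frame(
  variable = c("lastSoldPrice", "zest_percentile", "bathrooms",
               "zest_monthlychange", "finishedSqFt", "bedrooms",
               "compscore", "lotSizeSqFt"),
  relative_importance = c(132569525710225408, 45626773320237056,
                          7427879113588736, 4609099593416704,
                          2975650247868416, 2120657550704640,
                          1022267580481536, 460563469565952)
)

# Scale against the most important variable, then against the total.
varimp$scaled_importance <- varimp$relative_importance / max(varimp$relative_importance)
varimp$percentage        <- varimp$relative_importance / sum(varimp$relative_importance)

print(varimp[, c("variable", "scaled_importance", "percentage")], digits = 4)

# Combined share of the two most important variables.
top2 <- sum(varimp$percentage[1:2])
top2
```

Read this way, `lastSoldPrice` and `zest_percentile` together account for roughly 90% of the total importance in the tuned GBM.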
DeepLearning: Model 1
m1 <- h2o.deeplearning(
model_id = "dl_model_first",
x = x,
y = y,
training_frame = train_h2o,
validation_frame = valid_h2o,
epochs = 10
)
Summary of Simple Deep Learning model:
h2o.performance(m1, newdata = valid_h2o)
## H2ORegressionMetrics: deeplearning
##
## MSE: 161921026841
## RMSE: 402394.1
## MAE: 122740.1
## RMSLE: 0.1789928
## Mean Residual Deviance : 161921026841
summary(m1)
## Model Details:
## ==============
##
## H2ORegressionModel: deeplearning
## Model Key: dl_model_first
## Status of Neuron Layers: predicting zestimate, regression, gaussian distribution, Quadratic loss, 42,201 weights/biases, 503.8 KB, 103,550 training samples, mini-batch size 1
## layer units type dropout l1 l2 mean_rate rate_rms
## 1 1 8 Input 0.00 % NA NA NA NA
## 2 2 200 Rectifier 0.00 % 0.000000 0.000000 0.014919 0.030916
## 3 3 200 Rectifier 0.00 % 0.000000 0.000000 0.074538 0.152987
## 4 4 1 Linear NA 0.000000 0.000000 0.005602 0.069532
## momentum mean_weight weight_rms mean_bias bias_rms
## 1 NA NA NA NA NA
## 2 0.000000 0.005605 0.105590 0.482493 0.018734
## 3 0.000000 -0.011055 0.072601 0.981213 0.012322
## 4 0.000000 0.009243 0.081288 -0.001783 0.000000
##
## H2ORegressionMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 9960 samples **
##
## MSE: 58737517116
## RMSE: 242358.2
## MAE: 117928.8
## RMSLE: 0.1654791
## Mean Residual Deviance : 58737517116
##
##
## H2ORegressionMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on full validation frame **
##
## MSE: 161921026841
## RMSE: 402394.1
## MAE: 122740.1
## RMSLE: 0.1789928
## Mean Residual Deviance : 161921026841
##
##
##
##
## Scoring History:
## timestamp duration training_speed epochs iterations
## 1 2019-12-09 17:36:46 0.000 sec NA 0.00000 0
## 2 2019-12-09 17:36:47 1.500 sec 9270 obs/sec 1.00000 1
## 3 2019-12-09 17:36:51 5.179 sec 22379 obs/sec 10.00000 10
## 4 2019-12-09 17:36:51 5.348 sec 22355 obs/sec 10.00000 10
## samples training_rmse training_deviance training_mae training_r2
## 1 0.000000 NA NA NA NA
## 2 10355.000000 242358.24128 58737517116.42230 117928.77556 0.86023
## 3 103550.000000 156107.63355 24369593252.66580 66020.08124 0.94201
## 4 103550.000000 242358.24128 58737517116.42230 117928.77556 0.86023
## validation_rmse validation_deviance validation_mae validation_r2
## 1 NA NA NA NA
## 2 402394.11880 161921026841.07401 122740.09528 0.60085
## 3 518178.87052 268509341855.56799 76498.22400 0.33810
## 4 402394.11880 161921026841.07401 122740.09528 0.60085
##
## Variable Importances: (Extract with `h2o.varimp`)
## =================================================
##
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 lastSoldPrice 1.000000 1.000000 0.138576
## 2 bedrooms 0.936130 0.936130 0.129725
## 3 zest_percentile 0.927775 0.927775 0.128568
## 4 finishedSqFt 0.908388 0.908388 0.125881
## 5 zest_monthlychange 0.871765 0.871765 0.120806
## 6 lotSizeSqFt 0.865988 0.865988 0.120005
## 7 compscore 0.860029 0.860029 0.119180
## 8 bathrooms 0.846168 0.846168 0.117259
DeepLearning: Model 2
m2 <- h2o.deeplearning(
model_id = "dl_model_faster",
x = x,
y = y,
training_frame = train_h2o,
validation_frame = valid_h2o,
hidden = c(32,32,32),
epochs = 1000000,
score_validation_samples = 10000,
stopping_metric = "deviance",
stopping_rounds = 2,
stopping_tolerance = 0.01
)
Summary of Deep Learning model with parameters:
h2o.performance(m2, newdata = valid_h2o)
## H2ORegressionMetrics: deeplearning
##
## MSE: 5566387605
## RMSE: 74608.23
## MAE: 34382.73
## RMSLE: 0.06439333
## Mean Residual Deviance : 5566387605
summary(m2)
## Model Details:
## ==============
##
## H2ORegressionModel: deeplearning
## Model Key: dl_model_faster
## Status of Neuron Layers: predicting zestimate, regression, gaussian distribution, Quadratic loss, 2,433 weights/biases, 34.8 KB, 5,700,307 training samples, mini-batch size 1
## layer units type dropout l1 l2 mean_rate rate_rms
## 1 1 8 Input 0.00 % NA NA NA NA
## 2 2 32 Rectifier 0.00 % 0.000000 0.000000 0.014140 0.040262
## 3 3 32 Rectifier 0.00 % 0.000000 0.000000 0.008040 0.009206
## 4 4 32 Rectifier 0.00 % 0.000000 0.000000 0.011672 0.031162
## 5 5 1 Linear NA 0.000000 0.000000 0.002173 0.002203
## momentum mean_weight weight_rms mean_bias bias_rms
## 1 NA NA NA NA NA
## 2 0.000000 -0.054549 0.572098 0.429086 0.260124
## 3 0.000000 -0.028055 0.338395 1.065636 0.176645
## 4 0.000000 -0.061278 0.389261 0.979380 0.184730
## 5 0.000000 0.100799 0.508021 -0.115427 0.000000
##
## H2ORegressionMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 10039 samples **
##
## MSE: 4893882614
## RMSE: 69956.29
## MAE: 31373.65
## RMSLE: 0.05667772
## Mean Residual Deviance : 4893882614
##
##
## H2ORegressionMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on full validation frame **
##
## MSE: 5566387605
## RMSE: 74608.23
## MAE: 34382.73
## RMSLE: 0.06439333
## Mean Residual Deviance : 5566387605
##
##
##
##
## Scoring History:
## timestamp duration training_speed epochs iterations
## 1 2019-12-09 17:36:52 0.000 sec NA 0.00000 0
## 2 2019-12-09 17:36:53 1.186 sec 90038 obs/sec 9.65167 1
## 3 2019-12-09 17:36:58 6.362 sec 111730 obs/sec 67.57779 7
## 4 2019-12-09 17:37:04 11.688 sec 103754 obs/sec 115.86818 12
## 5 2019-12-09 17:37:09 16.740 sec 102456 obs/sec 164.15838 17
## 6 2019-12-09 17:37:14 22.623 sec 102464 obs/sec 222.12738 23
## 7 2019-12-09 17:37:20 28.207 sec 103546 obs/sec 280.08054 29
## 8 2019-12-09 17:37:26 33.722 sec 104483 obs/sec 338.01989 35
## 9 2019-12-09 17:37:31 39.298 sec 104994 obs/sec 395.96736 41
## 10 2019-12-09 17:37:37 45.247 sec 104508 obs/sec 453.90497 47
## 11 2019-12-09 17:37:43 51.131 sec 102317 obs/sec 502.22231 52
## 12 2019-12-09 17:37:49 56.834 sec 100893 obs/sec 550.48836 57
## 13 2019-12-09 17:37:49 56.873 sec 100883 obs/sec 550.48836 57
## samples training_rmse training_deviance training_mae training_r2
## 1 0.000000 NA NA NA NA
## 2 99943.000000 185013.12360 34229855902.43820 75405.92330 0.91863
## 3 699768.000000 143728.47792 20657875364.59090 52210.24068 0.95089
## 4 1199815.000000 89761.45262 8057118375.63268 36729.50837 0.98085
## 5 1699860.000000 85179.68455 7255578660.61776 34267.42185 0.98275
## 6 2300129.000000 67846.06521 4603088564.02884 33040.14791 0.98906
## 7 2900234.000000 86428.87419 7469950293.73827 36664.20307 0.98224
## 8 3500196.000000 61222.23228 3748161725.10958 26510.60201 0.99109
## 9 4100242.000000 69956.29074 4893882613.93122 31373.64533 0.98837
## 10 4700186.000000 69353.93636 4809968488.58036 26940.31179 0.98857
## 11 5200512.000000 56987.28425 3247550566.56815 24026.14405 0.99228
## 12 5700307.000000 61551.59809 3788599226.88589 23699.54617 0.99099
## 13 5700307.000000 69956.29074 4893882613.93122 31373.64533 0.98837
## validation_rmse validation_deviance validation_mae validation_r2
## 1 NA NA NA NA
## 2 511520.83126 261653560817.42599 83725.78250 0.35500
## 3 451171.70759 203555909732.73499 58419.87902 0.49822
## 4 343676.32588 118113416970.67000 45551.34609 0.70884
## 5 168364.30640 28346539670.51930 37317.98800 0.93012
## 6 222894.87990 49682127485.49180 38898.44966 0.87753
## 7 232436.16148 54026569165.57670 44597.31003 0.86682
## 8 132835.74274 17645334548.68890 31750.76144 0.95650
## 9 74608.22746 5566387604.86988 34382.72921 0.98628
## 10 129563.94402 16786815590.53130 30931.32585 0.95862
## 11 245590.68768 60314785875.98910 32127.53658 0.85132
## 12 158810.17267 25220670945.03330 29481.44544 0.93783
## 13 74608.22746 5566387604.86988 34382.72921 0.98628
##
## Variable Importances: (Extract with `h2o.varimp`)
## =================================================
##
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 finishedSqFt 1.000000 1.000000 0.302375
## 2 zest_monthlychange 0.487788 0.487788 0.147495
## 3 lastSoldPrice 0.382055 0.382055 0.115524
## 4 bedrooms 0.377211 0.377211 0.114059
## 5 zest_percentile 0.367212 0.367212 0.111036
## 6 bathrooms 0.332577 0.332577 0.100563
## 7 lotSizeSqFt 0.312363 0.312363 0.094451
## 8 compscore 0.047940 0.047940 0.014496
DeepLearning: Model 3
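m3 <- h2o.deeplearning(
model_id = "dl_model_tuned",
x = x,
y = y,
training_frame = train_h2o,
validation_frame = valid_h2o,
# keep the model from the final iteration rather than the best-scoring one
overwrite_with_best_model = FALSE,
hidden = c(50,50,50),
epochs = 10,
# score on a small validation sample and limit time spent on scoring
score_validation_samples = 100,
score_duty_cycle = 0.025,
# turn off the adaptive learning rate so the manual rate and momentum settings apply
adaptive_rate = FALSE,
rate = 0.01,
rate_annealing = 2e-6,
momentum_start = 0.2,
momentum_stable = 0.4,
momentum_ramp = 1e7,
# L1/L2 regularisation and a cap on the squared sum of incoming weights per neuron
l1 = 1e-5,
l2 = 1e-5,
max_w2 = 10
)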
Summary of Deep Learning model with different parameters:
h2o.performance(m3, newdata = valid_h2o)
## H2ORegressionMetrics: deeplearning
##
## MSE: 77240460743
## RMSE: 277921.7
## MAE: 120818
## RMSLE: 0.1825201
## Mean Residual Deviance : 77240460743
summary(m3)
## Model Details:
## ==============
##
## H2ORegressionModel: deeplearning
## Model Key: dl_model_tuned
## Status of Neuron Layers: predicting zestimate, regression, gaussian distribution, Quadratic loss, 5,601 weights/biases, 50.1 KB, 103,550 training samples, mini-batch size 1
## layer units type dropout l1 l2 mean_rate rate_rms
## 1 1 8 Input 0.00 % NA NA NA NA
## 2 2 50 Rectifier 0.00 % 0.000010 0.000010 0.008284 0.000000
## 3 3 50 Rectifier 0.00 % 0.000010 0.000010 0.008284 0.000000
## 4 4 50 Rectifier 0.00 % 0.000010 0.000010 0.008284 0.000000
## 5 5 1 Linear NA 0.000010 0.000010 0.008284 0.000000
## momentum mean_weight weight_rms mean_bias bias_rms
## 1 NA NA NA NA NA
## 2 0.202071 0.269970 0.870794 -9.678688 4.102140
## 3 0.202071 -0.296590 0.255310 -0.502205 1.739980
## 4 0.202071 -0.187645 0.245874 0.509115 0.996306
## 5 0.202071 -0.124947 0.413147 3.684738 0.000000
##
## H2ORegressionMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 10001 samples **
##
## MSE: 76210131771
## RMSE: 276061.8
## MAE: 119784.4
## RMSLE: 0.1791665
## Mean Residual Deviance : 76210131771
##
##
## H2ORegressionMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on temporary validation frame with 109 samples **
##
## MSE: 140049267268
## RMSE: 374231.6
## MAE: 170786.5
## RMSLE: 0.2173632
## Mean Residual Deviance : 140049267268
##
##
##
##
## Scoring History:
## timestamp duration training_speed epochs iterations
## 1 2019-12-09 17:37:50 0.000 sec NA 0.00000 0
## 2 2019-12-09 17:37:51 0.959 sec 12959 obs/sec 1.00000 1
## 3 2019-12-09 17:37:52 1.887 sec 62266 obs/sec 10.00000 10
## samples training_rmse training_deviance training_mae training_r2
## 1 0.000000 NA NA NA NA
## 2 10355.000000 340140.80866 115695769713.66499 221958.34388 0.72388
## 3 103550.000000 276061.82599 76210131771.44321 119784.40116 0.81812
## validation_rmse validation_deviance validation_mae validation_r2
## 1 NA NA NA NA
## 2 324890.84750 105554062792.26601 235233.00513 0.73595
## 3 374231.56904 140049267267.82001 170786.54067 0.64965
##
## Variable Importances: (Extract with `h2o.varimp`)
## =================================================
##
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 bedrooms 1.000000 1.000000 0.295851
## 2 bathrooms 0.889465 0.889465 0.263149
## 3 zest_percentile 0.644289 0.644289 0.190614
## 4 compscore 0.322556 0.322556 0.095429
## 5 lotSizeSqFt 0.177922 0.177922 0.052638
## 6 lastSoldPrice 0.154741 0.154741 0.045780
## 7 zest_monthlychange 0.117374 0.117374 0.034725
## 8 finishedSqFt 0.073730 0.073730 0.021813
hyper_params <- list(
hidden=list(c(20,20),c(50,50),c(30,30,30),c(25,25,25,25)),
input_dropout_ratio=c(0,0.05),
l1=seq(0,1e-4,1e-6),
l2=seq(0,1e-4,1e-6)
)
search_criteria <- list(
strategy = "RandomDiscrete",
max_runtime_secs = 360,
max_models = 100,
seed = 1234567,
stopping_rounds = 5,
stopping_tolerance = 1e-2
)
dl_random_grid <- h2o.grid(
algorithm="deeplearning",
grid_id = "dl_grid_random",
training_frame= train_h2o,
validation_frame= valid_h2o,
x=x,
y=y,
epochs=1,
stopping_metric="deviance",
stopping_tolerance=1e-2,
stopping_rounds=2,
score_validation_samples=10000,
score_duty_cycle=0.025,
max_w2=10,
hyper_params = hyper_params,
search_criteria = search_criteria
)
grid <- h2o.getGrid("dl_grid_random",sort_by="rmsle",decreasing=FALSE)
Summary of Hyper-Parameter Grid Search: Deep Learning
grid@summary_table[1,]
## Hyper-Parameter Search Summary: ordered by increasing rmsle
## hidden input_dropout_ratio l1 l2
## 1 [25, 25, 25, 25] 0.0 2.9E-5 8.8E-5
## model_ids rmsle
## 1 dl_grid_random_model_35 0.15415964403170093
best_model <- h2o.getModel(grid@model_ids[[1]])
best_model
## Model Details:
## ==============
##
## H2ORegressionModel: deeplearning
## Model ID: dl_grid_random_model_35
## Status of Neuron Layers: predicting zestimate, regression, gaussian distribution, Quadratic loss, 2,201 weights/biases, 32.8 KB, 11,425 training samples, mini-batch size 1
## layer units type dropout l1 l2 mean_rate rate_rms
## 1 1 8 Input 0.00 % NA NA NA NA
## 2 2 25 Rectifier 0.00 % 0.000029 0.000088 0.004732 0.011194
## 3 3 25 Rectifier 0.00 % 0.000029 0.000088 0.002700 0.002512
## 4 4 25 Rectifier 0.00 % 0.000029 0.000088 0.042374 0.198491
## 5 5 25 Rectifier 0.00 % 0.000029 0.000088 0.106371 0.299764
## 6 6 1 Linear NA 0.000029 0.000088 0.040987 0.198701
## momentum mean_weight weight_rms mean_bias bias_rms
## 1 NA NA NA NA NA
## 2 0.000000 0.006270 0.248785 0.493907 0.040107
## 3 0.000000 0.002600 0.192155 1.008167 0.023445
## 4 0.000000 -0.002637 0.203919 0.997518 0.016124
## 5 0.000000 0.004498 0.204662 0.998672 0.005301
## 6 0.000000 0.087561 0.297753 -0.001093 0.000000
##
##
## H2ORegressionMetrics: deeplearning
## ** Reported on training data. **
## ** Metrics reported on temporary training frame with 9986 samples **
##
## MSE: 43213892971
## RMSE: 207879.5
## MAE: 94494.28
## RMSLE: 0.1419171
## Mean Residual Deviance : 43213892971
##
##
## H2ORegressionMetrics: deeplearning
## ** Reported on validation data. **
## ** Metrics reported on full validation frame **
##
## MSE: 134423633006
## RMSE: 366638.3
## MAE: 100310.1
## RMSLE: 0.1541596
## Mean Residual Deviance : 134423633006
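Inference: From the above output, we can infer that, among the models trained in the random grid search, the deep learning model with four hidden layers of 25 units each ([25, 25, 25, 25]) performs best, with the lowest validation RMSLE (0.154). Since each grid candidate is trained for only one epoch, this ranks the candidate architectures rather than producing a fully trained model.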
results <- data.frame(Model_Name = c("GLM", "Random Forest", "GBM",
"GBM with Parameters", "Deep Learning 1",
"Deep Learning 2", "Deep Learning 3"),
RSquare = c(0.760, 0.953, 0.980, 0.988, 0.457, 0.975, -0.013),
RMSE = c(332260.8, 146563.2, 88505.25, 66803.47, 468974.5, 100355, 636930.5),
RMSLE = c(0.248, 0.057, 0.073, 0.046, 0.207, 0.060, 0.489))
library(kableExtra)
kable(results) %>%
kable_styling(bootstrap_options = "striped") %>%
row_spec(4, bold = TRUE, background = "#baeeb9")
| Model_Name | RSquare | RMSE | RMSLE |
|---|---|---|---|
| GLM | 0.760 | 332260.80 | 0.248 |
| Random Forest | 0.953 | 146563.20 | 0.057 |
| GBM | 0.980 | 88505.25 | 0.073 |
| GBM with Parameters | 0.988 | 66803.47 | 0.046 |
| Deep Learning 1 | 0.457 | 468974.50 | 0.207 |
| Deep Learning 2 | 0.975 | 100355.00 | 0.060 |
| Deep Learning 3 | -0.013 | 636930.50 | 0.489 |
Comparing the results from all of the above models, we find that the best predictions come from the model “GBM with Parameters”. On the validation data it reaches R2 = 0.989, RMSE = 66,803 and RMSLE = 0.046, which is better than any of the other models used for predicting the property prices. The corresponding values on the training data are R2 = 0.999, RMSE = 21,543 and RMSLE = 0.031.
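The choice of the best model can also be read off the comparison table programmatically. The sketch below rebuilds the table with the values given above (the last column is named `RMSLE` here) and picks the winner under each metric in base R; the variable names `best_by_rmse`, `best_by_rmsle`, `best_by_r2` and `ranked` are illustrative.

```r
# Validation metrics copied from the model comparison table above.
results <- data.frame(
  Model_Name = c("GLM", "Random Forest", "GBM", "GBM with Parameters",
                 "Deep Learning 1", "Deep Learning 2", "Deep Learning 3"),
  RSquare = c(0.760, 0.953, 0.980, 0.988, 0.457, 0.975, -0.013),
  RMSE    = c(332260.8, 146563.2, 88505.25, 66803.47, 468974.5, 100355, 636930.5),
  RMSLE   = c(0.248, 0.057, 0.073, 0.046, 0.207, 0.060, 0.489),
  stringsAsFactors = FALSE
)

# Lower is better for RMSE and RMSLE, higher is better for R-squared.
best_by_rmse  <- results$Model_Name[which.min(results$RMSE)]
best_by_rmsle <- results$Model_Name[which.min(results$RMSLE)]
best_by_r2    <- results$Model_Name[which.max(results$RSquare)]

# Full ranking by validation RMSE, best model first.
ranked <- results[order(results$RMSE), ]
print(ranked, row.names = FALSE)
```

All three criteria select the same model, so the conclusion does not depend on which metric is used for the comparison.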
h2o.shutdown(prompt = FALSE)
## [1] TRUE