Calvin Wong
Vijaya Cherukuri
Guang Qiu
Juanelle Marks
Real estate is one of the most important sectors in the economy and is often the largest source of wealth and savings for many families. The affordability of real estate, along with changes in property prices, has a direct impact on the wealth of the general population. One of the fundamental activities in housing-market valuation is the comparative market analysis, in which the prices of similar properties in an area are examined. Since no two properties are identical, sellers adjust prices for categorical differences between properties to determine fair market value.
This project seeks to provide a better understanding of the relationship between certain characteristics of housing and actual home prices. By developing our own analysis of the housing market, we hope to determine how cohesive our analysis is and whether we can offer improvements to current Zestimate algorithms.
Our research will reference Zillow’s ‘Zestimate’ home valuation tool, which was released 12 years ago. Zestimate estimates home values using statistical and machine learning models that analyze hundreds of data points on each property. It has become one of the largest and most trusted sources of real estate pricing information in the U.S. We plan to obtain and use datasets from a Kaggle competition in which participants were asked to improve the algorithms that drive Zestimate (datasets are available at: https://www.kaggle.com/c/zillow-prize-1/data).
#install.packages('psych')
#install.packages('leaflet')
#install.packages('ggmap')
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(dplyr)
library(shiny)
library(leaflet)
library(ggmap)
## Google Maps API Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it: see citation("ggmap") for details.
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
##
## alpha, rescale
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:ggmap':
##
## inset
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(bit64)
## Loading required package: bit
## Attaching package bit
## package:bit (c) 2008-2012 Jens Oehlschlaegel (GPL-2)
## creators: bit bitwhich
## coercion: as.logical as.integer as.bit as.bitwhich which
## operator: ! & | xor != ==
## querying: print length any all min max range sum summary
## bit access: length<- [ [<- [[ [[<-
## for more help type ?bit
##
## Attaching package: 'bit'
## The following object is masked from 'package:data.table':
##
## setattr
## The following object is masked from 'package:psych':
##
## keysort
## The following object is masked from 'package:base':
##
## xor
## Attaching package bit64
## package:bit64 (c) 2011-2012 Jens Oehlschlaegel
## creators: integer64 seq :
## coercion: as.integer64 as.vector as.logical as.integer as.double as.character as.bin
## logical operator: ! & | xor != == < <= >= >
## arithmetic operator: + - * / %/% %% ^
## math: sign abs sqrt log log2 log10
## math: floor ceiling trunc round
## querying: is.integer64 is.vector [is.atomic} [length] format print str
## values: is.na is.nan is.finite is.infinite
## aggregation: any all min max range sum prod
## cumulation: diff cummin cummax cumsum cumprod
## access: length<- [ [<- [[ [[<-
## combine: c rep cbind rbind as.data.frame
## WARNING don't use as subscripts
## WARNING semantics differ from integer
## for more help type ?bit64
##
## Attaching package: 'bit64'
## The following object is masked from 'package:bit':
##
## still.identical
## The following objects are masked from 'package:base':
##
## :, %in%, is.double, match, order, rank
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following object is masked from 'package:base':
##
## date
library(corrplot)
## corrplot 0.84 loaded
library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:lubridate':
##
## day, hour, month, week, year
## The following object is masked from 'package:bit64':
##
## %in%
## The following objects are masked from 'package:data.table':
##
## hour, month, week, year
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
library(lime)
##
## Attaching package: 'lime'
## The following object is masked from 'package:dplyr':
##
## explain
library(mice)
## Loading required package: lattice
##
## Attaching package: 'mice'
## The following object is masked from 'package:tidyr':
##
## complete
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(corrplot)
knitr::opts_chunk$set(error = TRUE)
#Download the dataset (Properties_2016.csv) from the Kaggle link (https://www.kaggle.com/c/zillow-prize-1/data) to a local drive and read the file.
#Note: due to size limitations on GitHub, we are using the dataset downloaded to a local drive
##properties <- read.csv(file="/Users/juanelle/Desktop/MSDS/Data607/final project/Project Proposal/FinalProject_House.csv", row.names = NULL)
properties <- read.csv(file="/Users/cwong79/Desktop/FinalProject_House_properties.csv", row.names = NULL)
download.file("https://www.dropbox.com/s/ne9l87yrzykr85v/Final_house.db?raw=1",
"FinalProject_House.sqlite" )
#load library
#install.packages('RSQLite')
library(RSQLite)
sqlite <- dbDriver("SQLite")
conn <- dbConnect(sqlite,"FinalProject_House.sqlite")
# Show all tables available in the SQLite database
alltables = dbListTables(conn)
alltables
## [1] "AirConditioningTypeID" "ArchitecturalStyleTypeID"
## [3] "BuildingClassTypeID" "Column_Dictionary"
## [5] "HeatingOrSystemTypeID" "PropertyLandUseTypeID"
## [7] "StoryTypeID" "TypeConstructionTypeID"
## [9] "train_2017"
str(properties)
## 'data.frame': 2985217 obs. of 58 variables:
## $ parcelid : int 10754147 10759547 10843547 10859147 10879947 10898347 10933547 10940747 10954547 10976347 ...
## $ airconditioningtypeid : int NA NA NA NA NA NA NA NA NA NA ...
## $ architecturalstyletypeid : int NA NA NA NA NA NA NA NA NA NA ...
## $ basementsqft : int NA NA NA NA NA NA NA NA NA NA ...
## $ bathroomcnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ bedroomcnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ buildingclasstypeid : int NA NA 5 3 4 4 NA NA NA 3 ...
## $ buildingqualitytypeid : int NA NA NA 6 NA 4 NA NA NA 4 ...
## $ calculatedbathnbr : num NA NA NA NA NA NA NA NA NA NA ...
## $ decktypeid : int NA NA NA NA NA NA NA NA NA NA ...
## $ finishedfloor1squarefeet : int NA NA NA NA NA NA NA NA NA NA ...
## $ calculatedfinishedsquarefeet: num NA NA 73026 5068 1776 ...
## $ finishedsquarefeet12 : int NA NA NA NA NA NA NA NA NA NA ...
## $ finishedsquarefeet13 : int NA NA NA NA NA NA NA NA NA NA ...
## $ finishedsquarefeet15 : int NA NA 73026 5068 1776 2400 NA 3611 NA 3754 ...
## $ finishedsquarefeet50 : int NA NA NA NA NA NA NA NA NA NA ...
## $ finishedsquarefeet6 : int NA NA NA NA NA NA NA NA NA NA ...
## $ fips : int 6037 6037 6037 6037 6037 6037 6037 6037 6037 6037 ...
## $ fireplacecnt : int NA NA NA NA NA NA NA NA NA NA ...
## $ fullbathcnt : int NA NA NA NA NA NA NA NA NA NA ...
## $ garagecarcnt : int NA NA NA NA NA NA NA NA NA NA ...
## $ garagetotalsqft : int NA NA NA NA NA NA NA NA NA NA ...
## $ hashottuborspa : Factor w/ 2 levels "","true": 1 1 1 1 1 1 1 1 1 1 ...
## $ heatingorsystemtypeid : int NA NA NA NA NA NA NA NA NA NA ...
## $ latitude : int 34144442 34140430 33989359 34148863 34194168 34171873 34131929 34171345 34218210 34289776 ...
## $ longitude : int -118654084 -118625364 -118394633 -118437206 -118385816 -118380906 -118351474 -118314900 -118331311 -118432085 ...
## $ lotsizesquarefeet : num 85768 4083 63085 7521 8512 ...
## $ poolcnt : int NA NA NA NA NA NA NA NA NA NA ...
## $ poolsizesum : int NA NA NA NA NA NA NA NA NA NA ...
## $ pooltypeid10 : int NA NA NA NA NA NA NA NA NA NA ...
## $ pooltypeid2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ pooltypeid7 : int NA NA NA NA NA NA NA NA NA NA ...
## $ propertycountylandusecode : Factor w/ 235 levels "","0","010","0100",..: 14 12 150 150 162 162 22 162 14 162 ...
## $ propertylandusetypeid : int 269 261 47 47 31 31 260 31 269 31 ...
## $ propertyzoningdesc : Factor w/ 5652 levels "","**AHRP","#12",..: 1 2166 1821 1821 1843 1825 1821 574 598 4819 ...
## $ rawcensustractandblock : num 60378002 60378001 60377030 60371412 60371232 ...
## $ regionidcity : int 37688 37688 51617 12447 12447 12447 12447 396054 396054 47547 ...
## $ regionidcounty : int 3101 3101 3101 3101 3101 3101 3101 3101 3101 3101 ...
## $ regionidneighborhood : int NA NA NA 27080 46795 46795 274049 NA NA NA ...
## $ regionidzip : int 96337 96337 96095 96424 96450 96446 96049 96434 96436 96366 ...
## $ roomcnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ storytypeid : int NA NA NA NA NA NA NA NA NA NA ...
## $ threequarterbathnbr : int NA NA NA NA NA NA NA NA NA NA ...
## $ typeconstructiontypeid : int NA NA NA NA NA NA NA NA NA NA ...
## $ unitcnt : int NA NA 2 NA 1 NA NA NA NA NA ...
## $ yardbuildingsqft17 : int NA NA NA NA NA NA NA NA NA NA ...
## $ yardbuildingsqft26 : int NA NA NA NA NA NA NA NA NA NA ...
## $ yearbuilt : num NA NA 1959 1948 1947 ...
## $ numberofstories : int NA NA 1 1 1 1 NA 1 NA 1 ...
## $ fireplaceflag : Factor w/ 2 levels "","true": 1 1 1 1 1 1 1 1 1 1 ...
## $ structuretaxvaluedollarcnt : num NA NA 660680 580059 196751 ...
## $ taxvaluedollarcnt : num 9 27516 1434941 1174475 440101 ...
## $ assessmentyear : int 2016 2015 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ landtaxvaluedollarcnt : num 9 27516 774261 594416 243350 ...
## $ taxamount : num NA NA 20800 14558 5725 ...
## $ taxdelinquencyflag : Factor w/ 2 levels "","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ taxdelinquencyyear : int NA NA NA NA NA NA NA NA NA NA ...
## $ censustractandblock : num NA NA NA NA NA NA NA NA NA NA ...
There are two types of missing data:

MCAR: missing completely at random. This is the desirable scenario in case of missing data.

MNAR: missing not at random. This is a more serious issue; in this case it might be wise to check the data gathering process further and try to understand why the information is missing. For instance, if most of the people in a survey did not answer a certain question, why did they do that? Was the question unclear?

Even assuming data is MCAR, too much missing data can be a problem. A safe maximum threshold is usually 5% of the total for large datasets. If more than 5% of the data is missing for a certain feature or sample, you should probably leave that feature or sample out. We therefore check for features (columns) and samples (rows) where more than 5% of the data is missing, using a simple function:
pMiss <- function(x){sum(is.na(x))/length(x)*100}  # percent of values missing in a vector
missing.bycol <- apply(properties,2,pMiss)  # percent missing per column (feature)
missing.byrow <- apply(properties,1,pMiss)  # percent missing per row (sample)
missdata.df <- as.data.frame(missing.bycol)
setDT(missdata.df, keep.rownames = TRUE)
names(missdata.df) <- c('Col_Names', 'pct_missing')
# bar chart of percent missing by column, sorted
g<-ggplot(data = missdata.df , aes(x= reorder(Col_Names, pct_missing), y=pct_missing)) + geom_bar(stat = "identity",aes(fill = pct_missing), position = position_stack(reverse= TRUE)) + coord_flip()
g
#head(missdata.df)
missdata.df20 <- missdata.df %>% filter (pct_missing>=20)
g1<-ggplot(data = missdata.df20 , aes(x= reorder(Col_Names, pct_missing), y=pct_missing)) + geom_bar(stat = "identity",aes(fill = pct_missing), position = position_stack(reverse= TRUE)) + coord_flip()
g1
Variables with missing values in more than 20% of samples may be removed from the data; this is sometimes called the "80% rule".
mis_prop <- sapply(properties, function(x) sum(is.na(x))/length(x))
var_rm <- names(mis_prop)[mis_prop > 1 - 0.8]
var_rm
## [1] "airconditioningtypeid" "architecturalstyletypeid"
## [3] "basementsqft" "buildingclasstypeid"
## [5] "buildingqualitytypeid" "decktypeid"
## [7] "finishedfloor1squarefeet" "finishedsquarefeet13"
## [9] "finishedsquarefeet15" "finishedsquarefeet50"
## [11] "finishedsquarefeet6" "fireplacecnt"
## [13] "garagecarcnt" "garagetotalsqft"
## [15] "heatingorsystemtypeid" "poolcnt"
## [17] "poolsizesum" "pooltypeid10"
## [19] "pooltypeid2" "pooltypeid7"
## [21] "regionidneighborhood" "storytypeid"
## [23] "threequarterbathnbr" "typeconstructiontypeid"
## [25] "unitcnt" "yardbuildingsqft17"
## [27] "yardbuildingsqft26" "numberofstories"
## [29] "taxdelinquencyyear"
After removing variables with more than 20% missing values, the dataset had 29 variables remaining. The rest of the analysis in this project focuses on these variables.
df_rm_na <- properties[, !colnames(properties) %in% var_rm]
dim(df_rm_na)
## [1] 2985217 29
str(df_rm_na)
## 'data.frame': 2985217 obs. of 29 variables:
## $ parcelid : int 10754147 10759547 10843547 10859147 10879947 10898347 10933547 10940747 10954547 10976347 ...
## $ bathroomcnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ bedroomcnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ calculatedbathnbr : num NA NA NA NA NA NA NA NA NA NA ...
## $ calculatedfinishedsquarefeet: num NA NA 73026 5068 1776 ...
## $ finishedsquarefeet12 : int NA NA NA NA NA NA NA NA NA NA ...
## $ fips : int 6037 6037 6037 6037 6037 6037 6037 6037 6037 6037 ...
## $ fullbathcnt : int NA NA NA NA NA NA NA NA NA NA ...
## $ hashottuborspa : Factor w/ 2 levels "","true": 1 1 1 1 1 1 1 1 1 1 ...
## $ latitude : int 34144442 34140430 33989359 34148863 34194168 34171873 34131929 34171345 34218210 34289776 ...
## $ longitude : int -118654084 -118625364 -118394633 -118437206 -118385816 -118380906 -118351474 -118314900 -118331311 -118432085 ...
## $ lotsizesquarefeet : num 85768 4083 63085 7521 8512 ...
## $ propertycountylandusecode : Factor w/ 235 levels "","0","010","0100",..: 14 12 150 150 162 162 22 162 14 162 ...
## $ propertylandusetypeid : int 269 261 47 47 31 31 260 31 269 31 ...
## $ propertyzoningdesc : Factor w/ 5652 levels "","**AHRP","#12",..: 1 2166 1821 1821 1843 1825 1821 574 598 4819 ...
## $ rawcensustractandblock : num 60378002 60378001 60377030 60371412 60371232 ...
## $ regionidcity : int 37688 37688 51617 12447 12447 12447 12447 396054 396054 47547 ...
## $ regionidcounty : int 3101 3101 3101 3101 3101 3101 3101 3101 3101 3101 ...
## $ regionidzip : int 96337 96337 96095 96424 96450 96446 96049 96434 96436 96366 ...
## $ roomcnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ yearbuilt : num NA NA 1959 1948 1947 ...
## $ fireplaceflag : Factor w/ 2 levels "","true": 1 1 1 1 1 1 1 1 1 1 ...
## $ structuretaxvaluedollarcnt : num NA NA 660680 580059 196751 ...
## $ taxvaluedollarcnt : num 9 27516 1434941 1174475 440101 ...
## $ assessmentyear : int 2016 2015 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ landtaxvaluedollarcnt : num 9 27516 774261 594416 243350 ...
## $ taxamount : num NA NA 20800 14558 5725 ...
## $ taxdelinquencyflag : Factor w/ 2 levels "","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ censustractandblock : num NA NA NA NA NA NA NA NA NA NA ...
#Subset assigned columns from the original data set
col_index<-c(2:9)
juanelle<-df_rm_na[,col_index]
head(juanelle)
## bathroomcnt bedroomcnt calculatedbathnbr calculatedfinishedsquarefeet
## 1 0 0 NA NA
## 2 0 0 NA NA
## 3 0 0 NA 73026
## 4 0 0 NA 5068
## 5 0 0 NA 1776
## 6 0 0 NA 2400
## finishedsquarefeet12 fips fullbathcnt hashottuborspa
## 1 NA 6037 NA
## 2 NA 6037 NA
## 3 NA 6037 NA
## 4 NA 6037 NA
## 5 NA 6037 NA
## 6 NA 6037 NA
Since most of the variables in this subset are discrete or categorical in nature, the researcher chose to first use the table and summary functions to gain a sense of the distribution of each variable.
The variable ‘bedroomcnt’ provides information on the number of bedrooms in a home.
table(juanelle$bedroomcnt)
##
## 0 1 2 3 4 5 6 7 8
## 118705 86941 606782 1172757 731475 182765 48915 12763 13542
## 9 10 11 12 13 14 15 16 17
## 4279 1702 425 959 86 69 24 50 11
## 18 19 20 21 23 24 25
## 9 1 8 1 1 1 1
summary(juanelle$bedroomcnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.000 3.000 3.093 4.000 25.000 2945
The above summary shows that homes in the data set have an average of 3.093 bedrooms. The maximum number of bedrooms found in the data set is 25 and the minimum is 0. Homes with bedroom counts of 19, 21, 23, 24 and 25 are outliers in this analysis, as shown in the boxplot below. The distribution is clearly unimodal and skewed to the right.
boxplot(juanelle$bedroomcnt)
Description of variable according to data dictionary: Number of bathrooms in home including fractional bathroom.
bathroomcnt<-table(juanelle$bathroomcnt)
bathroomcnt
##
## 0 0.5 1 1.5 1.75 2 2.5 3 3.5
## 113470 16 499332 45735 4 1219811 208809 633089 31835
## 4 4.5 5 5.5 6 6.5 7 7.5 8
## 133922 19864 38514 6275 16416 1352 6221 385 4548
## 8.5 9 9.5 10 10.5 11 11.5 12 12.5
## 113 1341 50 496 14 200 3 269 3
## 13 14 14.5 15 16 17 18 19 19.5
## 53 39 1 21 25 8 12 3 1
## 20 31 32
## 8 1 1
summary(juanelle$bathroomcnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 2.000 2.000 2.216 3.000 32.000 2957
Description of variable according to data dictionary: Number of bathrooms in home including fractional bathroom.
table(juanelle$calculatedbathnbr)
##
## 1 1.5 2 2.5 3 3.5 4 4.5 5
## 499324 45427 1219799 208578 633088 31773 133922 19811 38514
## 5.5 6 6.5 7 7.5 8 8.5 9 9.5
## 6259 16416 1340 6221 382 4548 110 1341 50
## 10 10.5 11 11.5 12 12.5 13 14 14.5
## 496 14 200 3 269 3 53 39 1
## 15 16 17 18 19 19.5 20 31 32
## 21 25 8 12 3 1 8 1 1
summary(juanelle$calculatedbathnbr)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 2.0 2.0 2.3 3.0 32.0 117156
From observation, bathroomcnt and calculatedbathnbr provide the same information. However, the latter has far more missing values. Thus, bathroomcnt is the better variable to use for the number of bathrooms (inclusive of fractional bathrooms) in homes. Shown below is a barplot visualising the distribution of bathroom count. The distribution is unimodal and skewed to the right.
barplot(bathroomcnt, main = "Bathroom Count")
Description of variable according to data dictionary: Number of full bathrooms (sink, shower + bathtub, and toilet) present in home
fullbath<-table(juanelle$fullbathcnt)
fullbath
##
## 1 2 3 4 5 6 7 8 9
## 544794 1428927 664914 153684 44721 17499 6468 4575 1347
## 10 11 12 13 14 15 16 17 18
## 495 197 268 54 39 20 25 8 12
## 19 20 31 32
## 4 8 1 1
summary(juanelle$fullbathcnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 2.00 2.00 2.25 3.00 32.00 117156
Based on the summary, the maximum number of full bathrooms in the data set is 32 and the minimum is 1. Shown below are two visualisations of this information:
barplot(fullbath, main = " Full Bath Count")
This visualisation shows that the distribution is unimodal and right skewed, with the center between two and three.
boxplot(juanelle$fullbathcnt)
The boxplot above shows that there is a significant number of outliers in this data set.
Description of variable according to data dictionary: This variable indicates whether or not a home has a hot tub or spa.
summary(juanelle$hashottuborspa)
## true
## 2935155 50062
Based on the summary of hashottuborspa, there are 50062 homes in the data set with a hot tub or spa.
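As a quick sketch, the same figure can be expressed as a share of all homes, roughly 1.7%:
mean(juanelle$hashottuborspa == "true")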
Description of variable according to data dictionary: Calculated total finished living area of the home.
# table(juanelle$calculatedfinishedsquarefeet)
summary(juanelle$calculatedfinishedsquarefeet)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 1215 1574 1832 2140 952576 45097
Description of variable according to data dictionary:Finished living area
#table(juanelle$finishedsquarefeet12)
summary(juanelle$finishedsquarefeet12)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 1198 1542 1764 2075 427079 264431
Description of variable according to data dictionary: Federal Information Processing Standard (FIPS) county code.
table(juanelle$fips)
##
## 6037 6059 6111
## 2012741 745800 223744
summary(juanelle$fips)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 6037 6037 6037 6048 6059 6111 2932
Based on the regionidcity variable, the grouped longitude and latitude values were extracted and the mean longitude and latitude were calculated. The interactive widget below displays where each regionid sits on a regional map. From it we can determine that the dataset contains properties within Southern California.
properties %>%
count(regionidcity)
## # A tibble: 187 x 2
## regionidcity n
## <int> <int>
## 1 3491 1076
## 2 3980 6
## 3 4406 21331
## 4 5465 13499
## 5 5534 49123
## 6 6021 14769
## 7 6285 19
## 8 6395 7772
## 9 6822 781
## 10 8384 16361
## # ... with 177 more rows
city_pal <- colorFactor("Set2", properties$regionidcity)
df_rm_na %>%
group_by(regionidcity = as.factor(regionidcity)) %>%
summarise(avg_lng = mean(longitude/1e6, na.rm = T),
avg_lat = mean(latitude/1e6, na.rm =T)) %>%
leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addCircleMarkers(lng = ~avg_lng, lat = ~avg_lat, color = ~city_pal(regionidcity),
stroke = FALSE, fillOpacity = 0.8, popup = ~regionidcity)
## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
To add detailed mapping to our analysis, a connection to Google Maps was established using the Google Maps API. This portion of code demonstrates the connection and map pull using the average longitude and latitude information above.
#Subset the location and tax columns assigned to this section from df_rm_na
col_index<-c(10,11,16,23:29)
calvin<-df_rm_na[,col_index]
#Keep latitude, longitude, tax value and assessment year, and sample 1% of rows for plotting
col_index<-c(1,2,5,6)
houses <- calvin[,col_index]
houses <- setDT(houses) %>% sample_frac(.01)
#Rescale the raw integer coordinates to decimal degrees; impute missing coordinates with a central Los Angeles location and missing tax values with the median
houses[, longitude := ifelse(longitude > 1e+10, longitude/1e+07, longitude/1e+06)]
houses[is.na(longitude), longitude := -118.255]
houses[, latitude := ifelse(is.na(latitude), 34.049, latitude/1e+06)]
houses[is.na(taxvaluedollarcnt), taxvaluedollarcnt := median(taxvaluedollarcnt, na.rm = TRUE)]
register_google(key = "AIzaSyCkdfkuRrcEbtgWtW6a2c5CNqjIjk6pfgY")
ggmap_credentials()
## Warning: 'ggmap_credentials' is deprecated.
## Use 'getOption("ggmap")' instead.
## See help("Deprecated")
## $google
## Key - AIzaSyCkdfkuRrcEbtgWtW6a2c5CNqjIjk6pfgY
## Account Type - standard
## Day Limit - Inf
## Second Limit - Inf
## Client -
## Signature -
##
## $display_api_key
## [1] FALSE
##
## attr(,"class")
## [1] "ggmap_credentials"
mapa <- get_map(location = c(lon = -118.255, lat = 34.049), zoom = 8, maptype = "hybrid")
## Source : https://maps.googleapis.com/maps/api/staticmap?center=34.049,-118.255&zoom=8&size=640x640&scale=2&maptype=hybrid&language=en-EN&key=xxx
Census tract and block ID combined - also contains blockgroup assignment by extension
Census tracts are geographic entities within counties, identified by a 4-digit basic code between 0001 and 9999, which may carry a 2-digit suffix ranging from .01 to .98; for example, 6059.02. Census tracts are made up of census blocks. Blocks within tracts are identified by a 4-digit code; even though a block code occurs only once within a tract, it may be used again in another tract.
If you paste these codes together, identifying each block requires the full 15 digits. For example, the census block containing the Office of the California Secretary of State, located at 1500 11th St, Sacramento, CA 95814, is fully identified by the code 06|067|001101|1085:

06 – identifies California
067 – identifies Sacramento County within California
001101 – identifies Census Tract 11.01 within Sacramento County
1085 – identifies Census Block 1085 within tract 11.01
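A minimal sketch in R of splitting such a 15-digit code into its components with substr(), using the Sacramento example above:
code <- "060670011011085"
substr(code, 1, 2)   # "06"     - state (California)
substr(code, 3, 5)   # "067"    - county (Sacramento)
substr(code, 6, 11)  # "001101" - tract (11.01)
substr(code, 12, 15) # "1085"   - block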
The raw data contained within rawcensustractandblock is unmodified and better suited for our analysis. We will therefore utilize this variable for further analysis.
The year of the property tax assessment
Here, we were able to utilize the Google Map pull and overlay the assessmentyear variable onto it:
ggmap(mapa) +
geom_point(data = houses, aes(x = longitude, y = latitude, color = assessmentyear, alpha = 0.8)) +
scale_colour_gradient2(low = "darkred", mid = "white", high = "yellow", midpoint = 2015, space = "Lab", na.value = "grey50", guide = "colourbar") +
theme(legend.position = "null", axis.title.x = element_blank(), axis.title.y = element_blank())
The assessed value of the land area of the parcel
Assessors generally use the market value of land to help determine the assessed land value of a property. If private sales of similar properties cannot be located, assessors may have to use development cost to help determine assessed values. Land information can include recent sales, location, permitted land use and other market related factors. When assessors determine the value of the property, they include the land at its market value, plus the buildings and other structures at their depreciated replacement cost. Once the assessors calculate the fair value of the land, they then calculate the value of the buildings on the land.
summary(calvin$landtaxvaluedollarcnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 79700 176619 268456 326100 94011079 59926
The assessed value of the built structure on the parcel
Improvements refer to buildings and permanent structures on the land. The most common way to determine these values is to consider how much money it would take at the current material and labor costs to replace your building with one like it. If the building is not new, consideration must also be given to how much it has depreciated. To be able to determine the value of any piece of property, it is necessary for the assessor to know what it might cost today to replace it, the selling price of similar properties and many other facts. Building information can include building size, features, type of construction, materials used, age, condition and quality of construction. Also considered are factors such as flooring, plumbing fixtures, furnaces, and finished basement rooms.
summary(calvin$structuretaxvaluedollarcnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 77666 127066 178143 204000 255321161 46464
The total tax assessed value of the parcel
Property taxes are calculated on the assessed value of property. Property assessment provides the basis for determining each person’s fair share of taxes in proportion to the amount his or her individual property is worth. The final assessed value is used by government (territorial, city, town, village, etc….) to determine the amount of taxes payable by the property owner.
Here, we were able to utilize the Google Map pull and overlay the taxvaluedollarcnt variable onto it.
summary(calvin$taxvaluedollarcnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 188220 321161 443528 514072 319622473 34266
ggmap(mapa) +
geom_point(data = houses, aes(x = longitude, y = latitude, color = taxvaluedollarcnt, alpha = 1/1000)) +
scale_colour_gradient2(low = "darkred", mid = "white", high = "yellow", midpoint = 250000, space = "Lab", na.value = "grey50", guide = "colourbar") +
theme(legend.position = "null", axis.title.x = element_blank(), axis.title.y = element_blank())
The total property tax assessed for that assessment year
Assessors do not collect taxes or determine the amount of taxes to be collected. This is left up to the municipality in which the property resides. Assessors do, however, determine the fair value of a property so that taxes can be calculated.
plt <- qplot(calvin$taxamount, geom="histogram", binwidth = 1000, main = "Distribution of Tax Amount", ylab = "Frequency", xlab = "Tax Amount", fill=I("blue"), col=I("red"), alpha=I(.2), xlim=c(5000,30000)) +
theme(axis.text.y = element_text(angle=45))
suppressWarnings(print(plt))
landtaxvaluedollarcnt + structuretaxvaluedollarcnt = taxvaluedollarcnt
taxamount = tax rate (determined by the municipality) * taxvaluedollarcnt
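As a quick sanity check (a sketch, assuming the calvin columns shown above), both identities can be inspected directly on the data:
# land value + structure value minus total assessed value should be near zero
with(calvin, summary(landtaxvaluedollarcnt + structuretaxvaluedollarcnt - taxvaluedollarcnt))
# implied effective tax rate per parcel
summary(calvin$taxamount / calvin$taxvaluedollarcnt)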
Property taxes for this parcel are past due as of 2015
Tax payments must be postmarked by the due date in order to be processed as a timely payment. This variable shows properties with a past due property tax as of 2015. This could mean that the property has a property-tax lien on it. A property-tax lien is a legal claim against a property for unpaid property taxes. A tax lien prohibits a property from being sold or refinanced until the taxes are paid and the lien is removed.
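As a sketch, the flagged parcels can be tabulated directly (the factor levels are "" and "Y", as shown in str() above):
table(calvin$taxdelinquencyflag)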
#str(df_rm_na)
#what are property types like-count
#Map propertylandusetypeid codes to human-readable descriptions
PropertyLandUsedDesc <- function(propertylandusetypeid)
{
  desc <- c(
    '31'  = 'Commercial/Office/Residential Mixed Used',
    '46'  = 'Multi-Story Store',
    '47'  = 'Store/Office (Mixed Use)',
    '246' = 'Duplex (2 Units, Any Combination)',
    '247' = 'Triplex (3 Units, Any Combination)',
    '248' = 'Quadruplex (4 Units, Any Combination)',
    '260' = 'Residential General',
    '261' = 'Single Family Residential',
    '262' = 'Rural Residence',
    '263' = 'Mobile Home',
    '264' = 'Townhouse',
    '265' = 'Cluster Home',
    '266' = 'Condominium',
    '267' = 'Cooperative',
    '268' = 'Row House',
    '269' = 'Planned Unit Development',
    '270' = 'Residential Common Area',
    '271' = 'Timeshare',
    '273' = 'Bungalow',
    '274' = 'Zero Lot Line',
    '275' = 'Manufactured, Modular, Prefabricated Homes',
    '276' = 'Patio Home',
    '279' = 'Inferred Single Family Residential',
    '290' = 'Vacant Land - General',
    '291' = 'Residential Vacant Land'
  )
  id <- as.character(propertylandusetypeid)
  if (id %in% names(desc)) unname(desc[id]) else 'Not Determined'
}
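For reference, a quick example of the lookup (the function is not otherwise applied in this report):
PropertyLandUsedDesc(261)
## [1] "Single Family Residential"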
#visualisation of property types count
### load dataset with logerror
traindf <- dbGetQuery(conn, "select * from train_2017")
### merge the properties dataset with the dataset containing logerror
target_df <- merge.data.frame(traindf, df_rm_na,by="parcelid")
# Organize numeric and categorical data
#Convert some columns, such as propertylandusetypeid and fips, to categorical data
target_df$propertylandusetypeid<-factor(target_df$propertylandusetypeid)
target_df$fips <- factor (target_df$fips)
### build a function to fill in NA values and clean data in the fireplaceflag column (empty or NA becomes -1, anything else 1)
flagColClean <- function(emp)
{ emp<- as.character(emp)
  if (is.na(emp) || emp == "")
  {
    return (-1)
  }
  else
  {
    return (1)
  }
}
target_df$fireplaceflag<- sapply( target_df$fireplaceflag, flagColClean)
target_df$fireplaceflag<- factor( target_df$fireplaceflag )
### apply the same cleaning function to the taxdelinquencyflag column
target_df$taxdelinquencyflag<- sapply( target_df$taxdelinquencyflag, flagColClean)
target_df$taxdelinquencyflag<- factor( target_df$taxdelinquencyflag )
### Extract the month from the transactiondate variable (as.POSIXlt's $mon is zero-based: January = 0)
target_df$transactionMonth <- as.POSIXlt(target_df$transactiondate, format="%Y-%m-%d")$mon
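# As a sketch, an equivalent one-based extraction with lubridate (loaded above) would be:
# target_df$transactionMonth <- month(ymd(target_df$transactiondate))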
### Make sure there are no duplicated rows per parcelid (remove duplicate records based on parcelid)
target_df <- target_df %>% distinct(parcelid, .keep_all = TRUE)
rownames(target_df) <- target_df$parcelid
#remove all rows with NA value
target_df2 <- target_df[complete.cases(target_df), ]
The purpose of this section is to reorganize some columns in the dataset and to create new variables.
# Impute missing yearbuilt values with the median; N.HousYear below uses yearbuilt to calculate the age of the property
target_df$yearbuilt[is.na(target_df$yearbuilt)]<-median(target_df$yearbuilt,na.rm=T)
target_df2$N.HousYear = 2018 -target_df2$yearbuilt
# create N.LivingAreaProp col to find the living area per total lot
target_df2$N.LivingAreaProp =target_df2$calculatedfinishedsquarefeet /target_df2$lotsizesquarefeet
# Create N.ExtraSpace to indicate the lot area beyond the living area
target_df2$N.ExtraSpace =target_df2$lotsizesquarefeet -target_df2$calculatedfinishedsquarefeet
# Total number of bedrooms and bathrooms
target_df2$N.TotalRooms =target_df2$bathroomcnt +target_df2$bedroomcnt
# create var to indicate the average room size
target_df2$N.AvRoomSize =target_df2$calculatedfinishedsquarefeet/target_df2$N.TotalRooms
# create var to indicate how many extra rooms are available beyond the bedrooms and bathrooms
target_df2$N.ExtraRooms =target_df2$roomcnt -target_df2$N.TotalRooms
# Ratio of the built structure value to land area
target_df2$N.ValueProp =target_df2$structuretaxvaluedollarcnt /target_df2$landtaxvaluedollarcnt
### Deal with latitude and longitude data
target_df2$latitude = as.numeric(target_df2$latitude)
target_df2$longitude= as.numeric(target_df2$longitude)
target_df2$N.location = target_df2$latitude +target_df2$longitude
target_df2$N.location2 = as.numeric(target_df2$latitude/target_df2$longitude )
# Count the number of properties per region
city_count <- count(target_df2, regionidcity)
names(city_count) <- c('regionidcity','City_Count')
target_df2<-merge(x=target_df2,y=city_count,by="regionidcity" ,all=TRUE)
# Count the number of properties per County
county_count <- count(target_df2, regionidcounty)
names(county_count) <- c('regionidcounty','County_Count')
target_df2<-merge(x=target_df2,y=county_count,by="regionidcounty" ,all=TRUE)
# Count the number of properties per area (located in the same zip area)
zip_count <- count(target_df2, regionidzip)
names(zip_count) <- c('regionidzip','Zip_Count')
target_df2<-merge(x=target_df2,y= zip_count ,by="regionidzip" ,all=TRUE)
# Create variable for the effective tax rate: tax amount per dollar of assessed value
target_df2$N.ValueRatio = target_df2$taxamount /target_df2$taxvaluedollarcnt
# Create a composite tax score: the product of assessed value and tax amount
target_df2$N.TaxScore =target_df2$taxvaluedollarcnt*target_df2$taxamount
## remove free-text, location, and identifier columns that will not be used for modeling
target_df2 <- target_df2 %>% select (-contains('propertyzoningdesc'), -contains('itude'), -starts_with('year'), -contains('transactiondate'), -contains('regionid'), -contains('assessmentyear'))
## list all column names of the dataframe grouped by data type
split(names(target_df2),sapply(target_df2, function(x) paste(class(x), collapse=" ")))
## $factor
## [1] "fips" "hashottuborspa"
## [3] "propertycountylandusecode" "propertylandusetypeid"
## [5] "fireplaceflag" "taxdelinquencyflag"
##
## $integer
## [1] "parcelid" "finishedsquarefeet12" "fullbathcnt"
## [4] "transactionMonth" "City_Count" "County_Count"
## [7] "Zip_Count"
##
## $numeric
## [1] "logerror" "bathroomcnt"
## [3] "bedroomcnt" "calculatedbathnbr"
## [5] "calculatedfinishedsquarefeet" "lotsizesquarefeet"
## [7] "rawcensustractandblock" "roomcnt"
## [9] "structuretaxvaluedollarcnt" "taxvaluedollarcnt"
## [11] "landtaxvaluedollarcnt" "taxamount"
## [13] "censustractandblock" "N.HousYear"
## [15] "N.LivingAreaProp" "N.ExtraSpace"
## [17] "N.TotalRooms" "N.AvRoomSize"
## [19] "N.ExtraRooms" "N.ValueProp"
## [21] "N.location" "N.location2"
## [23] "N.ValueRatio" "N.TaxScore"
In this section we normalize the numeric data, rescaling each numeric column to the [0, 1] range (min-max normalization).
### Find all numerical and categorical columns
numcol <- sapply(target_df2,is.numeric)
### Manually flag logerror and rawcensustractandblock as non-numeric because we do not want to normalize those two columns.
numcol['logerror'] <- FALSE
numcol ['rawcensustractandblock'] <- FALSE
numcol_df<- as.data.frame(numcol)
numcol_df <- setDT(numcol_df, keep.rownames = TRUE)[]
names(numcol_df) <- c ("ColName","IsNumeric")
Numeric_col <- numcol_df %>% filter (IsNumeric==TRUE)
Numeric_col <- Numeric_col$ColName
### build scaling function
normfun <- function(x) (x - min(x))/(max(x)-min(x))  # min-max scaling to [0, 1]
### apply the function on all the numeric data
for (i in Numeric_col)
{
  target_df2[i] <- lapply(target_df2[i], normfun)
}
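As a small worked example of the scaling function (a sketch, not part of the pipeline):
normfun(c(2, 4, 6, 10))
## [1] 0.00 0.25 0.50 1.00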
There are some categorical variables with text data or multiple levels. These variables needed further cleanup.
### recode the categorical columns as numeric codes
target_df2$fips = as.numeric(factor(target_df2$fips,
levels = c("6037", "6059" ,"6111"),
labels = c(1, 2, 3)))
target_df2$hashottuborspa = as.numeric(factor(target_df2$hashottuborspa,
levels = c("" , "true"),
labels = c(-1, 1)))
landCodeClean <- function(emp)
{
  codes <- c('0100' = 1, '22' = 2, '010C' = 3)
  emp <- as.character(emp)
  if (emp %in% names(codes)) unname(codes[emp]) else 4
}
target_df2$propertycountylandusecode<- sapply( target_df2$propertycountylandusecode, landCodeClean)
target_df2$propertycountylandusecode<- factor(target_df2$propertycountylandusecode )
propertypeClean <- function(emp)
{
  codes <- c('261' = 1, '266' = 2, '269' = 3)
  emp <- as.character(emp)
  if (emp %in% names(codes)) unname(codes[emp]) else 4
}
target_df2$propertylandusetypeid<- sapply( target_df2$propertylandusetypeid, propertypeClean)
target_df2$propertylandusetypeid<- factor( target_df2$propertylandusetypeid )
dataset <- target_df2
str(dataset)
## 'data.frame': 64065 obs. of 37 variables:
## $ parcelid : num 0.00976 0.00975 0.0097 0.00979 0.00976 ...
## $ logerror : num 0.15904 0.62487 0.22207 -0.05362 -0.00454 ...
## $ bathroomcnt : num 0 0 0.167 0 0 ...
## $ bedroomcnt : num 0.273 0.182 0.545 0.182 0.182 ...
## $ calculatedbathnbr : num 0 0 0.167 0 0 ...
## $ calculatedfinishedsquarefeet: num 0.0489 0.0408 0.1266 0.0253 0.0287 ...
## $ finishedsquarefeet12 : num 0.0489 0.0408 0.1266 0.0253 0.0287 ...
## $ fips : num 1 1 1 1 1 1 1 1 1 1 ...
## $ fullbathcnt : num 0 0 0.167 0 0 ...
## $ hashottuborspa : num 1 1 1 1 1 1 1 1 1 1 ...
## $ lotsizesquarefeet : num 0.000712 0.000769 0.001014 0.000698 0.000658 ...
## $ propertycountylandusecode : Factor w/ 3 levels "1","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ propertylandusetypeid : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ rawcensustractandblock : num 60372395 60375350 60375330 60372398 60375350 ...
## $ roomcnt : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fireplaceflag : Factor w/ 2 levels "-1","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ structuretaxvaluedollarcnt : num 0.00664 0.00141 0.02021 0.00102 0.00107 ...
## $ taxvaluedollarcnt : num 0.011654 0.00089 0.012117 0.000816 0.000726 ...
## $ landtaxvaluedollarcnt : num 0.010649 0.000566 0.005607 0.000641 0.00052 ...
## $ taxamount : num 0.01275 0.00255 0.01456 0.00112 0.00286 ...
## $ taxdelinquencyflag : Factor w/ 2 levels "-1","1": 1 1 1 1 1 2 1 1 1 1 ...
## $ censustractandblock : num 3.27e-06 1.03e-05 1.02e-05 3.28e-06 1.03e-05 ...
## $ transactionMonth : num 1 0.625 0 0.375 0.75 0.125 0.375 0.5 0.75 0.375 ...
## $ N.HousYear : num 0.645 0.674 0.478 0.652 0.652 ...
## $ N.LivingAreaProp : num 0.0224 0.0178 0.0381 0.0132 0.0154 ...
## $ N.ExtraSpace : num 0.000932 0.001014 0.000991 0.000991 0.00094 ...
## $ N.TotalRooms : num 0.15 0.1 0.4 0.1 0.1 0.2 0.1 0.2 0.1 0.05 ...
## $ N.AvRoomSize : num 0.0948 0.1091 0.1012 0.0713 0.0798 ...
## $ N.ExtraRooms : num 0.607 0.643 0.429 0.643 0.643 ...
## $ N.ValueProp : num 0.000418 0.001659 0.002414 0.001072 0.001374 ...
## $ N.location : num 0.402 0.411 0.417 0.4 0.41 ...
## $ N.location2 : num 0.581 0.577 0.57 0.585 0.576 ...
## $ City_Count : num 1 0.00568 0.00568 1 0.00568 ...
## $ County_Count : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Zip_Count : num 0.0857 0.0857 0.0857 0.0857 0.0857 ...
## $ N.ValueRatio : num 0.0251 0.0597 0.0276 0.0291 0.0793 ...
## $ N.TaxScore : num 1.52e-04 2.56e-06 1.81e-04 1.00e-06 2.39e-06 ...
dataset <- dataset %>% select(-contains("parcelid"))
H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment.
#install.packages('h2o')
######################################
# Setup h2o
######################################
h2o.init(nthreads = -1, max_mem_size = '8g', ip = "127.0.0.1")
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 37 minutes 1 seconds
## H2O cluster timezone: America/New_York
## H2O data parsing timezone: UTC
## H2O cluster version: 3.20.0.8
## H2O cluster version age: 2 months and 18 days
## H2O cluster name: H2O_started_from_R_cwong79_ndv646
## H2O cluster total nodes: 1
## H2O cluster total memory: 7.89 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: 127.0.0.1
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.5.1 (2018-07-02)
### Convert the dataset to an H2O frame and split it (selecting the best split via grid search was skipped in this project)
datah2o <- as.h2o(dataset)
##
|
| | 0%
|
|=================================================================| 100%
splits <- h2o.splitFrame(data = datah2o,
                         ratios = c(0.7, 0.15), # 70% train, 15% validation, remainder test
                         seed = 1) # setting a seed will guarantee reproducibility
### split dataset into train , valid and test dataset
train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]
# Identify predictors and response
### Y is our target variable to predict
y <- "logerror"
### choose all variables except our target variable
x <- names(train)[which(names(train)!="logerror")]
### list all features/variables
print(x)
## [1] "bathroomcnt" "bedroomcnt"
## [3] "calculatedbathnbr" "calculatedfinishedsquarefeet"
## [5] "finishedsquarefeet12" "fips"
## [7] "fullbathcnt" "hashottuborspa"
## [9] "lotsizesquarefeet" "propertycountylandusecode"
## [11] "propertylandusetypeid" "rawcensustractandblock"
## [13] "roomcnt" "fireplaceflag"
## [15] "structuretaxvaluedollarcnt" "taxvaluedollarcnt"
## [17] "landtaxvaluedollarcnt" "taxamount"
## [19] "taxdelinquencyflag" "censustractandblock"
## [21] "transactionMonth" "N.HousYear"
## [23] "N.LivingAreaProp" "N.ExtraSpace"
## [25] "N.TotalRooms" "N.AvRoomSize"
## [27] "N.ExtraRooms" "N.ValueProp"
## [29] "N.location" "N.location2"
## [31] "City_Count" "County_Count"
## [33] "Zip_Count" "N.ValueRatio"
## [35] "N.TaxScore"
## Show the number of records in each dataset
nrow(train)
## [1] 44851
nrow(valid)
## [1] 9557
nrow(test)
## [1] 9657
We chose boosting algorithms as our ML method because they are extremely popular machine learning algorithms that have proven successful across many domains, and boosting is one of the leading methods for winning Kaggle competitions. Whereas random forests build an ensemble of deep independent trees, GBMs build an ensemble of shallow, weak, successive trees, with each tree learning from and improving on the previous. Combined, these many weak successive trees produce a powerful “committee” that is often hard to beat with other algorithms.
Gradient Boosting Machine (for Regression and Classification) is a forward learning ensemble method. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O’s GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel.
# library(h2o)
# h2o.init(nthreads = -1, max_mem_size = "8g")
#
#
# ### split dataset Best approach by using the Gridsearch (Skipped in the project )
#
# datah2o <- as.h2o(dataset)
# splits <- h2o.splitFrame(data = datah2o,
# ratios = c(0.7, 0.15), #
# seed = 1) #setting a seed will guarantee reproducibility
# train <- splits[[1]]
# valid <- splits[[2]]
# test <- splits[[3]]
#
#
# # Identify predictors and response
# x <- names(train)[which(names(train)!="logerror")]
# y <- "logerror"
# print(x)
#gbm.fit <- h2o.gbm(
# x = x,
# y = y,
# training_frame = train,
# nfolds = 5,
# ntrees = 5000,
# stopping_rounds = 10,
# stopping_tolerance = 0,
# seed = 123
#)
#h2o.varimp_plot(gbm.fit , num_of_features = 15)
#h2o.mae(gbm.fit)
#h2o.shutdown()
We changed the following parameters of the model:

Tree complexity:
ntrees: number of trees to train
max_depth: depth of each tree
min_rows: fewest observations allowed in a terminal node

Learning rate:
learn_rate: rate to descend the loss function gradient
learn_rate_annealing: allows a high initial learn_rate that is gradually reduced as trees are added (speeds up training)

Adding stochastic nature:
sample_rate: row sample rate per tree
col_sample_rate: column sample rate per tree (synonymous with xgboost’s colsample_bytree)
# create hyperparameter grid
hyper_grid <- expand.grid(
shrinkage = c(.01, .1, .3),
interaction.depth = c(1, 3, 5),
n.minobsinnode = c(5, 10, 15),
bag.fraction = c(.65, .8, 1),
optimal_trees = 0, # a place to dump results
min_RMSE = 0 # a place to dump results
)
search_criteria <- list(
strategy = "RandomDiscrete",
stopping_metric = "mse",
stopping_tolerance = 0.005,
stopping_rounds = 10,
max_runtime_secs = 60*60
)
#
# perform grid search
grid <- h2o.grid(
algorithm = "gbm",
grid_id = "gbm_grid2",
x = x,
y = y,
training_frame = train,
validation_frame = valid,
#hyper_params = hyper_grid,
search_criteria = search_criteria, # add search criteria
ntrees = 5000,
stopping_rounds = 10,
stopping_tolerance = 0,
seed = 123
)
##
## ERROR: Unexpected HTTP Status code: 412 Precondition Failed (url = http://127.0.0.1:54321/99/Grid/gbm)
##
## water.exceptions.H2OIllegalArgumentException
## [1] "water.exceptions.H2OIllegalArgumentException: Illegal argument: training_frame of function: grid: Cannot append new models to a grid with different training input"
## [2] " hex.grid.GridSearch.start(GridSearch.java:103)"
## [3] " hex.grid.GridSearch.startGridSearch(GridSearch.java:449)"
## [4] " hex.grid.GridSearch.startGridSearch(GridSearch.java:397)"
## [5] " water.api.GridSearchHandler.handle(GridSearchHandler.java:103)"
## [6] " water.api.GridSearchHandler.handle(GridSearchHandler.java:36)"
## [7] " water.api.RequestServer.serve(RequestServer.java:451)"
## [8] " water.api.RequestServer.doGeneric(RequestServer.java:296)"
## [9] " water.api.RequestServer.doPost(RequestServer.java:222)"
## [10] " javax.servlet.http.HttpServlet.service(HttpServlet.java:755)"
## [11] " javax.servlet.http.HttpServlet.service(HttpServlet.java:848)"
## [12] " org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)"
## [13] " org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)"
## [14] " org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)"
## [15] " org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)"
## [16] " org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)"
## [17] " org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)"
## [18] " org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"
## [19] " org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"
## [20] " water.JettyHTTPD$LoginHandler.handle(JettyHTTPD.java:197)"
## [21] " org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"
## [22] " org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"
## [23] " org.eclipse.jetty.server.Server.handle(Server.java:370)"
## [24] " org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)"
## [25] " org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)"
## [26] " org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)"
## [27] " org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)"
## [28] " org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)"
## [29] " org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)"
## [30] " org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)"
## [31] " org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)"
## [32] " org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)"
## [33] " org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)"
## [34] " java.base/java.lang.Thread.run(Thread.java:834)"
## Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, :
##
## ERROR MESSAGE:
##
## Illegal argument: training_frame of function: grid: Cannot append new models to a grid with different training input
The h2o.grid call above failed because a grid with the same grid_id ("gbm_grid2") already existed in the running H2O cluster from an earlier run with different training input; the results stored in that existing grid are retrieved below.
# collect the results and sort by our model performance metric of choice
grid_perf <- h2o.getGrid(
grid_id = "gbm_grid2",
sort_by = "mse",
decreasing = FALSE
)
grid_perf
## H2O Grid Details
## ================
##
## Grid ID: gbm_grid2
## Used hyper parameters:
## Number of models: 2
## Number of failed models: 0
##
## Hyper-Parameter Search Summary: ordered by increasing mse
## model_ids mse
## 1 gbm_grid2_model_1 0.02194353160742622
## 2 gbm_grid2_model_0 0.022881018848163435
#h2o.shutdown()
In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and what is estimated.
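As a small worked example (a sketch, unrelated to the model above):
y <- c(1.0, 2.0, 3.0)    # observed values
yhat <- c(1.1, 1.9, 3.2) # predicted values
mean((y - yhat)^2)       # (0.01 + 0.01 + 0.04) / 3 = 0.02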
We identified the top model as the one with the lowest MSE score.
best_model_id <- grid_perf@model_ids[[1]]
best_model <- h2o.getModel(best_model_id)
h2o.performance(model = best_model, valid = TRUE)
## H2ORegressionMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 0.02194353
## RMSE: 0.1481335
## MAE: 0.06644435
## RMSLE: NaN
## Mean Residual Deviance : 0.02194353
# Using the best parameters to run the ML model
h2o.final <- h2o.gbm(
x = x,
y = y,
training_frame = train,
nfolds = 5,
ntrees = 10000,
learn_rate = 0.05,
learn_rate_annealing = 0.99,
max_depth = 5,
min_rows = 10,
sample_rate = 0.75,
col_sample_rate = 1,
stopping_rounds = 10,
stopping_tolerance = 0,
seed = 123
)
##
|
| | 0%
|
|====================== | 33%
|
|=========================================== | 67%
|
|====================================================== | 83%
|
|=================================================================| 100%
We then evaluated the best grid model on the held-out test set; the MSE is approximately 0.023.
h2o.performance(model =best_model , newdata = as.h2o(test))
## H2ORegressionMetrics: gbm
##
## MSE: 0.0229929
## RMSE: 0.1516341
## MAE: 0.06721181
## RMSLE: NaN
## Mean Residual Deviance : 0.0229929
After re-running our final model, we want to understand which variables have the largest influence on house sale price. We found that variables such as the house build year and tax amount are strongly associated with the house price.
h2o.varimp_plot(h2o.final, num_of_features = 15)
We used the LIME function (LIME is a newer procedure for understanding why a prediction resulted in a given value for a single observation). We can now apply it to two test observations. The results show the predicted value for each (Case 1: 0.0141, Case 2: 0.00997), the local model fit (both are relatively poor), and the most influential variables driving the predicted value for each observation.
#install.packages('lime')
local_obs <- as.data.frame(test[1:2, ])
local_obs$logerror
## [1] -0.448155306 -0.001193907
explainer <- lime(as.data.frame(train), h2o.final)
explanation <- explain(local_obs, explainer, n_features = 5)
##
|
| | 0%
|
|=================================================================| 100%
##
|
| | 0%
|
|=================================================================| 100%
plot_features(explanation)
## Warning: Unknown or uninitialised column: 'label'.
#h2o.shutdown()
h2o.performance(model = h2o.final, newdata = as.h2o(test))
## H2ORegressionMetrics: gbm
##
## MSE: 0.02295649
## RMSE: 0.151514
## MAE: 0.06699506
## RMSLE: NaN
## Mean Residual Deviance : 0.02295649
The MAE score provided by the current model is close to those of other models published on the Kaggle website. This result can be further improved with grid search and ensemble methods.
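As a sketch of how such a grid search could be set up with H2O-native parameter names (the hyper_grid defined earlier used gbm-package names such as shrinkage and interaction.depth, which h2o.grid does not accept), reusing the train/valid frames and search_criteria from above; the hyperparameter values are illustrative assumptions:
# A sketch only: random grid search over H2O GBM hyperparameters
hyper_params <- list(
  learn_rate      = c(0.01, 0.05, 0.1),
  max_depth       = c(3, 5, 7),
  min_rows        = c(5, 10, 15),
  sample_rate     = c(0.65, 0.8, 1.0),
  col_sample_rate = c(0.8, 1.0)
)
grid_native <- h2o.grid(
  algorithm = "gbm",
  grid_id = "gbm_grid_native", # a fresh grid_id avoids the append error seen earlier
  x = x,
  y = y,
  training_frame = train,
  validation_frame = valid,
  hyper_params = hyper_params,
  search_criteria = search_criteria,
  ntrees = 1000,
  stopping_rounds = 10,
  stopping_tolerance = 0,
  seed = 123
)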