Project Title - Analysis of Real Estate Data

Group Members:

  1. Calvin Wong

  2. Vijaya Cherukuri

  3. Guang Qiu

  4. Juanelle Marks

Introduction

Real estate is one of the most important sectors in the economy. It is often the largest source of wealth and savings for many families. The affordability of real estate along with changes in property prices have a direct impact on the wealth of the general population. One of the fundamental activities in the determination of housing market is performing a comparative market analysis. This is where prices of similar properties in an area undergo price examination. Since no two properties are identical, sellers make price adjustments for properties-based categorical differences to determine fair market value.

Over-arching Project Goals

This project seeks to provide a better understanding of the relationsip between certain characteristics of housing and actual home prices. By developing our own analysis of the housing market we hope to be able to determine how cohesive our analysis is and if we can offer improvements to current Zestimate algorithms.

Data Source

Our research will reference the Zillow’s ‘Zestimate’ home valuation tool which was released 12 years ago. Zestimate is a tool which estimates home values based on statistical and machine learning models that analyze hundreds of data points on each property. This tool has become one of the largest and most trusted tool for real estate pricing information in the U.S. We plan to obtain and use datasets from a Kaggle competition, in which participants were asked to improve algorithms which drive Zestimate ( datasets are available at: https://www.kaggle.com/c/zillow-prize-1/data).

Load Libraries

#install.packages('psych')
#install.packages('leaflet')
#install.packages('ggmap')
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(dplyr)
library(shiny)
library(leaflet)
library(ggmap)
## Google Maps API Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it: see citation("ggmap") for details.
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## The following objects are masked from 'package:psych':
## 
##     alpha, rescale
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:ggmap':
## 
##     inset
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
library(bit64)
## Loading required package: bit
## Attaching package bit
## package:bit (c) 2008-2012 Jens Oehlschlaegel (GPL-2)
## creators: bit bitwhich
## coercion: as.logical as.integer as.bit as.bitwhich which
## operator: ! & | xor != ==
## querying: print length any all min max range sum summary
## bit access: length<- [ [<- [[ [[<-
## for more help type ?bit
## 
## Attaching package: 'bit'
## The following object is masked from 'package:data.table':
## 
##     setattr
## The following object is masked from 'package:psych':
## 
##     keysort
## The following object is masked from 'package:base':
## 
##     xor
## Attaching package bit64
## package:bit64 (c) 2011-2012 Jens Oehlschlaegel
## creators: integer64 seq :
## coercion: as.integer64 as.vector as.logical as.integer as.double as.character as.bin
## logical operator: ! & | xor != == < <= >= >
## arithmetic operator: + - * / %/% %% ^
## math: sign abs sqrt log log2 log10
## math: floor ceiling trunc round
## querying: is.integer64 is.vector [is.atomic} [length] format print str
## values: is.na is.nan is.finite is.infinite
## aggregation: any all min max range sum prod
## cumulation: diff cummin cummax cumsum cumprod
## access: length<- [ [<- [[ [[<-
## combine: c rep cbind rbind as.data.frame
## WARNING don't use as subscripts
## WARNING semantics differ from integer
## for more help type ?bit64
## 
## Attaching package: 'bit64'
## The following object is masked from 'package:bit':
## 
##     still.identical
## The following objects are masked from 'package:base':
## 
##     :, %in%, is.double, match, order, rank
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year
## The following object is masked from 'package:base':
## 
##     date
library(corrplot)
## corrplot 0.84 loaded
library(h2o)
## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit http://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following objects are masked from 'package:lubridate':
## 
##     day, hour, month, week, year
## The following object is masked from 'package:bit64':
## 
##     %in%
## The following objects are masked from 'package:data.table':
## 
##     hour, month, week, year
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc
library(lime)
## 
## Attaching package: 'lime'
## The following object is masked from 'package:dplyr':
## 
##     explain
library(lubridate)
library(magrittr)
library(data.table)
library(bit64)
library(tidyverse)
library(lubridate)
library(mice)
## Loading required package: lattice
## 
## Attaching package: 'mice'
## The following object is masked from 'package:tidyr':
## 
##     complete
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(corrplot)
knitr::opts_chunk$set(error = TRUE)

Read csv file into r

#Download the dataset from Kaggle link (https://www.kaggle.com/c/zillow-prize-1/data) with the name Properties_2016.csv and download to local drive and read this file.
#Note : Due to size limitations in GitHub we are using the dataset which was downloaded to local drive

##properties <-  read.csv(file="/Users/juanelle/Desktop/MSDS/Data607/final project/Project ##Proposal/FinalProject_House.csv",row.names = NULL)

properties <-  read.csv(file="/Users/cwong79/Desktop/FinalProject_House_properties.csv", row.names = NULL)
download.file("https://www.dropbox.com/s/ne9l87yrzykr85v/Final_house.db?raw=1", 
              "FinalProject_House.sqlite" )

Read data from sqlite db

#load library
#install.packages('RSQLite')
library(RSQLite)
sqlite <- dbDriver("SQLite")
conn <- dbConnect(sqlite,"FinalProject_House.sqlite")
# Show all tables avaialbe in the sqllite
alltables = dbListTables(conn)
alltables
## [1] "AirConditioningTypeID"    "ArchitecturalStyleTypeID"
## [3] "BuildingClassTypeID"      "Column_Dictionary"       
## [5] "HeatingOrSystemTypeID"    "PropertyLandUseTypeID"   
## [7] "StoryTypeID"              "TypeConstructionTypeID"  
## [9] "train_2017"
str(properties)
## 'data.frame':    2985217 obs. of  58 variables:
##  $ parcelid                    : int  10754147 10759547 10843547 10859147 10879947 10898347 10933547 10940747 10954547 10976347 ...
##  $ airconditioningtypeid       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ architecturalstyletypeid    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ basementsqft                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ bathroomcnt                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bedroomcnt                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ buildingclasstypeid         : int  NA NA 5 3 4 4 NA NA NA 3 ...
##  $ buildingqualitytypeid       : int  NA NA NA 6 NA 4 NA NA NA 4 ...
##  $ calculatedbathnbr           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ decktypeid                  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ finishedfloor1squarefeet    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ calculatedfinishedsquarefeet: num  NA NA 73026 5068 1776 ...
##  $ finishedsquarefeet12        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ finishedsquarefeet13        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ finishedsquarefeet15        : int  NA NA 73026 5068 1776 2400 NA 3611 NA 3754 ...
##  $ finishedsquarefeet50        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ finishedsquarefeet6         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ fips                        : int  6037 6037 6037 6037 6037 6037 6037 6037 6037 6037 ...
##  $ fireplacecnt                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ fullbathcnt                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ garagecarcnt                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ garagetotalsqft             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ hashottuborspa              : Factor w/ 2 levels "","true": 1 1 1 1 1 1 1 1 1 1 ...
##  $ heatingorsystemtypeid       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ latitude                    : int  34144442 34140430 33989359 34148863 34194168 34171873 34131929 34171345 34218210 34289776 ...
##  $ longitude                   : int  -118654084 -118625364 -118394633 -118437206 -118385816 -118380906 -118351474 -118314900 -118331311 -118432085 ...
##  $ lotsizesquarefeet           : num  85768 4083 63085 7521 8512 ...
##  $ poolcnt                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ poolsizesum                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ pooltypeid10                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ pooltypeid2                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ pooltypeid7                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ propertycountylandusecode   : Factor w/ 235 levels "","0","010","0100",..: 14 12 150 150 162 162 22 162 14 162 ...
##  $ propertylandusetypeid       : int  269 261 47 47 31 31 260 31 269 31 ...
##  $ propertyzoningdesc          : Factor w/ 5652 levels "","**AHRP","#12",..: 1 2166 1821 1821 1843 1825 1821 574 598 4819 ...
##  $ rawcensustractandblock      : num  60378002 60378001 60377030 60371412 60371232 ...
##  $ regionidcity                : int  37688 37688 51617 12447 12447 12447 12447 396054 396054 47547 ...
##  $ regionidcounty              : int  3101 3101 3101 3101 3101 3101 3101 3101 3101 3101 ...
##  $ regionidneighborhood        : int  NA NA NA 27080 46795 46795 274049 NA NA NA ...
##  $ regionidzip                 : int  96337 96337 96095 96424 96450 96446 96049 96434 96436 96366 ...
##  $ roomcnt                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ storytypeid                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ threequarterbathnbr         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ typeconstructiontypeid      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ unitcnt                     : int  NA NA 2 NA 1 NA NA NA NA NA ...
##  $ yardbuildingsqft17          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ yardbuildingsqft26          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ yearbuilt                   : num  NA NA 1959 1948 1947 ...
##  $ numberofstories             : int  NA NA 1 1 1 1 NA 1 NA 1 ...
##  $ fireplaceflag               : Factor w/ 2 levels "","true": 1 1 1 1 1 1 1 1 1 1 ...
##  $ structuretaxvaluedollarcnt  : num  NA NA 660680 580059 196751 ...
##  $ taxvaluedollarcnt           : num  9 27516 1434941 1174475 440101 ...
##  $ assessmentyear              : int  2016 2015 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ landtaxvaluedollarcnt       : num  9 27516 774261 594416 243350 ...
##  $ taxamount                   : num  NA NA 20800 14558 5725 ...
##  $ taxdelinquencyflag          : Factor w/ 2 levels "","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ taxdelinquencyyear          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ censustractandblock         : num  NA NA NA NA NA NA NA NA NA NA ...

Data Transformation

Explore dataset

Missing Data

There are two types of missing data: MCAR: missing completely at random. This is the desirable scenario in case of missing data. MNAR: missing not at random. Missing not at random data is a more serious issue and in this case it might be wise to check the data gathering process further and try to understand why the information is missing. For instance, if most of the people in a survey did not answer a certain question, why did they do that? Was the question unclear? Assuming data is MCAR, too much missing data can be a problem too. Usually a safe maximum threshold is 5% of the total for large datasets. If missing data for a certain feature or sample is more than 5% then you probably should leave that feature or sample out. We therefore check for features (columns) and samples (rows) where more than 5% of the data is missing using a simple function

Classification of missing data

pMiss <- function(x){sum(is.na(x))/length(x)*100}
missing.bycol <- apply(properties,2,pMiss)
missing.byrow <-  apply(properties,1,pMiss)
missdata.df <- as.data.frame(missing.bycol)
setDT(missdata.df, keep.rownames = TRUE)
names(missdata.df) <- c('Col_Names', 'pct_missing')

g<-ggplot(data = missdata.df , aes(x= reorder(Col_Names, pct_missing), y=pct_missing)) + geom_bar(stat = "identity",aes(fill = pct_missing), position = position_stack(reverse= TRUE)) + coord_flip()
g

Visulization the columns more than 20% missing value

#head(missdata.df)
missdata.df20 <- missdata.df %>% filter (pct_missing>=20) 
g1<-ggplot(data = missdata.df20  , aes(x= reorder(Col_Names, pct_missing), y=pct_missing)) + geom_bar(stat = "identity",aes(fill = pct_missing), position = position_stack(reverse= TRUE)) + coord_flip()
g1

Missing values that exist in more than 20% of samples may be removed from the data, which is called “80% rule”.

Remove variables with missing > 20%

mis_prop <- sapply(properties, function(x) sum(is.na(x))/length(x))
var_rm <- names(mis_prop)[mis_prop > 1 - 0.8]
var_rm 
##  [1] "airconditioningtypeid"    "architecturalstyletypeid"
##  [3] "basementsqft"             "buildingclasstypeid"     
##  [5] "buildingqualitytypeid"    "decktypeid"              
##  [7] "finishedfloor1squarefeet" "finishedsquarefeet13"    
##  [9] "finishedsquarefeet15"     "finishedsquarefeet50"    
## [11] "finishedsquarefeet6"      "fireplacecnt"            
## [13] "garagecarcnt"             "garagetotalsqft"         
## [15] "heatingorsystemtypeid"    "poolcnt"                 
## [17] "poolsizesum"              "pooltypeid10"            
## [19] "pooltypeid2"              "pooltypeid7"             
## [21] "regionidneighborhood"     "storytypeid"             
## [23] "threequarterbathnbr"      "typeconstructiontypeid"  
## [25] "unitcnt"                  "yardbuildingsqft17"      
## [27] "yardbuildingsqft26"       "numberofstories"         
## [29] "taxdelinquencyyear"

After removing variables with more than 20% missing values, dataset had 29 variables remaining. The rest of the analysis in this project will focus on these variables.

Glimpse of the dataset

df_rm_na <- properties[, !colnames(properties) %in% var_rm]

dim(df_rm_na)
## [1] 2985217      29
str(df_rm_na)
## 'data.frame':    2985217 obs. of  29 variables:
##  $ parcelid                    : int  10754147 10759547 10843547 10859147 10879947 10898347 10933547 10940747 10954547 10976347 ...
##  $ bathroomcnt                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bedroomcnt                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ calculatedbathnbr           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ calculatedfinishedsquarefeet: num  NA NA 73026 5068 1776 ...
##  $ finishedsquarefeet12        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ fips                        : int  6037 6037 6037 6037 6037 6037 6037 6037 6037 6037 ...
##  $ fullbathcnt                 : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ hashottuborspa              : Factor w/ 2 levels "","true": 1 1 1 1 1 1 1 1 1 1 ...
##  $ latitude                    : int  34144442 34140430 33989359 34148863 34194168 34171873 34131929 34171345 34218210 34289776 ...
##  $ longitude                   : int  -118654084 -118625364 -118394633 -118437206 -118385816 -118380906 -118351474 -118314900 -118331311 -118432085 ...
##  $ lotsizesquarefeet           : num  85768 4083 63085 7521 8512 ...
##  $ propertycountylandusecode   : Factor w/ 235 levels "","0","010","0100",..: 14 12 150 150 162 162 22 162 14 162 ...
##  $ propertylandusetypeid       : int  269 261 47 47 31 31 260 31 269 31 ...
##  $ propertyzoningdesc          : Factor w/ 5652 levels "","**AHRP","#12",..: 1 2166 1821 1821 1843 1825 1821 574 598 4819 ...
##  $ rawcensustractandblock      : num  60378002 60378001 60377030 60371412 60371232 ...
##  $ regionidcity                : int  37688 37688 51617 12447 12447 12447 12447 396054 396054 47547 ...
##  $ regionidcounty              : int  3101 3101 3101 3101 3101 3101 3101 3101 3101 3101 ...
##  $ regionidzip                 : int  96337 96337 96095 96424 96450 96446 96049 96434 96436 96366 ...
##  $ roomcnt                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ yearbuilt                   : num  NA NA 1959 1948 1947 ...
##  $ fireplaceflag               : Factor w/ 2 levels "","true": 1 1 1 1 1 1 1 1 1 1 ...
##  $ structuretaxvaluedollarcnt  : num  NA NA 660680 580059 196751 ...
##  $ taxvaluedollarcnt           : num  9 27516 1434941 1174475 440101 ...
##  $ assessmentyear              : int  2016 2015 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ landtaxvaluedollarcnt       : num  9 27516 774261 594416 243350 ...
##  $ taxamount                   : num  NA NA 20800 14558 5725 ...
##  $ taxdelinquencyflag          : Factor w/ 2 levels "","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ censustractandblock         : num  NA NA NA NA NA NA NA NA NA NA ...

Descriptive Analyis of Data

Descriptive analysis one

#Subset assigned columns from original daqta set
col_index<-c(2:9)
juanelle<-df_rm_na[,col_index]
head(juanelle)
##   bathroomcnt bedroomcnt calculatedbathnbr calculatedfinishedsquarefeet
## 1           0          0                NA                           NA
## 2           0          0                NA                           NA
## 3           0          0                NA                        73026
## 4           0          0                NA                         5068
## 5           0          0                NA                         1776
## 6           0          0                NA                         2400
##   finishedsquarefeet12 fips fullbathcnt hashottuborspa
## 1                   NA 6037          NA               
## 2                   NA 6037          NA               
## 3                   NA 6037          NA               
## 4                   NA 6037          NA               
## 5                   NA 6037          NA               
## 6                   NA 6037          NA

Since the data under each of the above variables in this subset are categorical in nature, the reseacher chose to first use the table and summary function to gain a sense of the proportion of each level under each variable.

Bedroom Count

The variable ‘bedroomcnt’ provides information on the nummber of bedrooms in a home.

table(juanelle$bedroomcnt)
## 
##       0       1       2       3       4       5       6       7       8 
##  118705   86941  606782 1172757  731475  182765   48915   12763   13542 
##       9      10      11      12      13      14      15      16      17 
##    4279    1702     425     959      86      69      24      50      11 
##      18      19      20      21      23      24      25 
##       9       1       8       1       1       1       1
summary(juanelle$bedroomcnt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.000   3.000   3.093   4.000  25.000    2945

The above summary shows that the homes in the data set have an average of 3.093 bedrooms.The maximum amount of bedrooms found in homes in the data set is 25 and the minimum 0. Homes with bedromm counts 19,21,23,24 and 25 are outliers in this analysis as shown in the boxplot below. It is obvious that the distribution is unimodal and skewed to the right.

boxplot(juanelle$bedroomcnt)

Bathroom count

Description of variable according to data dictionary: Number of bathrooms in home including fractional bathroom.

bathroomcnt<-table(juanelle$bathroomcnt)
bathroomcnt
## 
##       0     0.5       1     1.5    1.75       2     2.5       3     3.5 
##  113470      16  499332   45735       4 1219811  208809  633089   31835 
##       4     4.5       5     5.5       6     6.5       7     7.5       8 
##  133922   19864   38514    6275   16416    1352    6221     385    4548 
##     8.5       9     9.5      10    10.5      11    11.5      12    12.5 
##     113    1341      50     496      14     200       3     269       3 
##      13      14    14.5      15      16      17      18      19    19.5 
##      53      39       1      21      25       8      12       3       1 
##      20      31      32 
##       8       1       1
summary(juanelle$bathroomcnt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   2.000   2.000   2.216   3.000  32.000    2957

Calculated bathroom number

Description of variable according to data dictionary: Number of bathrooms in home including fractional bathroom.

table(juanelle$calculatedbathnbr)
## 
##       1     1.5       2     2.5       3     3.5       4     4.5       5 
##  499324   45427 1219799  208578  633088   31773  133922   19811   38514 
##     5.5       6     6.5       7     7.5       8     8.5       9     9.5 
##    6259   16416    1340    6221     382    4548     110    1341      50 
##      10    10.5      11    11.5      12    12.5      13      14    14.5 
##     496      14     200       3     269       3      53      39       1 
##      15      16      17      18      19    19.5      20      31      32 
##      21      25       8      12       3       1       8       1       1
summary(juanelle$calculatedbathnbr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0     2.0     2.0     2.3     3.0    32.0  117156

From observation, bathroomcnt and calculatedbathnber provide the same information. However, the latter has far more missing values. Thus, bathoom count is the ideal variable to use to provide information on number of bathrooms, inclusive of fractional bathroom in homes. Shown below is a barplot visualising the distribution of bathroom count. The distribution is unimodal and skewed to the right.

barplot(bathroomcnt, main = "Bathroom Count")

Full bath count

Description of variable according to data dictionary: Number of full bathrooms (sink, shower + bathtub, and toilet) present in home

fullbath<-table(juanelle$fullbathcnt)
fullbath
## 
##       1       2       3       4       5       6       7       8       9 
##  544794 1428927  664914  153684   44721   17499    6468    4575    1347 
##      10      11      12      13      14      15      16      17      18 
##     495     197     268      54      39      20      25       8      12 
##      19      20      31      32 
##       4       8       1       1
summary(juanelle$fullbathcnt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    2.00    2.00    2.25    3.00   32.00  117156

Based on the summary, the maximum number of full bathroom’s in the data set is 32 and minimum 1. Shown below are two visualisations of this information:

barplot(fullbath, main = " Full Bath Count")

> Thus visualisation shows that the distribution is unimodal and right skewed with the center being somewhere between two and three.

boxplot(juanelle$fullbathcnt)

>The box above shows that there are a significant amount of outliers in this data set.

Hot Tub or Spa

Description of variable according to data dictionary: This variable indicates whether or noy a home has a hot tub or spa

summary(juanelle$hashottuborspa)
##            true 
## 2935155   50062

Based on the summary on hottuborspa, There are 50062 homes in the data set with hottub or spa

Finished Square Feet

Description of variable according to data dictionary: Finished total living room area of home

# table(juanelle$calculatedfinishedsquarefeet)
summary(juanelle$calculatedfinishedsquarefeet)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1    1215    1574    1832    2140  952576   45097

Finished Square feet 12

Description of variable according to data dictionary:Finished living area

#table(juanelle$finishedsquarefeet12)
summary(juanelle$finishedsquarefeet12)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1    1198    1542    1764    2075  427079  264431

Federal Information Processing Standard Code (FIPS)

Description of variable according to data dictionary:

table(juanelle$fips)
## 
##    6037    6059    6111 
## 2012741  745800  223744
summary(juanelle$fips)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6037    6037    6037    6048    6059    6111    2932

Descriptive analysis two

Regionidcity

Based on regionidcity variable, the grouped longitude and latitude values were extracted and mean longitude and latitude were calculated. This geographic interactive widget displays where regionid is in relation to a regional map. We can determine the dataset contains properties within Southern California.

properties %>% 
  count(regionidcity)
## # A tibble: 187 x 2
##    regionidcity     n
##           <int> <int>
##  1         3491  1076
##  2         3980     6
##  3         4406 21331
##  4         5465 13499
##  5         5534 49123
##  6         6021 14769
##  7         6285    19
##  8         6395  7772
##  9         6822   781
## 10         8384 16361
## # ... with 177 more rows
city_pal <- colorFactor("Set2", properties$regionidcity)

df_rm_na %>% 
  group_by(regionidcity = as.factor(regionidcity)) %>% 
  summarise(avg_lng = mean(longitude/1e6, na.rm = T),
            avg_lat = mean(latitude/1e6, na.rm =T)) %>% 
  leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addCircleMarkers(lng = ~avg_lng, lat = ~avg_lat, color = ~city_pal(regionidcity),
                   stroke = FALSE, fillOpacity = 0.8, popup = ~regionidcity)
## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

Connecting to Google Maps API and pulling Long/Lat location

To add detailed mapping to our analysis, a Google Map link was established using the Google Map API. This portion of code demonstrates the connection and map pull using the average longitude and latitude information above.

col_index<-c(10,11,16,23:29)
calvin<-df_rm_na[,col_index]

col_index<-c(1,2,5,6)
houses <- calvin[,col_index]
houses <- setDT(houses) %>% sample_frac(.01)
houses[, longitude := ifelse(longitude > 1e+10, longitude/1e+07, longitude/1e+06)]
houses[is.na(longitude), longitude := -118.255]
houses[, latitude := ifelse(is.na(latitude), 34.049, latitude/1e+06)]
houses[is.na(taxvaluedollarcnt), taxvaluedollarcnt := median(taxvaluedollarcnt, na.rm = TRUE)]

register_google(key = "AIzaSyCkdfkuRrcEbtgWtW6a2c5CNqjIjk6pfgY")
ggmap_credentials()
## Warning: 'ggmap_credentials' is deprecated.
## Use 'getOption("ggmap")' instead.
## See help("Deprecated")
## $google
## Key - AIzaSyCkdfkuRrcEbtgWtW6a2c5CNqjIjk6pfgY 
## Account Type - standard 
## Day Limit - Inf 
## Second Limit - Inf 
## Client -  
## Signature -  
## 
## $display_api_key
## [1] FALSE
## 
## attr(,"class")
## [1] "ggmap_credentials"
mapa <- get_map(location = c(lon = -118.255, lat = 34.049), zoom = 8, maptype = "hybrid") 
## Source : https://maps.googleapis.com/maps/api/staticmap?center=34.049,-118.255&zoom=8&size=640x640&scale=2&maptype=hybrid&language=en-EN&key=xxx

Rawcensustractandblock

Census tract and block ID combined - also contains blockgroup assignment by extension

Census tracts are geographic entities within counties and identified by a 4-digit basic code between 0001 and 9999, and may have a 2-digit suffix ranging from .01 to .98; for example, 6059.02. Census tracts are made up of census blocks. Even though a block code only occurs once in each tract, it may be used again in another tract. Blocks within tracts are identified by a 4-digit code.

If you paste these codes together, identifying each block requires the full 15 digits. For example, the census block containing the Office of the California Secretary of State, located at 1500 11th St, Sacramento, CA 95814, is fully identified by the code 06|067|001101|1085 06 – identifies California, 067 – identifies Sacramento County within California, 001101 – identifies Census Tract 11.01 within Sacramento County 1085 – identifies Census Block 1085 within tract 11.01.

Recommend to keep rawcensustractandblock

The raw data contained within rawcensustractandblock is unmodified and better suited for our analysis. We will therefore utilize this variable for further analysis.

Censustractandblock

Census tract and block ID combined - also contains blockgroup assignment by extension

Assessmentyear

The year of the property tax assessment

Here, we were able to utilize the Google Map pull and insert assessmentyear variable unto it:

ggmap(mapa) +
  geom_point(data = houses, aes(x = longitude, y = latitude, color = assessmentyear, alpha = 0.8)) +
  scale_colour_gradient2(low = "darkred", mid = "white", high = "yellow", midpoint = 2015, space = "Lab", na.value =     "grey50", guide = "colourbar") +
  theme(legend.position = "null", axis.title.x = element_blank(), axis.title.y = element_blank())

Landtaxvaluedollarcnt

The assessed value of the land area of the parcel

Assessors generally use the market value of land to help determine the assessed land value of a property. If private sales of similar properties cannot be located, assessors may have to use development cost to help determine assessed values. Land information can include recent sales, location, permitted land use and other market related factors. When assessors determine the value of the property, they include the land at its market value, plus the buildings and other structures at their depreciated replacement cost. Once the assessors calculate the fair value of the land, they then calculate the value of the buildings on the land.

summary(calvin$landtaxvaluedollarcnt)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##        1    79700   176619   268456   326100 94011079    59926

Structuretaxvaluedollarcnt

The assessed value of the built structure on the parcel

Improvements refer to buildings and permanent structures on the land. The most common way to determine these values is to consider how much money it would take at the current material and labor costs to replace your building with one like it. If the building is not new, consideration must also be given to how much it has depreciated. To be able to determine the value of any piece of property, it is necessary for the assessor to know what it might cost today to replace it, the selling price of similar properties and many other facts. Building information can include building size, features, type of construction, materials used, age, condition and quality of construction. Also considered are factors such as flooring, plumbing fixtures, furnaces, and finished basement rooms.

summary(calvin$structuretaxvaluedollarcnt)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##         1     77666    127066    178143    204000 255321161     46464

Taxvaluedollarcnt

The total tax assessed value of the parcel

Property taxes are calculated on the assessed value of property. Property assessment provides the basis for determining each person’s fair share of taxes in proportion to the amount his or her individual property is worth. The final assessed value is used by government (territorial, city, town, village, etc….) to determine the amount of taxes payable by the property owner.

Here, we were able to utilize the Google Map pull and insert taxvaluedollarcnt variable unto it.

summary(calvin$taxvaluedollarcnt)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##         1    188220    321161    443528    514072 319622473     34266
ggmap(mapa) +
  geom_point(data = houses, aes(x = longitude, y = latitude, color = taxvaluedollarcnt, alpha = 1/1000)) +
  scale_colour_gradient2(low = "darkred", mid = "white", high = "yellow", midpoint = 250000, space = "Lab", na.value =     "grey50", guide = "colourbar") +
  theme(legend.position = "null", axis.title.x = element_blank(), axis.title.y = element_blank())

Taxamount

The total property tax assessed for that assessment year

Assessors do not collect taxes or determine the amount of taxes to be collected. This is left up to the Municipality in which in which the property recides. Assessors do determine the fair value of a property so that taxes can be calculated.

plt <- qplot(calvin$taxamount, geom="histogram", binwidth = 1000, main = "Distribution of Tax Amount", ylab =    "Frequency", xlab       = "Tax Amount", fill=I("blue"), col=I("red"), alpha=I(.2), xlim=c(5000,30000)) + 
      theme(axis.text.y = element_text(angle=45))

suppressWarnings(print(plt))

Tax Formula

landtaxvaluedollarcnt + structuretaxvaluedollarcnt = taxvaluedollarcnt

taxamount = taxable % (determined by municipality) * taxvaluedollarcnt

Taxdelinquencyflag

Property taxes for this parcel are past due as of 2015

Tax payments must be postmarked by the due date in order to be processed as a timely payment. This variable shows properties with a past due property tax as of 2015. This could mean that the property has a property-tax lien on it. A property-tax lien is a legal claim against a property for unpaid property taxes. A tax lien prohibits a property from being sold or refinanced until the taxes are paid and the lien is removed.

Descriptive analysis three

#str(df_rm_na)

#what are property types like-count
PropertyLandUsedDesc <- function(propertylandusetypeid)
{  propertylandusetypeid<- as.character(propertylandusetypeid)
if (propertylandusetypeid== 31)
{
   return ('Commercial/Office/Residential Mixed Used')

}
else if (propertylandusetypeid== 46)
{
   return ('Multi-Story Store')
}
else if (propertylandusetypeid== 47)
{
   return ('Store/Office (Mixed Use)')
}
else if (propertylandusetypeid== 246)
{
   return ('Duplex (2 Units, Any Combination)')
}
else if (propertylandusetypeid== 247)
{
   return ('Triplex (3 Units, Any Combination)')
}
else if (propertylandusetypeid== 248)
{
   return ('Quadruplex (4 Units, Any Combination)')
}
else if (propertylandusetypeid== 260)
{
   return ('Residential General')
}
else if (propertylandusetypeid== 261)
{
   return ('Single Family Residential')
}
else if (propertylandusetypeid== 262)
{
   return ('Rural Residence')
}
else if (propertylandusetypeid== 263)
{
   return ('Mobile Home')
}
else if (propertylandusetypeid== 264)
{
   return ('Townhouse')
}
else if (propertylandusetypeid== 265)
{
   return ('Cluster Home')
}
else if (propertylandusetypeid== 266)
{
   return ('Condominium')
}
else if (propertylandusetypeid== 267)
{
   return ('Cooperative')
}
else if (propertylandusetypeid== 268)
{
   return ('Row House')
}
else if (propertylandusetypeid== 269)
{
   return ('Planned Unit Development')
}
else if (propertylandusetypeid== 270)
{
   return ('Residential Common Area')
}
else if (propertylandusetypeid== 271)
{
   return ('Timeshare')
}
else if (propertylandusetypeid== 273)
{
   return ('Bungalow')
}
else if (propertylandusetypeid== 274)
{
   return ('Zero Lot Line')
}
else if (propertylandusetypeid== 275)
{
   return ('Manufactured, Modular, Prefabricated Homes')
}
else if (propertylandusetypeid== 276)
{
   return ('Patio Home')
}
else if (propertylandusetypeid== 279)
{
   return ('Inferred Single Family Residential')
}
else if (propertylandusetypeid== 290)
{
   return ('Vacant Land - General')
}
else if (propertylandusetypeid==291)
{
   return ('Residential Vacant Land')
}
else
{
   return ('Not Determined')
}
}

#visualisation of property types count

Machine Learning Model

Using Machine learning model to figure out whether we can predict the Zillow house price (Close to acutal logerror )

Load train table

### load dataset with logerror 
traindf <- dbGetQuery(conn, "select * from train_2017")
### mearge datasets from properities and dataset with logerror
target_df <- merge.data.frame(traindf, df_rm_na,by="parcelid")

Preparation of the datast for ML.

Data Cleansing

# Organize numeric and and categorical data
#Convert some coulumns such as propertylandusetypeid , fips to category data 
target_df$propertylandusetypeid<-factor(target_df$propertylandusetypeid)
target_df$fips <- factor (target_df$fips)

Dealing with missing binary values in columns . Change all NA value to -1 according to Kaggle suggestions

### build function to fill in NA value and clean data in the fireplaceflag column
flagColClean <- function(emp)
{  emp<- as.character(emp)
if (emp=="NA"  | emp =="")
{
   
    return (-1)
}

else 
{
    return (1)
}

}

target_df$fireplaceflag<- sapply( target_df$fireplaceflag, flagColClean)
target_df$fireplaceflag<- factor( target_df$fireplaceflag )
### build function to fill in NA value and clean data in the ftaxdelinquencyflag column
flagcol <- target_df %>% select (contains('flag'))

flagColClean <- function(emp)
{  emp<- as.character(emp)
if (emp=="NA"  | emp =="")
{
   
    return (-1)
}

else 
{
    return (1)
}

}

target_df$taxdelinquencyflag<- sapply( target_df$taxdelinquencyflag, flagColClean)
target_df$taxdelinquencyflag<- factor( target_df$taxdelinquencyflag )


###  Extract Month from transactionDate variable
 
target_df$transactionMonth <- as.POSIXlt(target_df$transactiondate, format="%Y-%m-%d") $mon

Get unique value per parcelid

### Make sure there is no duplicated rows per parcelid  (Remove duplicate records based on the parcelid)
target_df <- target_df %>% distinct(parcelid, .keep_all = TRUE)
rownames(target_df) <- target_df$parcelid

#remove all rows with NA value
target_df2 <- target_df[complete.cases(target_df), ]

Feature Engineering

The purpose of this section is to reorganize some columns in the dataset and to create new variables.

# Create yearbuilt col to caluclate the life of property
target_df$yearbuilt[is.na(target_df$yearbuilt)]<-median(target_df$yearbuilt,na.rm=T)
target_df2$N.HousYear = 2018 -target_df2$yearbuilt

# create .LivingAreaProp col to find the living area per total lot 
target_df2$N.LivingAreaProp =target_df2$calculatedfinishedsquarefeet /target_df2$lotsizesquarefeet

#  Create N.ExtraSpace to indicate the size of area besides of the living area 
target_df2$N.ExtraSpace =target_df2$lotsizesquarefeet -target_df2$calculatedfinishedsquarefeet
  
# Total number of bed rooms
target_df2$N.TotalRooms =target_df2$bathroomcnt +target_df2$bedroomcnt

#  create var to indicate the average room size
target_df2$N.AvRoomSize  =target_df2$calculatedfinishedsquarefeet/target_df2$N.TotalRooms

#  create var to indicate how many extra rooms avaiable besides of bed romm
target_df2$N.ExtraRooms =target_df2$roomcnt -target_df2$N.TotalRooms

# Ratio of the built structure value to land area
target_df2$N.ValueProp =target_df2$structuretaxvaluedollarcnt /target_df2$landtaxvaluedollarcnt
### Deal with latitude and longitude data 
target_df2$latitude = as.numeric(target_df2$latitude)
target_df2$longitude= as.numeric(target_df2$longitude)
target_df2$N.location  = target_df2$latitude +target_df2$longitude 
target_df2$N.location2  = as.numeric(target_df2$latitude/target_df2$longitude )

#  Count the number of properties per region 
city_count  <-  count(target_df2, regionidcity)
names(city_count) <- c('regionidcity','City_Count')
target_df2<-merge(x=target_df2,y=city_count,by="regionidcity" ,all=TRUE)

#  Count the number of properties per  County
county_count  <-  count(target_df2, regionidcounty)
names(county_count) <- c('regionidcounty','County_Count')
target_df2<-merge(x=target_df2,y=county_count,by="regionidcounty" ,all=TRUE)

 #  Count the number of properties per area (located in the same zip area)
zip_count  <-  count(target_df2, regionidzip)
names(zip_count) <- c('regionidzip','Zip_Count')
target_df2<-merge(x=target_df2,y= zip_count  ,by="regionidzip" ,all=TRUE)

# Create variable to get the amount of tax per 
target_df2$N.ValueRatio  =   target_df2$taxamount /target_df2$taxvaluedollarcnt

# Create variable to get the total amount of tax 
target_df2$N.TaxScore  =target_df2$taxvaluedollarcnt*target_df2$taxamount 

Finalize features

## remove columns with free text 
target_df2 <- target_df2 %>% select (-contains('propertyzoningdesc'), -contains('itude'), -starts_with('year'),-contains('itude'), -contains('rowcensus'),-contains('transactiondate'), -contains('regionid'), -contains('assessmentyear'))  

## list all datatype of datafame 
split(names(target_df2),sapply(target_df2, function(x) paste(class(x), collapse=" ")))
## $factor
## [1] "fips"                      "hashottuborspa"           
## [3] "propertycountylandusecode" "propertylandusetypeid"    
## [5] "fireplaceflag"             "taxdelinquencyflag"       
## 
## $integer
## [1] "parcelid"             "finishedsquarefeet12" "fullbathcnt"         
## [4] "transactionMonth"     "City_Count"           "County_Count"        
## [7] "Zip_Count"           
## 
## $numeric
##  [1] "logerror"                     "bathroomcnt"                 
##  [3] "bedroomcnt"                   "calculatedbathnbr"           
##  [5] "calculatedfinishedsquarefeet" "lotsizesquarefeet"           
##  [7] "rawcensustractandblock"       "roomcnt"                     
##  [9] "structuretaxvaluedollarcnt"   "taxvaluedollarcnt"           
## [11] "landtaxvaluedollarcnt"        "taxamount"                   
## [13] "censustractandblock"          "N.HousYear"                  
## [15] "N.LivingAreaProp"             "N.ExtraSpace"                
## [17] "N.TotalRooms"                 "N.AvRoomSize"                
## [19] "N.ExtraRooms"                 "N.ValueProp"                 
## [21] "N.location"                   "N.location2"                 
## [23] "N.ValueRatio"                 "N.TaxScore"

Normalizing the Numeric Data

In this section we will normalize the numeric data so as to change the shape of the distribution.

### Find all numerical and categorical columns

numcol <- sapply(target_df2,is.numeric) 

### Manuaully change logerror and awcensustractandblock into categorical data because we do not want to normalize those two columns. 
numcol['logerror'] <- FALSE
numcol ['rawcensustractandblock'] <- FALSE
numcol_df<- as.data.frame(numcol)
numcol_df <- setDT(numcol_df, keep.rownames = TRUE)[]
names(numcol_df) <- c ("ColName","IsNumeric")
Numeric_col <- numcol_df %>% filter (IsNumeric==TRUE)
Numeric_col <- Numeric_col$ColName


###  build scaling function 
normfun <- function(x) (x - min(x))/(max(x)-min(x))

### apply the function on all the numeric data 
for (i in Numeric_col)
{
    {target_df2[i] <- lapply(target_df2[i], normfun)}
  
}

Further Data Cleansing

There are some categorical variables with text data or multiple levels. These variables needed to be further cleaned up.

###  change 
target_df2$fips = as.numeric(factor(target_df2$fips, 
                                      levels = c("6037", "6059" ,"6111"),
                                      labels = c(1, 2, 3)))
 

target_df2$hashottuborspa = as.numeric(factor(target_df2$hashottuborspa, 
                                      levels = c(""   ,  "true"),
                                      labels = c(-1, 1)))

Convert propertycountylandusecode

Transform categorical variables

landCodeClean <- function(emp)
{  emp<- as.character(emp)
if (emp=="0100" )
{
   
    return (1)
}

else if (emp=="22") 
{
    return (2)
}

else if (emp=="010C") 
{
    return (3)
}
else 
{
  return (4)
}

}



target_df2$propertycountylandusecode<- sapply( target_df2$propertycountylandusecode, landCodeClean)
target_df2$propertycountylandusecode<- factor(target_df2$propertycountylandusecode )
propertypeClean <- function(emp)
{  emp<- as.character(emp)
if (emp=="261" )
{
   
    return (1)
}

else if (emp=="266") 
{
    return (2)
}

else if (emp=="269") 
{
    return (3)
}
else 
{
  return (4)
}

}


target_df2$propertylandusetypeid<- sapply( target_df2$propertylandusetypeid, propertypeClean)
target_df2$propertylandusetypeid<- factor( target_df2$propertylandusetypeid )

Finalized_dataset

dataset <- target_df2

str(dataset)
## 'data.frame':    64065 obs. of  37 variables:
##  $ parcelid                    : num  0.00976 0.00975 0.0097 0.00979 0.00976 ...
##  $ logerror                    : num  0.15904 0.62487 0.22207 -0.05362 -0.00454 ...
##  $ bathroomcnt                 : num  0 0 0.167 0 0 ...
##  $ bedroomcnt                  : num  0.273 0.182 0.545 0.182 0.182 ...
##  $ calculatedbathnbr           : num  0 0 0.167 0 0 ...
##  $ calculatedfinishedsquarefeet: num  0.0489 0.0408 0.1266 0.0253 0.0287 ...
##  $ finishedsquarefeet12        : num  0.0489 0.0408 0.1266 0.0253 0.0287 ...
##  $ fips                        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ fullbathcnt                 : num  0 0 0.167 0 0 ...
##  $ hashottuborspa              : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ lotsizesquarefeet           : num  0.000712 0.000769 0.001014 0.000698 0.000658 ...
##  $ propertycountylandusecode   : Factor w/ 3 levels "1","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  $ propertylandusetypeid       : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  $ rawcensustractandblock      : num  60372395 60375350 60375330 60372398 60375350 ...
##  $ roomcnt                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fireplaceflag               : Factor w/ 2 levels "-1","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ structuretaxvaluedollarcnt  : num  0.00664 0.00141 0.02021 0.00102 0.00107 ...
##  $ taxvaluedollarcnt           : num  0.011654 0.00089 0.012117 0.000816 0.000726 ...
##  $ landtaxvaluedollarcnt       : num  0.010649 0.000566 0.005607 0.000641 0.00052 ...
##  $ taxamount                   : num  0.01275 0.00255 0.01456 0.00112 0.00286 ...
##  $ taxdelinquencyflag          : Factor w/ 2 levels "-1","1": 1 1 1 1 1 2 1 1 1 1 ...
##  $ censustractandblock         : num  3.27e-06 1.03e-05 1.02e-05 3.28e-06 1.03e-05 ...
##  $ transactionMonth            : num  1 0.625 0 0.375 0.75 0.125 0.375 0.5 0.75 0.375 ...
##  $ N.HousYear                  : num  0.645 0.674 0.478 0.652 0.652 ...
##  $ N.LivingAreaProp            : num  0.0224 0.0178 0.0381 0.0132 0.0154 ...
##  $ N.ExtraSpace                : num  0.000932 0.001014 0.000991 0.000991 0.00094 ...
##  $ N.TotalRooms                : num  0.15 0.1 0.4 0.1 0.1 0.2 0.1 0.2 0.1 0.05 ...
##  $ N.AvRoomSize                : num  0.0948 0.1091 0.1012 0.0713 0.0798 ...
##  $ N.ExtraRooms                : num  0.607 0.643 0.429 0.643 0.643 ...
##  $ N.ValueProp                 : num  0.000418 0.001659 0.002414 0.001072 0.001374 ...
##  $ N.location                  : num  0.402 0.411 0.417 0.4 0.41 ...
##  $ N.location2                 : num  0.581 0.577 0.57 0.585 0.576 ...
##  $ City_Count                  : num  1 0.00568 0.00568 1 0.00568 ...
##  $ County_Count                : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Zip_Count                   : num  0.0857 0.0857 0.0857 0.0857 0.0857 ...
##  $ N.ValueRatio                : num  0.0251 0.0597 0.0276 0.0291 0.0793 ...
##  $ N.TaxScore                  : num  1.52e-04 2.56e-06 1.81e-04 1.00e-06 2.39e-06 ...
dataset <- dataset %>% select(-contains("parcelid")) 

Use of H2o to build machine learning models

H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment.

#install.packages('h2o')
######################################
# Setup h2o
######################################
h2o.init(nthreads = -1, max_mem_size = '8g', ip = "127.0.0.1")
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         37 minutes 1 seconds 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.20.0.8 
##     H2O cluster version age:    2 months and 18 days  
##     H2O cluster name:           H2O_started_from_R_cwong79_ndv646 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   7.89 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          127.0.0.1 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.5.1 (2018-07-02)
###  split dataset  Best approach by using the Gridsearch  (Skipped in the project )

datah2o <- as.h2o(dataset)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
splits <- h2o.splitFrame(data = datah2o, 
                         ratios = c(0.7, 0.15),  #
                         seed = 1)  #setting a seed will guarantee reproducibility

### split dataset into train , valid and test dataset 
train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]


# Identify predictors and response
### Y is our target variable to predict 
y <- "logerror"

### chose all variables except our taraget variable
x <- names(train)[which(names(train)!="logerror")]

### list all features /varaiables
print(x)
##  [1] "bathroomcnt"                  "bedroomcnt"                  
##  [3] "calculatedbathnbr"            "calculatedfinishedsquarefeet"
##  [5] "finishedsquarefeet12"         "fips"                        
##  [7] "fullbathcnt"                  "hashottuborspa"              
##  [9] "lotsizesquarefeet"            "propertycountylandusecode"   
## [11] "propertylandusetypeid"        "rawcensustractandblock"      
## [13] "roomcnt"                      "fireplaceflag"               
## [15] "structuretaxvaluedollarcnt"   "taxvaluedollarcnt"           
## [17] "landtaxvaluedollarcnt"        "taxamount"                   
## [19] "taxdelinquencyflag"           "censustractandblock"         
## [21] "transactionMonth"             "N.HousYear"                  
## [23] "N.LivingAreaProp"             "N.ExtraSpace"                
## [25] "N.TotalRooms"                 "N.AvRoomSize"                
## [27] "N.ExtraRooms"                 "N.ValueProp"                 
## [29] "N.location"                   "N.location2"                 
## [31] "City_Count"                   "County_Count"                
## [33] "Zip_Count"                    "N.ValueRatio"                
## [35] "N.TaxScore"
## Show number of records in the each dataset 
nrow(train)  # 114908
## [1] 44851
nrow(valid) # 24498
## [1] 9557
nrow(test)  # 24581
## [1] 9657

Basic implementation by using default parameter

We chose Boosting algorithms as our ML method .We chose boosting algorithms because they are extremely popular machine learning algorithms that have proven successful across many domains and is one of the leading methods for winning Kaggle competitions. Whereas random forests build an ensemble of deep independent trees, GBMs build an ensemble of shallow and weak successive trees with each tree learning and improving on the previous. When combined, these many weak successive trees produce a powerful “committee” that are often hard to beat with other algorithms.

Gradient Boosting Machine (for Regression and Classification) is a forward learning ensemble method. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O’s GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel.

For future modelling pruposes

Default parameter in H2O

# library(h2o)
# h2o.init(nthreads = -1, max_mem_size = "8g")
# 
#  
# ###  split dataset  Best approach by using the Gridsearch  (Skipped in the project )
# 
# datah2o <- as.h2o(dataset)
# splits <- h2o.splitFrame(data = datah2o, 
#                          ratios = c(0.7, 0.15),  #
#                          seed = 1)  #setting a seed will guarantee reproducibility
# train <- splits[[1]]
# valid <- splits[[2]]
# test <- splits[[3]]
# 
# 
# # Identify predictors and response
# x <- names(train)[which(names(train)!="logerror")]
# y <- "logerror"
# print(x)
#gbm.fit <- h2o.gbm(
#  x = x,
#  y = y,
#  training_frame = train,
#  nfolds = 5,
#  ntrees = 5000,
 # stopping_rounds = 10,
 # stopping_tolerance = 0,
 # seed = 123
#)
#h2o.varimp_plot(gbm.fit , num_of_features = 15) 
#h2o.mae(gbm.fit)
#h2o.shutdown()

Tuning of the model

We changed the following parameters of the model:

Tree complexity:ntrees: number of trees to train max_depth: depth of each tree min_rows: Fewest observations allowed in a terminal node Learning rate: learn_rate: rate to descend the loss function gradient learn_rate_annealing: allows you to have a high initial learn_rate, then gradually reduce as trees are added (speeds up training). Adding stochastic nature: sample_rate: row sample rate per tree col_sample_rate: column sample rate per tree (synonymous with xgboost’s colsample_bytree)

# create hyperparameter grid
hyper_grid <- expand.grid(
  shrinkage = c(.01, .1, .3),
  interaction.depth = c(1, 3, 5),
  n.minobsinnode = c(5, 10, 15),
  bag.fraction = c(.65, .8, 1), 
  optimal_trees = 0,               # a place to dump results
  min_RMSE = 0                     # a place to dump results
)


search_criteria <- list(
  strategy = "RandomDiscrete",
  stopping_metric = "mse",
  stopping_tolerance = 0.005,
  stopping_rounds = 10,
  max_runtime_secs = 60*60
  )
#
# perform grid search 
grid <- h2o.grid(
  algorithm = "gbm",
  grid_id = "gbm_grid2",
  x = x, 
  y = y, 
  training_frame = train,
  validation_frame = valid,
  #hyper_params = hyper_grid,
  search_criteria = search_criteria, # add search criteria
  ntrees = 5000,
  stopping_rounds = 10,
  stopping_tolerance = 0,
  seed = 123
  )
## 
## ERROR: Unexpected HTTP Status code: 412 Precondition Failed (url = http://127.0.0.1:54321/99/Grid/gbm)
## 
## water.exceptions.H2OIllegalArgumentException
##  [1] "water.exceptions.H2OIllegalArgumentException: Illegal argument: training_frame of function: grid: Cannot append new models to a grid with different training input"
##  [2] "    hex.grid.GridSearch.start(GridSearch.java:103)"                                                                                                                
##  [3] "    hex.grid.GridSearch.startGridSearch(GridSearch.java:449)"                                                                                                      
##  [4] "    hex.grid.GridSearch.startGridSearch(GridSearch.java:397)"                                                                                                      
##  [5] "    water.api.GridSearchHandler.handle(GridSearchHandler.java:103)"                                                                                                
##  [6] "    water.api.GridSearchHandler.handle(GridSearchHandler.java:36)"                                                                                                 
##  [7] "    water.api.RequestServer.serve(RequestServer.java:451)"                                                                                                         
##  [8] "    water.api.RequestServer.doGeneric(RequestServer.java:296)"                                                                                                     
##  [9] "    water.api.RequestServer.doPost(RequestServer.java:222)"                                                                                                        
## [10] "    javax.servlet.http.HttpServlet.service(HttpServlet.java:755)"                                                                                                  
## [11] "    javax.servlet.http.HttpServlet.service(HttpServlet.java:848)"                                                                                                  
## [12] "    org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)"                                                                                        
## [13] "    org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)"                                                                                    
## [14] "    org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)"                                                                            
## [15] "    org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:429)"                                                                                     
## [16] "    org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)"                                                                             
## [17] "    org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)"                                                                                 
## [18] "    org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"                                                                         
## [19] "    org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"                                                                               
## [20] "    water.JettyHTTPD$LoginHandler.handle(JettyHTTPD.java:197)"                                                                                                     
## [21] "    org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"                                                                         
## [22] "    org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"                                                                               
## [23] "    org.eclipse.jetty.server.Server.handle(Server.java:370)"                                                                                                       
## [24] "    org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)"                                                                
## [25] "    org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)"                                                                 
## [26] "    org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)"                                                                      
## [27] "    org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)"                                                      
## [28] "    org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)"                                                                                              
## [29] "    org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)"                                                                                         
## [30] "    org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)"                                                                        
## [31] "    org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)"                                                                  
## [32] "    org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)"                                                                              
## [33] "    org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)"                                                                               
## [34] "    java.base/java.lang.Thread.run(Thread.java:834)"
## Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, : 
## 
## ERROR MESSAGE:
## 
## Illegal argument: training_frame of function: grid: Cannot append new models to a grid with different training input
# collect the results and sort by our model performance metric of choice
grid_perf <- h2o.getGrid(
  grid_id = "gbm_grid2", 
  sort_by = "mse", 
  decreasing = FALSE
  )
grid_perf
## H2O Grid Details
## ================
## 
## Grid ID: gbm_grid2 
## Used hyper parameters: 
## Number of models: 2 
## Number of failed models: 0 
## 
## Hyper-Parameter Search Summary: ordered by increasing mse
##           model_ids                  mse
## 1 gbm_grid2_model_1  0.02194353160742622
## 2 gbm_grid2_model_0 0.022881018848163435
#h2o.shutdown()

Select best performing model.

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and what is estimated.

We identified that the top model is through the lowest MSE score.

best_model_id <- grid_perf@model_ids[[1]]
best_model <- h2o.getModel(best_model_id)
h2o.performance(model = best_model, valid = TRUE)
## H2ORegressionMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  0.02194353
## RMSE:  0.1481335
## MAE:  0.06644435
## RMSLE:  NaN
## Mean Residual Deviance :  0.02194353

Create Final data model based on the best parameters found

# Using the best parameters to run the ML model 
h2o.final <- h2o.gbm(
  x = x,
  y = y,
  training_frame = train,
  nfolds = 5,
  ntrees = 10000,
  learn_rate = 0.05,
  learn_rate_annealing = 0.99,
  max_depth = 5,
  min_rows = 10,
  sample_rate = 0.75,
  col_sample_rate = 1,
  stopping_rounds = 10,
  stopping_tolerance = 0,
  seed = 123
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |===========================================                      |  67%
  |                                                                       
  |======================================================           |  83%
  |                                                                       
  |=================================================================| 100%

Check model’s performance

We found that the MSE score is 0.026 with the current model.

h2o.performance(model =best_model , newdata = as.h2o(test))
## H2ORegressionMetrics: gbm
## 
## MSE:  0.0229929
## RMSE:  0.1516341
## MAE:  0.06721181
## RMSLE:  NaN
## Mean Residual Deviance :  0.0229929

Visualizing variable influence on house prices

After re-running our final model we likely want to understand the variables that have the largest influence on house sale price. We found that variables such as house build, year and tax amount are strongly associated with the house price.

h2o.varimp_plot(h2o.final, num_of_features = 15)

We can now apply the model to our two observations. The results show the predicted value, local model fit (both are relatively poor), and the most influential variables driving the predicted value for each observation.

We used LIME function (LIME is a newer procedure for understanding why a prediction resulted in a given value for a single observation.)

We can now apply to our two observations. The results show the predicted value (Case 1: 0.0141, Case 2: 0.00997).

#install.packages('lime')

 
local_obs <- as.data.frame(test[1:2, ])
local_obs$logerror
## [1] -0.448155306 -0.001193907
explainer <- lime(as.data.frame(train), h2o.final)
explanation <- explain(local_obs, explainer, n_features = 5)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
plot_features(explanation)
## Warning: Unknown or uninitialised column: 'label'.

#h2o.shutdown()

Predicting

h2o.performance(model = h2o.final, newdata = as.h2o(test))
## H2ORegressionMetrics: gbm
## 
## MSE:  0.02295649
## RMSE:  0.151514
## MAE:  0.06699506
## RMSLE:  NaN
## Mean Residual Deviance :  0.02295649

Summary

The MAE score provided by the current model is close to other models published on the Kaggle website. This result can be further improved with grid-search and ensemble methods.