Introduction

Price is one of the most important decision criteria when people buy homes. Real estate firms need to price consistently in order to attract buyers. A predictive model for price is therefore a useful tool, and it can also guide property development by putting more emphasis on the qualities that increase a property's value.

Problem Statement

We have been given two datasets, housing_train.csv and housing_test.csv. We need to use housing_train to build a predictive model for the response variable “Price”. housing_test contains all the other fields except “Price”, which we need to predict using the model we develop.

Setting our working directory

setwd("D:/Edvancer/R Tutorials/R Projects Codes/P1-Real Estate")

Loading the data (Train & test both)

h_train=read.csv("housing_train.csv",stringsAsFactors = F)
h_test=read.csv("housing_test.csv",stringsAsFactors = F)

Data Dictionary

Each row represents the characteristics of a single property. Many categorical values have been coded to mask the data.

Suburb : categorical :: which suburb the property is located in

Address : categorical :: short address

Rooms : numeric :: Number of Rooms

Type : categorical :: type of the property

Price : numeric :: This is the target variable, price of the property

Method : categorical :: method for selling

SellerG : categorical :: Name of the seller

Distance : numeric :: distance from the city center

Postcode : categorical :: postcode of the property

Bedroom2 : numeric :: number of secondary bedrooms (this is different from Rooms)

Bathroom : numeric :: number of bathrooms

Car : numeric :: number of parking spaces

Landsize : numeric :: land size

BuildingArea : numeric :: built-up area

YearBuilt : numeric :: year of building

CouncilArea : categorical :: council area to which the property belongs

Glimpse of our Data

Let us look at the first few records of our train & test data using the ‘head’ function, as shown below. The test data contains every field except the response variable ‘Price’.

head(h_train)

##          Suburb        Address Rooms Type   Price Method SellerG Distance
## 1     Brunswick    52 Evans St     3    h 1650000      S  Nelson      5.2
## 2     Reservoir  85 Radford Rd     5    h  791000      S     Ray     11.2
## 3       Newport 99 Anderson St     3    h  785000      S      RT      8.4
## 4 Brighton East 4/377 South Rd     2    u  755000     SP  Buxton     10.7
## 5 Hawthorn East    3 Jaques St     5    h 2500000     VB      RT      7.5
## 6 Hawthorn East    75 Leura Gr     3    h 3020000      S  Hooper      7.5
##   Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt
## 1     3056        3        1   2      495          141      1920
## 2     3073        4        3   1      961           NA        NA
## 3     3015        3        1   1      185           NA        NA
## 4     3187       NA       NA  NA       NA           NA        NA
## 5     3123        5        3   3      757          240      1925
## 6     3123        3        2   2      832           NA        NA
##   CouncilArea
## 1    Moreland
## 2     Darebin
## 3 Hobsons Bay
## 4            
## 5  Boroondara
## 6  Boroondara

head(h_test)

##       Suburb             Address Rooms Type Method      SellerG Distance
## 1 Abbotsford  6/241 Nicholson St     1    u      S       Biggin      2.5
## 2 Abbotsford 403/609 Victoria St     2    u      S       Dingle      2.5
## 3 Abbotsford   106/119 Turner St     1    u     SP Purplebricks      2.5
## 4 Abbotsford          22 Park St     4    h      S       Biggin      2.5
## 5 Abbotsford         78 Yarra St     3    h      S       LITTLE      2.5
## 6 Abbotsford  13/11 Nicholson St     3    t      S       Beller      2.5
##   Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt
## 1     3067        1        1   1        0           NA        NA
## 2     3067       NA       NA  NA       NA           NA        NA
## 3     3067       NA       NA  NA       NA           NA        NA
## 4     3067       NA       NA  NA       NA           NA        NA
## 5     3067        2        1   1      138          105      1890
## 6     3067        3        2   2        0           NA      2010
##   CouncilArea
## 1       Yarra
## 2            
## 3            
## 4            
## 5       Yarra
## 6       Yarra

To see each column together with its data type, we use the function glimpse from the package ‘dplyr’.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

glimpse(h_train)

## Observations: 7,536
## Variables: 16
## $ Suburb       <chr> "Brunswick", "Reservoir", "Newport", "Brighton Ea...
## $ Address      <chr> "52 Evans St", "85 Radford Rd", "99 Anderson St",...
## $ Rooms        <int> 3, 5, 3, 2, 5, 3, 3, 3, 4, 2, 3, 2, 2, 2, 4, 3, 2...
## $ Type         <chr> "h", "h", "h", "u", "h", "h", "h", "h", "h", "u",...
## $ Price        <int> 1650000, 791000, 785000, 755000, 2500000, 3020000...
## $ Method       <chr> "S", "S", "S", "SP", "VB", "S", "VB", "VB", "PI",...
## $ SellerG      <chr> "Nelson", "Ray", "RT", "Buxton", "RT", "Hooper", ...
## $ Distance     <dbl> 5.2, 11.2, 8.4, 10.7, 7.5, 7.5, 13.9, 11.2, 12.8,...
## $ Postcode     <int> 3056, 3073, 3015, 3187, 3123, 3123, 3165, 3127, 3...
## $ Bedroom2     <int> 3, 4, 3, NA, 5, 3, 3, 3, NA, 2, NA, 2, 2, 2, 4, N...
## $ Bathroom     <int> 1, 3, 1, NA, 3, 2, 1, 2, NA, 2, NA, 1, 2, 1, 2, N...
## $ Car          <int> 2, 1, 1, NA, 3, 2, 1, 4, NA, 2, NA, 1, 1, 1, 1, N...
## $ Landsize     <int> 495, 961, 185, NA, 757, 832, 710, 816, NA, 0, NA,...
## $ BuildingArea <int> 141, NA, NA, NA, 240, NA, NA, NA, NA, 80, NA, 69,...
## $ YearBuilt    <int> 1920, NA, NA, NA, 1925, NA, 1966, NA, NA, 2003, N...
## $ CouncilArea  <chr> "Moreland", "Darebin", "Hobsons Bay", "", "Boroon...

glimpse(h_test)

## Observations: 1,885
## Variables: 15
## $ Suburb       <chr> "Abbotsford", "Abbotsford", "Abbotsford", "Abbots...
## $ Address      <chr> "6/241 Nicholson St", "403/609 Victoria St", "106...
## $ Rooms        <int> 1, 2, 1, 4, 3, 3, 3, 1, 1, 2, 3, 1, 3, 2, 3, 3, 2...
## $ Type         <chr> "u", "u", "u", "h", "h", "t", "u", "u", "u", "h",...
## $ Method       <chr> "S", "S", "SP", "S", "S", "S", "S", "S", "SP", "S...
## $ SellerG      <chr> "Biggin", "Dingle", "Purplebricks", "Biggin", "LI...
## $ Distance     <dbl> 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5,...
## $ Postcode     <int> 3067, 3067, 3067, 3067, 3067, 3067, 3067, 3067, 3...
## $ Bedroom2     <int> 1, NA, NA, NA, 2, 3, 3, NA, 1, 2, NA, 1, 3, NA, 3...
## $ Bathroom     <int> 1, NA, NA, NA, 1, 2, 2, NA, 1, 2, NA, 1, 2, NA, 2...
## $ Car          <int> 1, NA, NA, NA, 1, 2, 2, NA, 1, 2, NA, 1, 1, NA, 2...
## $ Landsize     <int> 0, NA, NA, NA, 138, 0, 4290, NA, 0, 98, NA, 0, 12...
## $ BuildingArea <int> NA, NA, NA, NA, 105, NA, 27, NA, NA, 128, NA, 50,...
## $ YearBuilt    <int> NA, NA, NA, NA, 1890, 2010, NA, NA, NA, 1920, NA,...
## $ CouncilArea  <chr> "Yarra", "", "", "", "Yarra", "Yarra", "Yarra", "...

Data Preparation

We’ll combine our two datasets so that we do not need to prepare them separately. Before combining them, we need to add the response column to test, because the two datasets must have the same columns to stack vertically. We will also add an identifier column to both train & test so that we can separate them again after data preparation.

h_test$Price=NA

h_train$data="train"
h_test$data="test"

h=rbind(h_train,h_test)
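The stacking step can be tried on a toy pair of data frames first. The sketch below uses made-up frames `toy_train` and `toy_test` (hypothetical names and values, not the housing data) to show why the missing column has to be added before `rbind`, and how the `data` flag lets the two parts be separated again.

```r
# Toy stand-ins for the train and test sets (values are made up)
toy_train <- data.frame(Rooms = c(3, 5), Price = c(1650000, 791000))
toy_test  <- data.frame(Rooms = c(1, 2))

# Column present in train but absent from test
missing_in_test <- setdiff(names(toy_train), names(toy_test))
missing_in_test

# Add the missing response column, then flag the origin of every row
toy_test$Price <- NA
toy_train$data <- "train"
toy_test$data  <- "test"

# Both frames now have the same columns, so they can be stacked
toy_all <- rbind(toy_train, toy_test)

# The flag recovers the original split
table(toy_all$data)
```

The NA values in `Price` mark exactly the rows that came from the test set, which is why the NA count for Price in the combined data equals the number of test rows.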

Let us now glimpse our combined data set ‘h’.

glimpse(h)

## Observations: 9,421
## Variables: 17
## $ Suburb       <chr> "Brunswick", "Reservoir", "Newport", "Brighton Ea...
## $ Address      <chr> "52 Evans St", "85 Radford Rd", "99 Anderson St",...
## $ Rooms        <int> 3, 5, 3, 2, 5, 3, 3, 3, 4, 2, 3, 2, 2, 2, 4, 3, 2...
## $ Type         <chr> "h", "h", "h", "u", "h", "h", "h", "h", "h", "u",...
## $ Price        <int> 1650000, 791000, 785000, 755000, 2500000, 3020000...
## $ Method       <chr> "S", "S", "S", "SP", "VB", "S", "VB", "VB", "PI",...
## $ SellerG      <chr> "Nelson", "Ray", "RT", "Buxton", "RT", "Hooper", ...
## $ Distance     <dbl> 5.2, 11.2, 8.4, 10.7, 7.5, 7.5, 13.9, 11.2, 12.8,...
## $ Postcode     <int> 3056, 3073, 3015, 3187, 3123, 3123, 3165, 3127, 3...
## $ Bedroom2     <int> 3, 4, 3, NA, 5, 3, 3, 3, NA, 2, NA, 2, 2, 2, 4, N...
## $ Bathroom     <int> 1, 3, 1, NA, 3, 2, 1, 2, NA, 2, NA, 1, 2, 1, 2, N...
## $ Car          <int> 2, 1, 1, NA, 3, 2, 1, 4, NA, 2, NA, 1, 1, 1, 1, N...
## $ Landsize     <int> 495, 961, 185, NA, 757, 832, 710, 816, NA, 0, NA,...
## $ BuildingArea <int> 141, NA, NA, NA, 240, NA, NA, NA, NA, 80, NA, 69,...
## $ YearBuilt    <int> 1920, NA, NA, NA, 1925, NA, 1966, NA, NA, 2003, N...
## $ CouncilArea  <chr> "Moreland", "Darebin", "Hobsons Bay", "", "Boroon...
## $ data         <chr> "train", "train", "train", "train", "train", "tra...

From the output above we can see many missing values (NAs) in the data. Let us count the NAs in each variable. We will use the ‘lapply’ function, which returns the results as a list, as shown below.

lapply(h,function(x) sum(is.na(x)))

## $Suburb
## [1] 0
## 
## $Address
## [1] 0
## 
## $Rooms
## [1] 0
## 
## $Type
## [1] 0
## 
## $Price
## [1] 1885
## 
## $Method
## [1] 0
## 
## $SellerG
## [1] 0
## 
## $Distance
## [1] 0
## 
## $Postcode
## [1] 0
## 
## $Bedroom2
## [1] 1978
## 
## $Bathroom
## [1] 1978
## 
## $Car
## [1] 1978
## 
## $Landsize
## [1] 1985
## 
## $BuildingArea
## [1] 5269
## 
## $YearBuilt
## [1] 4660
## 
## $CouncilArea
## [1] 0
## 
## $data
## [1] 0

From above we can see that we have missing values in columns like Price(1885), Bedroom2 (1978), Bathroom (1978), Car(1978), Landsize(1985), BuildingArea (5269) & YearBuilt(4660).
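As a more compact alternative to the list output, the same counts can be collected in a named vector. This is a minimal sketch on a made-up data frame `toy` (hypothetical, not the housing data); it relies only on base R's `is.na` and `colSums`.

```r
# Toy data frame with a few missing values (made-up numbers)
toy <- data.frame(
  Rooms    = c(3, 5, 2, 4),
  Bathroom = c(1, NA, NA, 2),
  Landsize = c(495, NA, 185, NA)
)

# is.na() returns a logical matrix; colSums() gives one NA count per column
na_counts <- colSums(is.na(toy))

# Keep only the columns that actually contain NAs
na_counts[na_counts > 0]
```

On a wide data frame this prints one short line instead of a long list, which makes the affected columns easier to spot.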

The NAs in Price are there on purpose: we added the Price column to the test data set ourselves and filled it with NA. The NAs in the remaining columns need to be imputed with a measure of central tendency such as the mean or median. We will do this later as we progress.

Transforming character variables into Dummy Variables

Categorical variables can hide a lot of interesting information in a data set, so it is important to find them and extract as much of that information as we can. One way to do so is to convert them into dummy variables.

Let us analyse our first categorical variable ‘Suburb’ with a frequency table, which lists each category of the variable together with its frequency, as shown below. For example, there are 52 observations which have Abbotsford as Suburb.

table(h$Suburb)

## 
##         Abbotsford         Aberfeldie       Airport West 
##                 52                 29                 65 
##        Albert Park             Albion         Alphington 
##                 44                 26                 29 
##             Altona       Altona North           Armadale 
##                 51                 71                 71 
##         Ascot Vale          Ashburton            Ashwood 
##                103                 62                 45 
##   Avondale Heights          Balaclava             Balwyn 
##                 69                 18                112 
##       Balwyn North          Bellfield          Bentleigh 
##                135                 13                127 
##     Bentleigh East           Box Hill          Braybrook 
##                241                 44                 38 
##           Brighton      Brighton East           Brooklyn 
##                150                130                 15 
##          Brunswick     Brunswick East     Brunswick West 
##                164                 63                 88 
##            Bulleen            Burnley            Burwood 
##                 66                  5                 77 
##         Camberwell      Campbellfield         Canterbury 
##                118                  6                 39 
##            Carlton      Carlton North           Carnegie 
##                 38                 32                120 
##          Caulfield     Caulfield East    Caulfield North 
##                  9                 10                 28 
##    Caulfield South          Chadstone       Clifton Hill 
##                 37                 39                 40 
##             Coburg       Coburg North        Collingwood 
##                138                 57                 48 
##           Cremorne          Docklands          Doncaster 
##                 20                  5                102 
##          Eaglemont     East Melbourne        Elsternwick 
##                 19                 19                 47 
##             Elwood           Essendon     Essendon North 
##                 96                157                 13 
##      Essendon West          Fairfield            Fawkner 
##                 21                 36                 70 
##            Fitzroy      Fitzroy North         Flemington 
##                 37                 65                 41 
##          Footscray         Gardenvale        Glen Huntly 
##                103                  6                 25 
##          Glen Iris            Glenroy          Gowanbrae 
##                152                150                 21 
##           Hadfield            Hampton       Hampton East 
##                 51                111                 36 
##           Hawthorn      Hawthorn East         Heidelberg 
##                132                 90                 36 
## Heidelberg Heights    Heidelberg West         Hughesdale 
##                 58                 55                 30 
##            Ivanhoe       Ivanhoe East             Jacana 
##                 81                 19                 17 
##             Kealba        Keilor East        Keilor Park 
##                 17                 89                 16 
##         Kensington                Kew           Kew East 
##                 92                148                 41 
##          Kingsbury         Kingsville            Kooyong 
##                 13                 27                  3 
##          Maidstone            Malvern       Malvern East 
##                 64                 58                114 
##        Maribyrnong          Melbourne        Middle Park 
##                100                 78                 20 
##        Mont Albert       Moonee Ponds          Moorabbin 
##                 35                107                 59 
##        Murrumbeena            Newport            Niddrie 
##                 57                101                 63 
##    North Melbourne          Northcote           Oak Park 
##                 55                145                 49 
##           Oakleigh     Oakleigh South             Ormond 
##                 36                 55                 67 
##          Parkville        Pascoe Vale     Port Melbourne 
##                 28                129                126 
##            Prahran            Preston       Princes Hill 
##                 90                189                  3 
##          Reservoir           Richmond          Ripponlea 
##                337                215                  9 
##            Rosanna           Seaholme             Seddon 
##                 62                  7                 44 
##   South Kingsville    South Melbourne        South Yarra 
##                 16                 59                164 
##          Southbank          Spotswood           St Kilda 
##                 32                 24                169 
##         Strathmore Strathmore Heights           Sunshine 
##                 67                  9                 92 
##     Sunshine North      Sunshine West       Surrey Hills 
##                 68                 80                 90 
##  Templestowe Lower          Thornbury             Toorak 
##                 86                107                 92 
##         Travancore           Viewbank           Watsonia 
##                  6                 34                 35 
##     West Footscray     West Melbourne       Williamstown 
##                 68                 20                 77 
## Williamstown North            Windsor          Yallambie 
##                 14                 45                 25 
##         Yarraville 
##                111

Writing a Function to create Dummy

We will write a function which will be applied to all categorical variables so as to convert them into dummies.

CreateDummies=function(data,var,freq_cutoff=0){
  t=table(data[,var])
  t=t[t>freq_cutoff]
  t=sort(t)
  categories=names(t)[-1]
  for( cat in categories){
    name=paste(var,cat,sep="_")
    name=gsub(" ","",name)
    name=gsub("-","_",name)
    name=gsub("\\?","Q",name)
    name=gsub("<","LT_",name)
    name=gsub("\\+","",name)
    name=gsub("\\/","_",name)
    name=gsub(">","GT_",name)
    name=gsub("=","EQ_",name)
    name=gsub(",","",name)
    data[,name]=as.numeric(data[,var]==cat)
  }
  data[,var]=NULL
  return(data)
}

Let me explain the function ‘CreateDummies’ we just created

t=table(data[,var]) creates a frequency table for the given categorical column. t is now simply a table whose names are the categories of the variable and whose values are their frequencies in the data.

t=t[t>freq_cutoff] removes from the table those categories whose frequency is at or below the cutoff, since the comparison is strict. (The cutoff is a subjective choice.)

t=sort(t) simply sorts the remaining table in ascending order.

categories=names(t)[-1] Since we sorted the table in ascending order in the previous line, the first category here has the lowest count. This line takes all category names except that first one, so n-1 dummies are made from the n remaining categories.

name=paste(var,cat,sep="_") Every dummy variable we intend to make needs a name. This line creates that name by joining the variable name and the category name with an underscore.

name=gsub(" ","",name) This and the subsequent gsub lines clean up the name. Since we have no control over what the categories can be, we remove or replace spaces and special characters in the name in an automated fashion.

data[,name]=as.numeric(data[,var]==cat) Once we have a cleaned-up name, this line creates the dummy variable for that particular category.

data[,var]=NULL Once the for loop has created all the dummies, this line removes the original variable from the data.
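To see the steps above in action, the function can be run on a tiny made-up data frame. In the sketch below the function body is repeated unchanged so that the block runs on its own; the data frame `toy` and its values are hypothetical.

```r
# CreateDummies as defined above, repeated so this block is self-contained
CreateDummies <- function(data, var, freq_cutoff = 0) {
  t <- table(data[, var])
  t <- t[t > freq_cutoff]
  t <- sort(t)
  categories <- names(t)[-1]
  for (cat in categories) {
    name <- paste(var, cat, sep = "_")
    name <- gsub(" ", "", name)
    name <- gsub("-", "_", name)
    name <- gsub("\\?", "Q", name)
    name <- gsub("<", "LT_", name)
    name <- gsub("\\+", "", name)
    name <- gsub("\\/", "_", name)
    name <- gsub(">", "GT_", name)
    name <- gsub("=", "EQ_", name)
    name <- gsub(",", "", name)
    data[, name] <- as.numeric(data[, var] == cat)
  }
  data[, var] <- NULL
  return(data)
}

# Toy data: "h" appears 3 times, "u" twice, "t" once (made-up values)
toy <- data.frame(Type  = c("h", "h", "h", "u", "u", "t"),
                  Rooms = c(3, 4, 5, 2, 1, 3),
                  stringsAsFactors = FALSE)

# With a cutoff of 1, "t" is dropped for being too rare, and "u" (the least
# frequent survivor) becomes the baseline, so only Type_h is created
toy_d <- CreateDummies(toy, "Type", freq_cutoff = 1)
names(toy_d)
toy_d$Type_h
```

Note that rows of the dropped category "t" and of the baseline "u" both end up with 0 in every dummy, so after encoding they can no longer be told apart. That pooling is the price of the frequency cutoff.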

Let us list our categorical variables with the following line of code

names(h)[sapply(h,function(x) is.character(x))]

## [1] "Suburb"      "Address"     "Type"        "Method"      "SellerG"    
## [6] "CouncilArea" "data"

Now we will check the categorical variables for high cardinality, i.e. for variables with many distinct values. Standard dummy encoding adds one column per category, so a high-cardinality variable can inflate the dimensionality of the data to the point where the modelling technique cannot process it. Even a regularized linear technique that copes with huge dimensions would then yield a model with thousands or even millions of features, losing the comprehensibility that is often required. We will therefore discard the variable with the highest cardinality and, for the others, create dummies only for sufficiently frequent categories.

Checking High-Cardinality in our categorical variables

length(unique(h$Suburb))

## [1] 142

length(unique(h$Address))

## [1] 9324

length(unique(h$Type))

## [1] 3

length(unique(h$SellerG))

## [1] 198

length(unique(h$CouncilArea))

## [1] 20
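The column-by-column checks above can also be done in one pass over all character columns. This is a minimal sketch on a made-up data frame `toy` (hypothetical values); it uses only base R.

```r
# Toy data frame with two character columns and one numeric (made-up values)
toy <- data.frame(
  Suburb  = c("Kew", "Kew", "Coburg", "Preston"),
  Address = c("1 A St", "2 B St", "3 C St", "4 D St"),
  Rooms   = c(3, 2, 4, 3),
  stringsAsFactors = FALSE
)

# Pick out the character columns, then count distinct values in each
char_cols   <- names(toy)[sapply(toy, is.character)]
cardinality <- sapply(toy[char_cols], function(x) length(unique(x)))

# Highest-cardinality columns first
sort(cardinality, decreasing = TRUE)
```

A column whose distinct count is close to the number of rows, like `Address` here, behaves almost like a row identifier and carries little that a dummy encoding could use.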

We will ignore the variable ‘Address’ because of its high cardinality, as shown above. The variable ‘data’ is only our train/test identifier, so we will not create dummies for it.

h=h %>% select(-Address)

Let us make dummies for the rest of the variables using a for loop.

cat_cols=c("Suburb","Type","Method","SellerG","CouncilArea")

for (cat in cat_cols){
  h=CreateDummies(h,cat,100)
}
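The number of dummy columns a variable will contribute can be worked out before running the loop: it is the number of categories whose frequency exceeds the cutoff, minus one for the baseline. The helper `count_dummies` below is a hypothetical illustration of that arithmetic on a made-up vector, not part of the original code.

```r
# Number of dummy columns CreateDummies would add for one variable:
# categories above the cutoff, minus the one used as baseline
count_dummies <- function(x, freq_cutoff = 0) {
  kept <- sum(table(x) > freq_cutoff)
  max(kept - 1, 0)
}

# Toy column: "a" x4, "b" x3, "c" x1 (made-up values)
toy_col <- c("a", "a", "a", "a", "b", "b", "b", "c")

n_all  <- count_dummies(toy_col, freq_cutoff = 0)  # all three kept, one is baseline
n_cut  <- count_dummies(toy_col, freq_cutoff = 2)  # "c" dropped, one of a/b is baseline
n_none <- count_dummies(toy_col, freq_cutoff = 10) # nothing survives the cutoff
c(n_all, n_cut, n_none)
```

Raising the cutoff is therefore a direct way to trade detail in the categorical variables for a smaller, more manageable set of columns.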

This dummy creation has increased the number of variables to 85 (the 11 remaining original columns plus 31 Suburb, 2 Type, 3 Method, 21 SellerG and 17 CouncilArea dummies), as the glimpse below shows.

glimpse(h)

## Observations: 9,421
## Variables: 85
## $ Rooms                    <int> 3, 5, 3, 2, 5, 3, 3, 3, 4, 2, 3, 2, 2...
## $ Price                    <int> 1650000, 791000, 785000, 755000, 2500...
## $ Distance                 <dbl> 5.2, 11.2, 8.4, 10.7, 7.5, 7.5, 13.9,...
## $ Postcode                 <int> 3056, 3073, 3015, 3187, 3123, 3123, 3...
## $ Bedroom2                 <int> 3, 4, 3, NA, 5, 3, 3, 3, NA, 2, NA, 2...
## $ Bathroom                 <int> 1, 3, 1, NA, 3, 2, 1, 2, NA, 2, NA, 1...
## $ Car                      <int> 2, 1, 1, NA, 3, 2, 1, 4, NA, 2, NA, 1...
## $ Landsize                 <int> 495, 961, 185, NA, 757, 832, 710, 816...
## $ BuildingArea             <int> 141, NA, NA, NA, 240, NA, NA, NA, NA,...
## $ YearBuilt                <int> 1920, NA, NA, NA, 1925, NA, 1966, NA,...
## $ data                     <chr> "train", "train", "train", "train", "...
## $ Suburb_Doncaster         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_AscotVale         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Footscray         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_MooneePonds       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Thornbury         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Hampton           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Yarraville        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Balwyn            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_MalvernEast       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Camberwell        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Carnegie          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0...
## $ Suburb_PortMelbourne     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Bentleigh         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_PascoeVale        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_BrightonEast      <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Hawthorn          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_BalwynNorth       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Coburg            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Northcote         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Kew               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Brighton          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Glenroy           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_GlenIris          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Essendon          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Brunswick         <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_SouthYarra        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ Suburb_StKilda           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Preston           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Richmond          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_BentleighEast     <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ Suburb_Reservoir         <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Type_u                   <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1...
## $ Type_h                   <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0...
## $ Method_SP                <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Method_PI                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0...
## $ Method_S                 <dbl> 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1...
## $ SellerG_Kay              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Hodges           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_McGrath          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Noel             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Gary             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Jas              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Miles            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Greg             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Sweeney          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_RT               <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Fletchers        <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ SellerG_Woodards         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Brad             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Biggin           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ SellerG_Ray              <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Buxton           <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Marshall         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0...
## $ SellerG_Barry            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_hockingstuart    <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ SellerG_Jellis           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ SellerG_Nelson           <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ CouncilArea_Whitehorse   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ CouncilArea_Manningham   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Brimbank     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_HobsonsBay   <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Bayside      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Melbourne    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Banyule      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_PortPhillip  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Yarra        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Maribyrnong  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Stonnington  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ CouncilArea_GlenEira     <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0...
## $ CouncilArea_Darebin      <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_MooneeValley <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Moreland     <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Boroondara   <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0...
## $ CouncilArea_             <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0...

Imputing Missing Values (NAs)

We saw earlier that our data has missing values. We will impute them with the mean of the train data, skipping the variables data & Price. The for loop below does this.

for(col in names(h)){
  if(sum(is.na(h[,col]))>0 && !(col %in% c("data","Price"))){
    # fill NAs with the column mean computed on the training rows only
    h[is.na(h[,col]),col]=mean(h[h$data=='train',col],na.rm=TRUE)
  }
}
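The effect of the loop on a single column can be checked by hand on a small example. The sketch below uses a made-up data frame `toy` (hypothetical values) with one missing value in each part of the data.

```r
# Toy combined data: two known train values, one NA in train, one NA in test
toy <- data.frame(
  Car  = c(2, 4, NA, NA),
  data = c("train", "train", "train", "test"),
  stringsAsFactors = FALSE
)

# Mean computed from training rows only, ignoring NAs: (2 + 4) / 2 = 3
train_mean <- mean(toy$Car[toy$data == "train"], na.rm = TRUE)

# The same value fills the gaps in both the train and the test rows
toy$Car[is.na(toy$Car)] <- train_mean
toy$Car
```

Using only the training rows to compute the mean keeps information from the test set out of the model-building data. It also explains why integer columns such as Car turn into doubles after imputation.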

Let us see if there are still any NAs in our data using lapply function again.

lapply(h,function(x) sum(is.na(x)))

## $Rooms
## [1] 0
## 
## $Price
## [1] 1885
## 
## $Distance
## [1] 0
## 
## $Postcode
## [1] 0
## 
## $Bedroom2
## [1] 0
## 
## $Bathroom
## [1] 0
## 
## $Car
## [1] 0
## 
## $Landsize
## [1] 0
## 
## $BuildingArea
## [1] 0
## 
## $YearBuilt
## [1] 0
## 
## $data
## [1] 0
## 
## $Suburb_Doncaster
## [1] 0
## 
## $Suburb_AscotVale
## [1] 0
## 
## $Suburb_Footscray
## [1] 0
## 
## $Suburb_MooneePonds
## [1] 0
## 
## $Suburb_Thornbury
## [1] 0
## 
## $Suburb_Hampton
## [1] 0
## 
## $Suburb_Yarraville
## [1] 0
## 
## $Suburb_Balwyn
## [1] 0
## 
## $Suburb_MalvernEast
## [1] 0
## 
## $Suburb_Camberwell
## [1] 0
## 
## $Suburb_Carnegie
## [1] 0
## 
## $Suburb_PortMelbourne
## [1] 0
## 
## $Suburb_Bentleigh
## [1] 0
## 
## $Suburb_PascoeVale
## [1] 0
## 
## $Suburb_BrightonEast
## [1] 0
## 
## $Suburb_Hawthorn
## [1] 0
## 
## $Suburb_BalwynNorth
## [1] 0
## 
## $Suburb_Coburg
## [1] 0
## 
## $Suburb_Northcote
## [1] 0
## 
## $Suburb_Kew
## [1] 0
## 
## $Suburb_Brighton
## [1] 0
## 
## $Suburb_Glenroy
## [1] 0
## 
## $Suburb_GlenIris
## [1] 0
## 
## $Suburb_Essendon
## [1] 0
## 
## $Suburb_Brunswick
## [1] 0
## 
## $Suburb_SouthYarra
## [1] 0
## 
## $Suburb_StKilda
## [1] 0
## 
## $Suburb_Preston
## [1] 0
## 
## $Suburb_Richmond
## [1] 0
## 
## $Suburb_BentleighEast
## [1] 0
## 
## $Suburb_Reservoir
## [1] 0
## 
## $Type_u
## [1] 0
## 
## $Type_h
## [1] 0
## 
## $Method_SP
## [1] 0
## 
## $Method_PI
## [1] 0
## 
## $Method_S
## [1] 0
## 
## $SellerG_Kay
## [1] 0
## 
## $SellerG_Hodges
## [1] 0
## 
## $SellerG_McGrath
## [1] 0
## 
## $SellerG_Noel
## [1] 0
## 
## $SellerG_Gary
## [1] 0
## 
## $SellerG_Jas
## [1] 0
## 
## $SellerG_Miles
## [1] 0
## 
## $SellerG_Greg
## [1] 0
## 
## $SellerG_Sweeney
## [1] 0
## 
## $SellerG_RT
## [1] 0
## 
## $SellerG_Fletchers
## [1] 0
## 
## $SellerG_Woodards
## [1] 0
## 
## $SellerG_Brad
## [1] 0
## 
## $SellerG_Biggin
## [1] 0
## 
## $SellerG_Ray
## [1] 0
## 
## $SellerG_Buxton
## [1] 0
## 
## $SellerG_Marshall
## [1] 0
## 
## $SellerG_Barry
## [1] 0
## 
## $SellerG_hockingstuart
## [1] 0
## 
## $SellerG_Jellis
## [1] 0
## 
## $SellerG_Nelson
## [1] 0
## 
## $CouncilArea_Whitehorse
## [1] 0
## 
## $CouncilArea_Manningham
## [1] 0
## 
## $CouncilArea_Brimbank
## [1] 0
## 
## $CouncilArea_HobsonsBay
## [1] 0
## 
## $CouncilArea_Bayside
## [1] 0
## 
## $CouncilArea_Melbourne
## [1] 0
## 
## $CouncilArea_Banyule
## [1] 0
## 
## $CouncilArea_PortPhillip
## [1] 0
## 
## $CouncilArea_Yarra
## [1] 0
## 
## $CouncilArea_Maribyrnong
## [1] 0
## 
## $CouncilArea_Stonnington
## [1] 0
## 
## $CouncilArea_GlenEira
## [1] 0
## 
## $CouncilArea_Darebin
## [1] 0
## 
## $CouncilArea_MooneeValley
## [1] 0
## 
## $CouncilArea_Moreland
## [1] 0
## 
## $CouncilArea_Boroondara
## [1] 0
## 
## $CouncilArea_
## [1] 0

Data Separation

Now that we are done with data preparation, let us separate the data. We will filter it into train & test, removing the data column from the train data set and both data & the target variable Price from the test data set.

h_train=h %>% filter(data=="train") %>% select(-data)
h_test=h %>% filter(data=="test") %>% select(-data,-Price)

Data Sampling

Next we will randomly break our train data into 2 parts in the ratio 70:30. We will build the model on one part & check its performance on the other.

s=sample(1:nrow(h_train),0.7*nrow(h_train))

h_train1=h_train[s,]
h_train2=h_train[-s,]
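Because `sample` draws at random, the split above changes on every run. A minimal sketch of a reproducible version follows, on a made-up data frame `toy`; the seed value is arbitrary and the names `toy_build` and `toy_validate` are hypothetical.

```r
# Fixing the seed makes sample() return the same indices on every run
set.seed(2)  # arbitrary seed, chosen only for this illustration

toy <- data.frame(id = 1:10, y = (1:10) * 10)

# floor() makes the 70% row count an explicit whole number
n_build <- floor(0.7 * nrow(toy))
s <- sample(seq_len(nrow(toy)), n_build)

toy_build    <- toy[s, ]   # about 70%: used to fit the model
toy_validate <- toy[-s, ]  # the rest: used to check performance

nrow(toy_build)
nrow(toy_validate)
```

Negative indexing with `-s` guarantees that the two parts never share a row and that together they cover the whole data set.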

Let us glimpse our data again to check if we need to convert any data type.

glimpse(h_train1)

## Observations: 5,275
## Variables: 84
## $ Rooms                    <int> 5, 3, 5, 3, 3, 1, 3, 4, 3, 6, 5, 3, 3...
## $ Price                    <int> 2010000, 710000, 1035000, 465500, 134...
## $ Distance                 <dbl> 7.8, 9.4, 13.5, 15.0, 13.0, 6.4, 13.5...
## $ Postcode                 <int> 3124, 3081, 3042, 3021, 3166, 3011, 3...
## $ Bedroom2                 <dbl> 2.78618, 3.00000, 2.78618, 2.78618, 3...
## $ Bathroom                 <dbl> 1.499247, 2.000000, 1.499247, 1.49924...
## $ Car                      <dbl> 1.510624, 4.000000, 1.510624, 1.51062...
## $ Landsize                 <dbl> 452.4478, 1.0000, 452.4478, 452.4478,...
## $ BuildingArea             <dbl> 143.0472, 143.0472, 143.0472, 143.047...
## $ YearBuilt                <dbl> 1961.046, 1961.046, 1961.046, 1961.04...
## $ Suburb_Doncaster         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_AscotVale         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Footscray         <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_MooneePonds       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Thornbury         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Hampton           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Yarraville        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ Suburb_Balwyn            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_MalvernEast       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Camberwell        <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Carnegie          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_PortMelbourne     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Bentleigh         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_PascoeVale        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_BrightonEast      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Hawthorn          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_BalwynNorth       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0...
## $ Suburb_Coburg            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Northcote         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Kew               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Brighton          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Glenroy           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0...
## $ Suburb_GlenIris          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Essendon          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Brunswick         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_SouthYarra        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_StKilda           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Preston           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Richmond          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_BentleighEast     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Reservoir         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ Type_u                   <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
## $ Type_h                   <dbl> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1...
## $ Method_SP                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Method_PI                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ Method_S                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1...
## $ SellerG_Kay              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Hodges           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_McGrath          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ SellerG_Noel             <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ SellerG_Gary             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Jas              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ SellerG_Miles            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Greg             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Sweeney          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_RT               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Fletchers        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ SellerG_Woodards         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Brad             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Biggin           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Ray              <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Buxton           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Marshall         <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Barry            <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_hockingstuart    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Jellis           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Nelson           <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0...
## $ CouncilArea_Whitehorse   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Manningham   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Brimbank     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_HobsonsBay   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Bayside      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Melbourne    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Banyule      <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_PortPhillip  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Yarra        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Maribyrnong  <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1...
## $ CouncilArea_Stonnington  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_GlenEira     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Darebin      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_MooneeValley <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Moreland     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Boroondara   <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ CouncilArea_             <dbl> 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0...

Regression Model Building with Random Forest

randomForest has several parameters; here we tune four of them: mtry, ntree, maxnodes and nodesize. Let us set the parameter values that we want to try out.

param=list(mtry=c(5,10,20,30),
           ntree=c(50,100,200,500),
           maxnodes=c(5,10,15,20),
           nodesize=c(1,2,5,10))

The above leads to 4 x 4 x 4 x 4, i.e. 256, possible combinations. Technically we should check the cross-validated (CV) performance of every combination to find the best one, but that would be overkill in terms of time. With 10-fold cross validation it would mean building 256 x 10 = 2560 random forest models. If each random forest had 100 trees on average, that is 256,000 decision trees built internally, which is far too resource-intensive. We also do not really need to try every possible combination. Instead, we can randomly select a much smaller subset to try.
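The counts in the paragraph above can be checked directly from the parameter list. This is a small sketch in base R; n_comb, n_models and n_trees are illustrative names, and the figure of 100 trees per forest is the assumption made in the text, not a property of the grid.

```r
# Parameter grid as defined in the text
param <- list(mtry     = c(5, 10, 20, 30),
              ntree    = c(50, 100, 200, 500),
              maxnodes = c(5, 10, 15, 20),
              nodesize = c(1, 2, 5, 10))

n_comb   <- prod(lengths(param))   # 4 * 4 * 4 * 4 combinations
n_models <- n_comb * 10            # one model per fold in 10-fold CV
n_trees  <- n_models * 100         # assuming 100 trees per forest on average

n_comb
n_models
n_trees

# expand.grid() enumerates the same combinations, one per row
nrow(expand.grid(param))
```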

Let's write a function which selects a random subset of combinations.

subset_paras=function(full_list_para,n=10){
  all_comb=expand.grid(full_list_para)
  s=sample(1:nrow(all_comb),n)
  subset_para=all_comb[s,]
  return(subset_para)
}

If we pass the list of parameter values (full_list_para) that we want to try and specify the number of combinations (n), the function randomly selects that many combinations out of all possible parameter combinations and returns them as a data frame.

We will set num_trials to 40, which is around 15% of the 256 (4^4) possible combinations; a good value for num_trials is around 10-20% of the total.

num_trials=40
my_params=subset_paras(param,num_trials)
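To see what subset_paras returns, the sketch below repeats the function and the parameter list so that it runs on its own. The seed and the name my_params_demo are illustrative assumptions; without a seed, a different subset is drawn on each run.

```r
# Same parameter grid and helper as in the text, repeated so this block is self-contained
param <- list(mtry     = c(5, 10, 20, 30),
              ntree    = c(50, 100, 200, 500),
              maxnodes = c(5, 10, 15, 20),
              nodesize = c(1, 2, 5, 10))

subset_paras <- function(full_list_para, n = 10) {
  all_comb <- expand.grid(full_list_para)   # every combination, one per row
  s <- sample(1:nrow(all_comb), n)          # n distinct row numbers
  all_comb[s, ]
}

set.seed(1)                                 # arbitrary seed for a repeatable draw
my_params_demo <- subset_paras(param, 40)

dim(my_params_demo)                         # rows = trials, columns = parameters
anyDuplicated(my_params_demo)               # 0 means no combination is repeated
nrow(my_params_demo) / nrow(expand.grid(param))   # share of the full grid
```

Because sample() draws without replacement by default, each of the 40 rows is a different combination.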

We’ll use the cvTuning function from the cvTools package to try out these parameter combinations one by one. We’ll compare the CV error of each and pick the combination with the lowest error as the best.
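As a rough illustration of what 10-fold cross validation does for a single parameter combination, here is a hand-written sketch in base R. It is not cvTuning's internal code: it uses simulated data and lm() as a stand-in for randomForest, and toy, fold_id, pred and cv_rmse are hypothetical names.

```r
# Hand-rolled 10-fold cross validation on simulated data (illustration only)
set.seed(2)                                   # arbitrary seed
toy <- data.frame(x = runif(200))
toy$y <- 3 * toy$x + rnorm(200, sd = 0.1)

K <- 10
# Assign every row to one of K folds at random, 20 rows per fold here
fold_id <- sample(rep(seq_len(K), length.out = nrow(toy)))

pred <- numeric(nrow(toy))
for (k in seq_len(K)) {
  held_out <- fold_id == k
  fit <- lm(y ~ x, data = toy[!held_out, ])   # fit on the other K - 1 folds
  pred[held_out] <- predict(fit, newdata = toy[held_out, ])   # predict the held-out fold
}

# Root mean squared prediction error over all out-of-fold predictions
cv_rmse <- sqrt(mean((toy$y - pred)^2))
cv_rmse
```

Every row is predicted exactly once by a model that never saw it, which is why the resulting error is a fairer estimate than the error on the training rows.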

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(cvTools)

## Loading required package: lattice

## Loading required package: robustbase

In the code below, every time we find a parameter combination with the lowest error so far, it is printed in the output.

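# Lowest CV error seen so far; starting at Inf means the first trial always wins
myerror=Inf

for(i in 1:num_trials){
  print(paste0('starting iteration:',i))

  # i-th randomly chosen parameter combination
  params=my_params[i,]
  # 10-fold cross validation of randomForest with these parameters
  k=cvTuning(randomForest,Price~.,
             data =h_train,
             tuning =params,
             folds = cvFolds(nrow(h_train), K=10, type = "random"),
             seed =2
  )
  # CV error for this combination
  score.this=k$cv[,2]
  if(score.this<myerror){
    print(params)

    myerror=score.this
    print(myerror)

    best_params=params
  }
  print('DONE')
}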

## [1] "starting iteration:1"
##    mtry ntree maxnodes nodesize
## 49    5    50       20        1
## [1] 478453
## [1] "DONE"
## [1] "starting iteration:2"
## [1] "DONE"
## [1] "starting iteration:3"
## [1] "DONE"
## [1] "starting iteration:4"
##     mtry ntree maxnodes nodesize
## 227   20    50       15       10
## [1] 426611
## [1] "DONE"
## [1] "starting iteration:5"
## [1] "DONE"
## [1] "starting iteration:6"
## [1] "DONE"
## [1] "starting iteration:7"
## [1] "DONE"
## [1] "starting iteration:8"
##     mtry ntree maxnodes nodesize
## 256   30   500       20       10
## [1] 404771
## [1] "DONE"
## [1] "starting iteration:9"
## [1] "DONE"
## [1] "starting iteration:10"
## [1] "DONE"
## [1] "starting iteration:11"
## [1] "DONE"
## [1] "starting iteration:12"
## [1] "DONE"
## [1] "starting iteration:13"
## [1] "DONE"
## [1] "starting iteration:14"
## [1] "DONE"
## [1] "starting iteration:15"
## [1] "DONE"
## [1] "starting iteration:16"
## [1] "DONE"
## [1] "starting iteration:17"
## [1] "DONE"
## [1] "starting iteration:18"
## [1] "DONE"
## [1] "starting iteration:19"
## [1] "DONE"
## [1] "starting iteration:20"
## [1] "DONE"
## [1] "starting iteration:21"
## [1] "DONE"
## [1] "starting iteration:22"
## [1] "DONE"
## [1] "starting iteration:23"
## [1] "DONE"
## [1] "starting iteration:24"
## [1] "DONE"
## [1] "starting iteration:25"
## [1] "DONE"
## [1] "starting iteration:26"
## [1] "DONE"
## [1] "starting iteration:27"
## [1] "DONE"
## [1] "starting iteration:28"
## [1] "DONE"
## [1] "starting iteration:29"
## [1] "DONE"
## [1] "starting iteration:30"
## [1] "DONE"
## [1] "starting iteration:31"
## [1] "DONE"
## [1] "starting iteration:32"
## [1] "DONE"
## [1] "starting iteration:33"
## [1] "DONE"
## [1] "starting iteration:34"
## [1] "DONE"
## [1] "starting iteration:35"
## [1] "DONE"
## [1] "starting iteration:36"
## [1] "DONE"
## [1] "starting iteration:37"
## [1] "DONE"
## [1] "starting iteration:38"
## [1] "DONE"
## [1] "starting iteration:39"
## [1] "DONE"
## [1] "starting iteration:40"
## [1] "DONE"

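The tentative performance measure is the latest value of myerror, i.e. the lowest CV error found (with cvTuning's default cost function this is the root mean squared prediction error).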

myerror

## [1] 404771

Our best_params comes out to be

best_params

##     mtry ntree maxnodes nodesize
## 256   30   500       20       10
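The 70:30 split from the Data Sampling step can also serve as a hold-out check before the final model is built. The sketch below shows the idea on simulated data, with lm() standing in for randomForest; rmse, toy, part1, part2 and holdout_rmse are hypothetical names and not part of the original workflow.

```r
# Hold-out check on simulated data (lm() is a stand-in for randomForest)
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))          # root mean squared error
}

set.seed(7)                                   # arbitrary seed
toy <- data.frame(x = runif(300))
toy$y <- 5 + 2 * toy$x + rnorm(300, sd = 0.2)

s <- sample(seq_len(nrow(toy)), round(0.7 * nrow(toy)))
part1 <- toy[s, ]                             # 70%: build the model here
part2 <- toy[-s, ]                            # 30%: never seen during fitting

fit <- lm(y ~ x, data = part1)
holdout_rmse <- rmse(part2$y, predict(fit, newdata = part2))
holdout_rmse
```

With the housing data, the same pattern would fit the model on h_train1 using the chosen parameters and compare its predictions with h_train2$Price.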

Model Building

Now we have the best parameter values from cross validation. We’ll use these values to build our random forest model on the entire training data and use it for prediction on the test data.

h.rf.final=randomForest(Price~.,
                         mtry=best_params$mtry,
                         ntree=best_params$ntree,
                         maxnodes=best_params$maxnodes,
                         nodesize=best_params$nodesize,
                         data=h_train)

Prediction on Test data set

test.pred=predict(h.rf.final,newdata = h_test)

To see the first few predicted property prices for the test data set, we use the head function as shown below.

head(test.pred)

##         1         2         3         4         5         6 
##  452407.0  617743.2  555290.8 1293809.2 1068916.5  961300.0

Real Estate

Rajib Achari

22 September 2019