The price of a property is one of the most important decision criteria when people buy homes. Real estate firms need to be consistent in their pricing in order to attract buyers. A predictive model for price would be a great tool to have, and it could also be used to guide property development by putting more emphasis on the qualities that increase a property's value.
We have been given two datasets, housing_train.csv and housing_test.csv. We will use housing_train to build a predictive model for the response variable "Price". The housing_test data contains all other variables except "Price", which we will predict using the model we develop.
setwd("D:/Edvancer/R Tutorials/R Projects Codes/P1-Real Estate")
h_train=read.csv("housing_train.csv",stringsAsFactors = F)
h_test=read.csv("housing_test.csv",stringsAsFactors = F)
Each row represents the characteristics of a single property. Many of the categorical variables have been coded to mask the data.
Suburb : categorical :: Which subsurb the property is located in
Address : categorical :: short address
Rooms : numeric :: Number of Rooms
Type : categorical :: type of the property
Price : numeric :: This is the target variable, price of the property
Method : categorical :: method for selling
SellerG : categorical :: Name of the seller
Distance : numeric :: distance from the city center
Postcode : categorical :: postcode of the property
Bedroom2 : numeric :: number of secondary bedrooms (this is different from Rooms)
Bathroom : numeric :: number of bathrooms
Car : numeric :: number of parking spaces
Landsize : numeric :: landsize
BuildingArea : numeric :: built-up area
YearBuilt : numeric :: year the property was built
CouncilArea : categorical :: council area to which the property belongs
Let us see the first few records of our train & test data using the ‘head’ function as shown below. It is evident that our test data contains every field except the response variable ‘Price’.
head(h_train)
## Suburb Address Rooms Type Price Method SellerG Distance
## 1 Brunswick 52 Evans St 3 h 1650000 S Nelson 5.2
## 2 Reservoir 85 Radford Rd 5 h 791000 S Ray 11.2
## 3 Newport 99 Anderson St 3 h 785000 S RT 8.4
## 4 Brighton East 4/377 South Rd 2 u 755000 SP Buxton 10.7
## 5 Hawthorn East 3 Jaques St 5 h 2500000 VB RT 7.5
## 6 Hawthorn East 75 Leura Gr 3 h 3020000 S Hooper 7.5
## Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt
## 1 3056 3 1 2 495 141 1920
## 2 3073 4 3 1 961 NA NA
## 3 3015 3 1 1 185 NA NA
## 4 3187 NA NA NA NA NA NA
## 5 3123 5 3 3 757 240 1925
## 6 3123 3 2 2 832 NA NA
## CouncilArea
## 1 Moreland
## 2 Darebin
## 3 Hobsons Bay
## 4
## 5 Boroondara
## 6 Boroondara
head(h_test)
## Suburb Address Rooms Type Method SellerG Distance
## 1 Abbotsford 6/241 Nicholson St 1 u S Biggin 2.5
## 2 Abbotsford 403/609 Victoria St 2 u S Dingle 2.5
## 3 Abbotsford 106/119 Turner St 1 u SP Purplebricks 2.5
## 4 Abbotsford 22 Park St 4 h S Biggin 2.5
## 5 Abbotsford 78 Yarra St 3 h S LITTLE 2.5
## 6 Abbotsford 13/11 Nicholson St 3 t S Beller 2.5
## Postcode Bedroom2 Bathroom Car Landsize BuildingArea YearBuilt
## 1 3067 1 1 1 0 NA NA
## 2 3067 NA NA NA NA NA NA
## 3 3067 NA NA NA NA NA NA
## 4 3067 NA NA NA NA NA NA
## 5 3067 2 1 1 138 105 1890
## 6 3067 3 2 2 0 NA 2010
## CouncilArea
## 1 Yarra
## 2
## 3
## 4
## 5 Yarra
## 6 Yarra
To get an overview of the data and its data types, we use the ‘glimpse’ function from the package ‘dplyr’.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(h_train)
## Observations: 7,536
## Variables: 16
## $ Suburb <chr> "Brunswick", "Reservoir", "Newport", "Brighton Ea...
## $ Address <chr> "52 Evans St", "85 Radford Rd", "99 Anderson St",...
## $ Rooms <int> 3, 5, 3, 2, 5, 3, 3, 3, 4, 2, 3, 2, 2, 2, 4, 3, 2...
## $ Type <chr> "h", "h", "h", "u", "h", "h", "h", "h", "h", "u",...
## $ Price <int> 1650000, 791000, 785000, 755000, 2500000, 3020000...
## $ Method <chr> "S", "S", "S", "SP", "VB", "S", "VB", "VB", "PI",...
## $ SellerG <chr> "Nelson", "Ray", "RT", "Buxton", "RT", "Hooper", ...
## $ Distance <dbl> 5.2, 11.2, 8.4, 10.7, 7.5, 7.5, 13.9, 11.2, 12.8,...
## $ Postcode <int> 3056, 3073, 3015, 3187, 3123, 3123, 3165, 3127, 3...
## $ Bedroom2 <int> 3, 4, 3, NA, 5, 3, 3, 3, NA, 2, NA, 2, 2, 2, 4, N...
## $ Bathroom <int> 1, 3, 1, NA, 3, 2, 1, 2, NA, 2, NA, 1, 2, 1, 2, N...
## $ Car <int> 2, 1, 1, NA, 3, 2, 1, 4, NA, 2, NA, 1, 1, 1, 1, N...
## $ Landsize <int> 495, 961, 185, NA, 757, 832, 710, 816, NA, 0, NA,...
## $ BuildingArea <int> 141, NA, NA, NA, 240, NA, NA, NA, NA, 80, NA, 69,...
## $ YearBuilt <int> 1920, NA, NA, NA, 1925, NA, 1966, NA, NA, 2003, N...
## $ CouncilArea <chr> "Moreland", "Darebin", "Hobsons Bay", "", "Boroon...
glimpse(h_test)
## Observations: 1,885
## Variables: 15
## $ Suburb <chr> "Abbotsford", "Abbotsford", "Abbotsford", "Abbots...
## $ Address <chr> "6/241 Nicholson St", "403/609 Victoria St", "106...
## $ Rooms <int> 1, 2, 1, 4, 3, 3, 3, 1, 1, 2, 3, 1, 3, 2, 3, 3, 2...
## $ Type <chr> "u", "u", "u", "h", "h", "t", "u", "u", "u", "h",...
## $ Method <chr> "S", "S", "SP", "S", "S", "S", "S", "S", "SP", "S...
## $ SellerG <chr> "Biggin", "Dingle", "Purplebricks", "Biggin", "LI...
## $ Distance <dbl> 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5,...
## $ Postcode <int> 3067, 3067, 3067, 3067, 3067, 3067, 3067, 3067, 3...
## $ Bedroom2 <int> 1, NA, NA, NA, 2, 3, 3, NA, 1, 2, NA, 1, 3, NA, 3...
## $ Bathroom <int> 1, NA, NA, NA, 1, 2, 2, NA, 1, 2, NA, 1, 2, NA, 2...
## $ Car <int> 1, NA, NA, NA, 1, 2, 2, NA, 1, 2, NA, 1, 1, NA, 2...
## $ Landsize <int> 0, NA, NA, NA, 138, 0, 4290, NA, 0, 98, NA, 0, 12...
## $ BuildingArea <int> NA, NA, NA, NA, 105, NA, 27, NA, NA, 128, NA, 50,...
## $ YearBuilt <int> NA, NA, NA, NA, 1890, 2010, NA, NA, NA, 1920, NA,...
## $ CouncilArea <chr> "Yarra", "", "", "", "Yarra", "Yarra", "Yarra", "...
We’ll combine our two datasets so that we do not need to prepare the data separately for each. However, before combining them, we need to add the response column to the test data, because the two datasets must have the same columns to be stacked vertically. We will also add an identifier column to both train & test data sets so that we can separate them again after data preparation.
h_test$Price=NA
h_train$data="train"
h_test$data="test"
h=rbind(h_train,h_test)
Let us now glimpse our combined dataset ‘h’.
glimpse(h)
## Observations: 9,421
## Variables: 17
## $ Suburb <chr> "Brunswick", "Reservoir", "Newport", "Brighton Ea...
## $ Address <chr> "52 Evans St", "85 Radford Rd", "99 Anderson St",...
## $ Rooms <int> 3, 5, 3, 2, 5, 3, 3, 3, 4, 2, 3, 2, 2, 2, 4, 3, 2...
## $ Type <chr> "h", "h", "h", "u", "h", "h", "h", "h", "h", "u",...
## $ Price <int> 1650000, 791000, 785000, 755000, 2500000, 3020000...
## $ Method <chr> "S", "S", "S", "SP", "VB", "S", "VB", "VB", "PI",...
## $ SellerG <chr> "Nelson", "Ray", "RT", "Buxton", "RT", "Hooper", ...
## $ Distance <dbl> 5.2, 11.2, 8.4, 10.7, 7.5, 7.5, 13.9, 11.2, 12.8,...
## $ Postcode <int> 3056, 3073, 3015, 3187, 3123, 3123, 3165, 3127, 3...
## $ Bedroom2 <int> 3, 4, 3, NA, 5, 3, 3, 3, NA, 2, NA, 2, 2, 2, 4, N...
## $ Bathroom <int> 1, 3, 1, NA, 3, 2, 1, 2, NA, 2, NA, 1, 2, 1, 2, N...
## $ Car <int> 2, 1, 1, NA, 3, 2, 1, 4, NA, 2, NA, 1, 1, 1, 1, N...
## $ Landsize <int> 495, 961, 185, NA, 757, 832, 710, 816, NA, 0, NA,...
## $ BuildingArea <int> 141, NA, NA, NA, 240, NA, NA, NA, NA, 80, NA, 69,...
## $ YearBuilt <int> 1920, NA, NA, NA, 1925, NA, 1966, NA, NA, 2003, N...
## $ CouncilArea <chr> "Moreland", "Darebin", "Hobsons Bay", "", "Boroon...
## $ data <chr> "train", "train", "train", "train", "train", "tra...
From the above we can see many missing values (NAs) in the data. Let us see how many NAs there are for each variable. We will use the ‘lapply’ function, which gives the results in list format as shown below.
lapply(h,function(x) sum(is.na(x)))
## $Suburb
## [1] 0
##
## $Address
## [1] 0
##
## $Rooms
## [1] 0
##
## $Type
## [1] 0
##
## $Price
## [1] 1885
##
## $Method
## [1] 0
##
## $SellerG
## [1] 0
##
## $Distance
## [1] 0
##
## $Postcode
## [1] 0
##
## $Bedroom2
## [1] 1978
##
## $Bathroom
## [1] 1978
##
## $Car
## [1] 1978
##
## $Landsize
## [1] 1985
##
## $BuildingArea
## [1] 5269
##
## $YearBuilt
## [1] 4660
##
## $CouncilArea
## [1] 0
##
## $data
## [1] 0
From the above we can see that we have missing values in Price (1885), Bedroom2 (1978), Bathroom (1978), Car (1978), Landsize (1985), BuildingArea (5269) & YearBuilt (4660).
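The same counts can also be viewed more compactly as a named vector by using sapply instead of lapply (a sketch):
# NA count per column, returned as a named vector rather than a list
sapply(h,function(x) sum(is.na(x)))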
Recall that we purposely added the Price column to our test data and filled it with NAs; those will stay untouched, since Price is what we want to predict. The NAs in the remaining variables will be imputed with a measure of central tendency such as the mean or median, which we will do later as we progress.
Categorical variables are known to hide a lot of interesting information in a data set, so it is crucial that we hunt them down and dig out as much information from them as we can. One way to do this is to convert them into dummy variables.
Let us analyse our first categorical variable ‘Suburb’, whose frequency table gives the following result. It is simply a tabulation of each category of the variable along with its frequency, as shown below. For example, there are 52 observations with Abbotsford as the Suburb.
table(h$Suburb)
##
## Abbotsford Aberfeldie Airport West
## 52 29 65
## Albert Park Albion Alphington
## 44 26 29
## Altona Altona North Armadale
## 51 71 71
## Ascot Vale Ashburton Ashwood
## 103 62 45
## Avondale Heights Balaclava Balwyn
## 69 18 112
## Balwyn North Bellfield Bentleigh
## 135 13 127
## Bentleigh East Box Hill Braybrook
## 241 44 38
## Brighton Brighton East Brooklyn
## 150 130 15
## Brunswick Brunswick East Brunswick West
## 164 63 88
## Bulleen Burnley Burwood
## 66 5 77
## Camberwell Campbellfield Canterbury
## 118 6 39
## Carlton Carlton North Carnegie
## 38 32 120
## Caulfield Caulfield East Caulfield North
## 9 10 28
## Caulfield South Chadstone Clifton Hill
## 37 39 40
## Coburg Coburg North Collingwood
## 138 57 48
## Cremorne Docklands Doncaster
## 20 5 102
## Eaglemont East Melbourne Elsternwick
## 19 19 47
## Elwood Essendon Essendon North
## 96 157 13
## Essendon West Fairfield Fawkner
## 21 36 70
## Fitzroy Fitzroy North Flemington
## 37 65 41
## Footscray Gardenvale Glen Huntly
## 103 6 25
## Glen Iris Glenroy Gowanbrae
## 152 150 21
## Hadfield Hampton Hampton East
## 51 111 36
## Hawthorn Hawthorn East Heidelberg
## 132 90 36
## Heidelberg Heights Heidelberg West Hughesdale
## 58 55 30
## Ivanhoe Ivanhoe East Jacana
## 81 19 17
## Kealba Keilor East Keilor Park
## 17 89 16
## Kensington Kew Kew East
## 92 148 41
## Kingsbury Kingsville Kooyong
## 13 27 3
## Maidstone Malvern Malvern East
## 64 58 114
## Maribyrnong Melbourne Middle Park
## 100 78 20
## Mont Albert Moonee Ponds Moorabbin
## 35 107 59
## Murrumbeena Newport Niddrie
## 57 101 63
## North Melbourne Northcote Oak Park
## 55 145 49
## Oakleigh Oakleigh South Ormond
## 36 55 67
## Parkville Pascoe Vale Port Melbourne
## 28 129 126
## Prahran Preston Princes Hill
## 90 189 3
## Reservoir Richmond Ripponlea
## 337 215 9
## Rosanna Seaholme Seddon
## 62 7 44
## South Kingsville South Melbourne South Yarra
## 16 59 164
## Southbank Spotswood St Kilda
## 32 24 169
## Strathmore Strathmore Heights Sunshine
## 67 9 92
## Sunshine North Sunshine West Surrey Hills
## 68 80 90
## Templestowe Lower Thornbury Toorak
## 86 107 92
## Travancore Viewbank Watsonia
## 6 34 35
## West Footscray West Melbourne Williamstown
## 68 20 77
## Williamstown North Windsor Yallambie
## 14 45 25
## Yarraville
## 111
We will write a function which will be applied to all categorical variables so as to convert them into dummies.
CreateDummies=function(data,var,freq_cutoff=0){
  t=table(data[,var])
  t=t[t>freq_cutoff]
  t=sort(t)
  categories=names(t)[-1]
  for(cat in categories){
    name=paste(var,cat,sep="_")
    name=gsub(" ","",name)
    name=gsub("-","_",name)
    name=gsub("\\?","Q",name)
    name=gsub("<","LT_",name)
    name=gsub("\\+","",name)
    name=gsub("\\/","_",name)
    name=gsub(">","GT_",name)
    name=gsub("=","EQ_",name)
    name=gsub(",","",name)
    data[,name]=as.numeric(data[,var]==cat)
  }
  data[,var]=NULL
  return(data)
}
Let me explain the function ‘CreateDummies’ we just created:
t=table(data[,var]) : this creates a frequency table for the given categorical column. t is now simply a table whose names are the categories of the variable and whose values are their frequencies in the data.
t=t[t>freq_cutoff] : this removes from the table those categories whose frequencies are at or below the frequency cutoff (the cutoff itself is a subjective choice).
t=sort(t) : this simply sorts the remaining table in ascending order.
categories=names(t)[-1] : since we sorted the table in ascending order in the previous line, the first category has the lowest count. Here we keep all category names except that first one, so that we create n-1 dummies from the remaining categories.
name=paste(var,cat,sep="_") : each dummy variable we create needs a name. This line builds that name by concatenating the variable name and the category name with an underscore.
name=gsub(" ","",name) : the subsequent gsub lines clean up the name. Since we have no control over what the category labels contain, we remove special characters and spaces in an automated fashion.
data[,name]=as.numeric(data[,var]==cat) : once we have a cleaned-up name, this line creates the dummy variable for that particular category.
data[,var]=NULL : once we are done creating dummies for the variable in the for loop, the original variable is removed from the data.
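To see what the function does, here is a small illustrative run on a toy data frame (a sketch; the column name ‘colour’ and its values are made up for illustration):
# toy data frame with one categorical column (hypothetical values)
toy=data.frame(colour=c("red","red","blue","green","green","green"),
               stringsAsFactors=FALSE)
toy=CreateDummies(toy,"colour",freq_cutoff=0)
toy
# the least frequent category ("blue") is dropped as the base level,
# leaving the dummy columns colour_red and colour_green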
Let us have a look at our categorical variables by writing the following line of code.
names(h)[sapply(h,function(x) is.character(x))]
## [1] "Suburb" "Address" "Type" "Method" "SellerG"
## [6] "CouncilArea" "data"
Now we will check for high cardinality in the categorical variables, i.e. variables with many distinct values, and discard such variables from our modelling. Encoding these attributes with standard dummies inflates the dimensionality of the data to the point where either the modelling technique cannot process them, or, if a regularised technique that copes with huge dimensions is used, we end up with a model with thousands of features and lose the comprehensibility that is often required.
length(unique(h$Suburb))
## [1] 142
length(unique(h$Address))
## [1] 9324
length(unique(h$Type))
## [1] 3
length(unique(h$SellerG))
## [1] 198
length(unique(h$CouncilArea))
## [1] 20
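These per-column checks could also be done in a single pass over all character columns (a sketch):
# number of distinct values for every character column at once
sapply(h[,sapply(h,is.character)],function(x) length(unique(x)))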
We will drop the variable ‘Address’ because of its high cardinality, as shown above. We will also leave the identifier column ‘data’ out of the dummy creation, since it only marks whether a row came from train or test.
h=h %>% select(-Address)
Let us make dummies for the rest of the categorical variables using a for loop, keeping only those categories with more than 100 observations (the frequency cutoff passed to CreateDummies).
cat_cols=c("Suburb","Type","Method","SellerG","CouncilArea")
for (cat in cat_cols){
h=CreateDummies(h,cat,100)
}
This dummy creation has increased the number of variables to 85, as we can see from the glimpse below.
glimpse(h)
## Observations: 9,421
## Variables: 85
## $ Rooms <int> 3, 5, 3, 2, 5, 3, 3, 3, 4, 2, 3, 2, 2...
## $ Price <int> 1650000, 791000, 785000, 755000, 2500...
## $ Distance <dbl> 5.2, 11.2, 8.4, 10.7, 7.5, 7.5, 13.9,...
## $ Postcode <int> 3056, 3073, 3015, 3187, 3123, 3123, 3...
## $ Bedroom2 <int> 3, 4, 3, NA, 5, 3, 3, 3, NA, 2, NA, 2...
## $ Bathroom <int> 1, 3, 1, NA, 3, 2, 1, 2, NA, 2, NA, 1...
## $ Car <int> 2, 1, 1, NA, 3, 2, 1, 4, NA, 2, NA, 1...
## $ Landsize <int> 495, 961, 185, NA, 757, 832, 710, 816...
## $ BuildingArea <int> 141, NA, NA, NA, 240, NA, NA, NA, NA,...
## $ YearBuilt <int> 1920, NA, NA, NA, 1925, NA, 1966, NA,...
## $ data <chr> "train", "train", "train", "train", "...
## $ Suburb_Doncaster <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_AscotVale <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Footscray <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_MooneePonds <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Thornbury <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Hampton <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Yarraville <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Balwyn <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_MalvernEast <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Camberwell <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Carnegie <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0...
## $ Suburb_PortMelbourne <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Bentleigh <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_PascoeVale <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_BrightonEast <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Hawthorn <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_BalwynNorth <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Coburg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Northcote <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Kew <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Brighton <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Glenroy <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_GlenIris <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Essendon <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Brunswick <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_SouthYarra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ Suburb_StKilda <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Preston <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Richmond <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_BentleighEast <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ Suburb_Reservoir <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Type_u <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1...
## $ Type_h <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0...
## $ Method_SP <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Method_PI <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0...
## $ Method_S <dbl> 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1...
## $ SellerG_Kay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Hodges <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_McGrath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Noel <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Gary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Jas <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Miles <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Greg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Sweeney <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_RT <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Fletchers <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ SellerG_Woodards <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Brad <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Biggin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ SellerG_Ray <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Buxton <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Marshall <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0...
## $ SellerG_Barry <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_hockingstuart <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ SellerG_Jellis <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ SellerG_Nelson <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ CouncilArea_Whitehorse <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ CouncilArea_Manningham <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Brimbank <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_HobsonsBay <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Bayside <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Melbourne <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Banyule <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_PortPhillip <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Yarra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Maribyrnong <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Stonnington <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ CouncilArea_GlenEira <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0...
## $ CouncilArea_Darebin <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_MooneeValley <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Moreland <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Boroondara <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0...
## $ CouncilArea_ <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0...
We saw earlier that our data has missing values, which we will impute with the mean of the training data (computed on the training rows only, so that the test set does not leak into the imputation). The columns ‘data’ & ‘Price’ are excluded from this imputation. The for loop below does this.
for(col in names(h)){
  if(sum(is.na(h[,col]))>0 & !(col %in% c("data","Price"))){
    h[is.na(h[,col]),col]=mean(h[h$data=='train',col],na.rm=T)
  }
}
Let us see if there are still any NAs in our data, using the lapply function again.
lapply(h,function(x) sum(is.na(x)))
## $Rooms
## [1] 0
##
## $Price
## [1] 1885
##
## $Distance
## [1] 0
##
## $Postcode
## [1] 0
##
## $Bedroom2
## [1] 0
##
## $Bathroom
## [1] 0
##
## $Car
## [1] 0
##
## $Landsize
## [1] 0
##
## $BuildingArea
## [1] 0
##
## $YearBuilt
## [1] 0
##
## $data
## [1] 0
##
## $Suburb_Doncaster
## [1] 0
##
## $Suburb_AscotVale
## [1] 0
##
## $Suburb_Footscray
## [1] 0
##
## $Suburb_MooneePonds
## [1] 0
##
## $Suburb_Thornbury
## [1] 0
##
## $Suburb_Hampton
## [1] 0
##
## $Suburb_Yarraville
## [1] 0
##
## $Suburb_Balwyn
## [1] 0
##
## $Suburb_MalvernEast
## [1] 0
##
## $Suburb_Camberwell
## [1] 0
##
## $Suburb_Carnegie
## [1] 0
##
## $Suburb_PortMelbourne
## [1] 0
##
## $Suburb_Bentleigh
## [1] 0
##
## $Suburb_PascoeVale
## [1] 0
##
## $Suburb_BrightonEast
## [1] 0
##
## $Suburb_Hawthorn
## [1] 0
##
## $Suburb_BalwynNorth
## [1] 0
##
## $Suburb_Coburg
## [1] 0
##
## $Suburb_Northcote
## [1] 0
##
## $Suburb_Kew
## [1] 0
##
## $Suburb_Brighton
## [1] 0
##
## $Suburb_Glenroy
## [1] 0
##
## $Suburb_GlenIris
## [1] 0
##
## $Suburb_Essendon
## [1] 0
##
## $Suburb_Brunswick
## [1] 0
##
## $Suburb_SouthYarra
## [1] 0
##
## $Suburb_StKilda
## [1] 0
##
## $Suburb_Preston
## [1] 0
##
## $Suburb_Richmond
## [1] 0
##
## $Suburb_BentleighEast
## [1] 0
##
## $Suburb_Reservoir
## [1] 0
##
## $Type_u
## [1] 0
##
## $Type_h
## [1] 0
##
## $Method_SP
## [1] 0
##
## $Method_PI
## [1] 0
##
## $Method_S
## [1] 0
##
## $SellerG_Kay
## [1] 0
##
## $SellerG_Hodges
## [1] 0
##
## $SellerG_McGrath
## [1] 0
##
## $SellerG_Noel
## [1] 0
##
## $SellerG_Gary
## [1] 0
##
## $SellerG_Jas
## [1] 0
##
## $SellerG_Miles
## [1] 0
##
## $SellerG_Greg
## [1] 0
##
## $SellerG_Sweeney
## [1] 0
##
## $SellerG_RT
## [1] 0
##
## $SellerG_Fletchers
## [1] 0
##
## $SellerG_Woodards
## [1] 0
##
## $SellerG_Brad
## [1] 0
##
## $SellerG_Biggin
## [1] 0
##
## $SellerG_Ray
## [1] 0
##
## $SellerG_Buxton
## [1] 0
##
## $SellerG_Marshall
## [1] 0
##
## $SellerG_Barry
## [1] 0
##
## $SellerG_hockingstuart
## [1] 0
##
## $SellerG_Jellis
## [1] 0
##
## $SellerG_Nelson
## [1] 0
##
## $CouncilArea_Whitehorse
## [1] 0
##
## $CouncilArea_Manningham
## [1] 0
##
## $CouncilArea_Brimbank
## [1] 0
##
## $CouncilArea_HobsonsBay
## [1] 0
##
## $CouncilArea_Bayside
## [1] 0
##
## $CouncilArea_Melbourne
## [1] 0
##
## $CouncilArea_Banyule
## [1] 0
##
## $CouncilArea_PortPhillip
## [1] 0
##
## $CouncilArea_Yarra
## [1] 0
##
## $CouncilArea_Maribyrnong
## [1] 0
##
## $CouncilArea_Stonnington
## [1] 0
##
## $CouncilArea_GlenEira
## [1] 0
##
## $CouncilArea_Darebin
## [1] 0
##
## $CouncilArea_MooneeValley
## [1] 0
##
## $CouncilArea_Moreland
## [1] 0
##
## $CouncilArea_Boroondara
## [1] 0
##
## $CouncilArea_
## [1] 0
Now that we are done with data preparation, let us separate the data. We will filter by train & test, remove the ‘data’ column from the train set, and remove both ‘data’ & the target variable ‘Price’ from the test set.
h_train=h %>% filter(data=="train") %>% select(-data)
h_test=h %>% filter(data=="test") %>% select(-data,-Price)
Next we will break our train data into two parts in a 70:30 ratio. We will build the model on one part & check its performance on the other.
s=sample(1:nrow(h_train),0.7*nrow(h_train))
h_train1=h_train[s,]
h_train2=h_train[-s,]
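Note that sample() draws a different split on every run; if we wanted this 70:30 split to be reproducible we could fix the random seed first (a minimal sketch, the seed value 2 being an arbitrary choice):
set.seed(2)   # any fixed seed makes the split reproducible
s=sample(1:nrow(h_train),0.7*nrow(h_train))
h_train1=h_train[s,]
h_train2=h_train[-s,]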
Let us glimpse our data again to check if we need to convert any data type.
glimpse(h_train1)
## Observations: 5,275
## Variables: 84
## $ Rooms <int> 5, 3, 5, 3, 3, 1, 3, 4, 3, 6, 5, 3, 3...
## $ Price <int> 2010000, 710000, 1035000, 465500, 134...
## $ Distance <dbl> 7.8, 9.4, 13.5, 15.0, 13.0, 6.4, 13.5...
## $ Postcode <int> 3124, 3081, 3042, 3021, 3166, 3011, 3...
## $ Bedroom2 <dbl> 2.78618, 3.00000, 2.78618, 2.78618, 3...
## $ Bathroom <dbl> 1.499247, 2.000000, 1.499247, 1.49924...
## $ Car <dbl> 1.510624, 4.000000, 1.510624, 1.51062...
## $ Landsize <dbl> 452.4478, 1.0000, 452.4478, 452.4478,...
## $ BuildingArea <dbl> 143.0472, 143.0472, 143.0472, 143.047...
## $ YearBuilt <dbl> 1961.046, 1961.046, 1961.046, 1961.04...
## $ Suburb_Doncaster <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_AscotVale <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Footscray <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_MooneePonds <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Thornbury <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Hampton <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Yarraville <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ Suburb_Balwyn <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_MalvernEast <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Camberwell <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Carnegie <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_PortMelbourne <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Bentleigh <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_PascoeVale <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_BrightonEast <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Hawthorn <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_BalwynNorth <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0...
## $ Suburb_Coburg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Northcote <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Kew <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Brighton <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Glenroy <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0...
## $ Suburb_GlenIris <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Essendon <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Brunswick <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_SouthYarra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_StKilda <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Preston <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Richmond <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_BentleighEast <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Suburb_Reservoir <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ Type_u <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
## $ Type_h <dbl> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1...
## $ Method_SP <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Method_PI <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ Method_S <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1...
## $ SellerG_Kay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Hodges <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_McGrath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0...
## $ SellerG_Noel <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ SellerG_Gary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Jas <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1...
## $ SellerG_Miles <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Greg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Sweeney <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_RT <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Fletchers <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ SellerG_Woodards <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Brad <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Biggin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Ray <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Buxton <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Marshall <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Barry <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_hockingstuart <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Jellis <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ SellerG_Nelson <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0...
## $ CouncilArea_Whitehorse <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Manningham <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Brimbank <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_HobsonsBay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Bayside <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Melbourne <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Banyule <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_PortPhillip <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Yarra <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Maribyrnong <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1...
## $ CouncilArea_Stonnington <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_GlenEira <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Darebin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_MooneeValley <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Moreland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ CouncilArea_Boroondara <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ CouncilArea_ <dbl> 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0...
RandomForest has four main tuning parameters: mtry, ntree, maxnodes & nodesize. Let us set the parameter values that we want to try out.
param=list(mtry=c(5,10,20,30),
           ntree=c(50,100,200,500),
           maxnodes=c(5,10,15,20),
           nodesize=c(1,2,5,10))
The above leads to 4*4*4*4, i.e. 256, possible combinations. Technically we should cross-validate the performance of all of these to find the best one, but that might be overkill in terms of time. A 10-fold cross validation would mean 256 x 10 = 2,560 random forest models being built; if on average each forest had 100 trees, we are looking at 256,000 decision trees being built internally. That is extremely resource consuming. At the same time, we do not really need to try every possible combination; instead we can randomly select a much smaller subset to try.
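We can confirm the number of combinations directly from the parameter list (a quick check):
nrow(expand.grid(param))
## [1] 256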
Let’s write a function which selects a random subset of these combinations.
subset_paras=function(full_list_para,n=10){
  all_comb=expand.grid(full_list_para)
  s=sample(1:nrow(all_comb),n)
  subset_para=all_comb[s,]
  return(subset_para)
}
If we pass the list of parameter values (full_list_para) that we want to try and specify the number of combinations (n), it randomly selects that many combinations out of all possible parameter combinations and returns them as a data frame.
We will take num_trials as 40, which is around 15% of the total possible combinations (4^4, i.e. 256), since a good value for num_trials is around 10-20% of the total.
num_trials=40
my_params=subset_paras(param,num_trials)
We’ll be using the cvTuning function from the package cvTools to try out these parameter combinations one by one. We’ll compare their cross-validation error measures and pick as the best combination the one that results in the lowest error.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
library(cvTools)
## Loading required package: lattice
## Loading required package: robustbase
In the code below, every time we find a parameter combination with the lowest error so far, it is printed in the output.
myerror=9999999
for(i in 1:num_trials){
  print(paste0('starting iteration:',i))
  params=my_params[i,]
  k=cvTuning(randomForest,Price~.,
             data=h_train,
             tuning=params,
             folds=cvFolds(nrow(h_train), K=10, type="random"),
             seed=2)
  score.this=k$cv[,2]
  if(score.this<myerror){
    print(params)
    myerror=score.this
    print(myerror)
    best_params=params
  }
  print('DONE')
}
## [1] "starting iteration:1"
## mtry ntree maxnodes nodesize
## 49 5 50 20 1
## [1] 478453
## [1] "DONE"
## [1] "starting iteration:2"
## [1] "DONE"
## [1] "starting iteration:3"
## [1] "DONE"
## [1] "starting iteration:4"
## mtry ntree maxnodes nodesize
## 227 20 50 15 10
## [1] 426611
## [1] "DONE"
## [1] "starting iteration:5"
## [1] "DONE"
## [1] "starting iteration:6"
## [1] "DONE"
## [1] "starting iteration:7"
## [1] "DONE"
## [1] "starting iteration:8"
## mtry ntree maxnodes nodesize
## 256 30 500 20 10
## [1] 404771
## [1] "DONE"
## [1] "starting iteration:9"
## [1] "DONE"
## [1] "starting iteration:10"
## [1] "DONE"
## [1] "starting iteration:11"
## [1] "DONE"
## [1] "starting iteration:12"
## [1] "DONE"
## [1] "starting iteration:13"
## [1] "DONE"
## [1] "starting iteration:14"
## [1] "DONE"
## [1] "starting iteration:15"
## [1] "DONE"
## [1] "starting iteration:16"
## [1] "DONE"
## [1] "starting iteration:17"
## [1] "DONE"
## [1] "starting iteration:18"
## [1] "DONE"
## [1] "starting iteration:19"
## [1] "DONE"
## [1] "starting iteration:20"
## [1] "DONE"
## [1] "starting iteration:21"
## [1] "DONE"
## [1] "starting iteration:22"
## [1] "DONE"
## [1] "starting iteration:23"
## [1] "DONE"
## [1] "starting iteration:24"
## [1] "DONE"
## [1] "starting iteration:25"
## [1] "DONE"
## [1] "starting iteration:26"
## [1] "DONE"
## [1] "starting iteration:27"
## [1] "DONE"
## [1] "starting iteration:28"
## [1] "DONE"
## [1] "starting iteration:29"
## [1] "DONE"
## [1] "starting iteration:30"
## [1] "DONE"
## [1] "starting iteration:31"
## [1] "DONE"
## [1] "starting iteration:32"
## [1] "DONE"
## [1] "starting iteration:33"
## [1] "DONE"
## [1] "starting iteration:34"
## [1] "DONE"
## [1] "starting iteration:35"
## [1] "DONE"
## [1] "starting iteration:36"
## [1] "DONE"
## [1] "starting iteration:37"
## [1] "DONE"
## [1] "starting iteration:38"
## [1] "DONE"
## [1] "starting iteration:39"
## [1] "DONE"
## [1] "starting iteration:40"
## [1] "DONE"
For a tentative performance measure, we look at the latest (i.e. lowest) value of myerror:
myerror
## [1] 404771
Our best_params comes out to be
best_params
## mtry ntree maxnodes nodesize
## 256 30 500 20 10
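Before refitting on the full training data, we could optionally sanity-check these parameters on the 70:30 split created earlier; a minimal sketch (not run here) that computes RMSE on the held-out 30%:
# fit with the tuned parameters on the 70% partition (uses best_params from above)
rf.check=randomForest(Price~.,
                      mtry=best_params$mtry,
                      ntree=best_params$ntree,
                      maxnodes=best_params$maxnodes,
                      nodesize=best_params$nodesize,
                      data=h_train1)
# predict on the held-out 30% and compute root mean squared error
val.pred=predict(rf.check,newdata=h_train2)
sqrt(mean((h_train2$Price-val.pred)^2))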
Now that we have the best parameter values from cross validation, we’ll use them to build our RandomForest model on the entire training data and use that model for prediction on the test set.
h.rf.final=randomForest(Price~.,
                        mtry=best_params$mtry,
                        ntree=best_params$ntree,
                        maxnodes=best_params$maxnodes,
                        nodesize=best_params$nodesize,
                        data=h_train)
test.pred=predict(h.rf.final,newdata = h_test)
To see the first few predicted property prices for the test data set, we use the head function as below.
head(test.pred)
## 1 2 3 4 5 6
## 452407.0 617743.2 555290.8 1293809.2 1068916.5 961300.0
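If we wanted to save these predictions for submission, a minimal sketch (the output file name is only an assumption):
# write the predicted prices to a csv file; the file name is illustrative
write.csv(data.frame(Price=test.pred),"housing_test_predictions.csv",row.names=FALSE)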