Read the data
The goal of the analysis here below is to build a model that is able to predict the estimated sales price of a real estate in Iceland-Capital area.
Dataset consisting of sale prices of around 40000 real estate sold in Iceland from 2008 - 2018.
Data can be found here too:
https://www.skra.is/library/Samnyttar-skrar-/Fyrirtaeki-stofnanir/Fasteignamat- 2019/gagnasafn_ib_2018.csv
gagnasafn_ib_2018<-read.csv2("gagnasafn_ib_2018.csv", header = T)
variables_desc<-read.csv2("variables.csv",colClasses = "character", header = F)
# I've copied and pasted the variables listed in the pdf from Oli in a notepad file.
# I've named the file variables.csv
# This is how I'll select the variable from the full data set (gagnasafn_ib_2018)
names(gagnasafn_ib_2018)
## [1] "faerslunumer" "rfastnum" "kdagur" "nuvirdi"
## [5] "kaupverd" "grfast" "grlaus" "teg_eign"
## [9] "svfn" "byggd" "lodpflm" "abnflm"
## [13] "nythl" "adferd" "efnu" "byggar"
## [17] "mbstig" "ist120" "efstah" "haednr"
## [21] "fjibmhl" "fjmib" "lyfta" "ummal"
## [25] "haedflm" "birtm2" "ibm2" "ntm2"
## [29] "fjhaed" "fjbilsk" "fjbkar" "fjsturt"
## [33] "fjklos" "fjeld" "fjherb" "fjstof"
## [37] "fjgeym" "studull" "stig10" "ib1m2"
## [41] "ib2m2" "ib3m2" "bilskurm2" "bilgm2"
## [45] "svalm2" "geymm2" "rism2" "matssvaedi"
## [49] "undirmatssvaedi" "fjibhaed" "ibteg" "sernotad"
## [53] "fjolnotad"
head(variables_desc)
## V1 V2
## 1 rfastnum property id
## 2 kdagur salesdate
## 3 nuvirdi price
## 4 teg_eign type of property (see below)
## 5 svfn city/town id
## 6 byggar year built
I discard variables that are irrelevant for the analysis - as guided by Oli.
NB: my assumption below is that fjbilsk is the fjbilast variable that Oli refers to.
Other discards:
Number of storage rooms should not be important, given we have storage room area variable. Hence discard number of storage rooms (fjgeym).
Property circumference variable (ummal) - I drop this on grounds that property area variable provides all the information that is needed here.
Same reasoning can be applied to variable ntm2 (netto area). The variable ibm2 (property area) should carry similar information.
Property id (rfastnum) is also not a meaningful explanatory variable, hence discarded.
ibteg and teg_eign carry exactly the same info: Type of property. So we discard ibteg
variables<-trimws(variables_desc[,1])
# Remove the useless spaces at the end that will create problem while matching.
variables %in% names(gagnasafn_ib_2018)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE TRUE TRUE
# We assume: fjbilast = fjbilsk
gagnasafn_ib_2018<-dplyr::rename(gagnasafn_ib_2018, fjbilast = fjbilsk)
data<-gagnasafn_ib_2018 %>% dplyr::select(all_of(variables))
data<-dplyr::select(data, -fjgeym, -ummal, -rfastnum, -ntm2, - ibteg)
I discard real estates outside the capital area and I also create a new “year” and “month” variable.
data.rkv<-data %>% filter(svfn<2000)
data.rkv$svfn<-data.rkv$svfn %>% as.factor()
data.rkv$year<-substr(data.rkv$kdagur,7,10) %>% as.integer()
data.rkv$month<-substr(data.rkv$kdagur,4,5) %>% factor()
attributes(data.rkv$month)
## $levels
## [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12"
##
## $class
## [1] "factor"
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
levels(data.rkv$month)<-month_levels # rename the levels
dplyr::select(data.rkv, kdagur, year, month) %>% sample_n(6)
## kdagur year month
## 1 01.01.2015 2015 Jan
## 2 04.10.2013 2013 Oct
## 3 05.05.2017 2017 May
## 4 11.12.2013 2013 Dec
## 5 03.06.2015 2015 Jun
## 6 24.09.2015 2015 Sep
# just to check that I have not messed up the date/month/year
data.rkv<-dplyr::select(data.rkv, -kdagur) # no need to keep kdagur at this point
We now check whether the variables are expressed with the appropriate class and change it if needed.
str(data.rkv) # to check if there is any variable that we need to change the class
## 'data.frame': 35131 obs. of 24 variables:
## $ nuvirdi : int 16400 27871 34592 24243 47331 11550 25175 25253 21181 25694 ...
## $ teg_eign : Factor w/ 12 levels "Einbýlishús",..: 6 6 6 6 1 6 6 6 6 6 ...
## $ svfn : Factor w/ 7 levels "0","1000","1100",..: 1 1 4 5 1 1 2 4 5 2 ...
## $ byggar : Factor w/ 149 levels "(null)","1787",..: 104 110 135 122 127 75 119 137 89 134 ...
## $ efstah : int 4 10 3 3 2 2 3 6 2 9 ...
## $ fjmib : int 1 1 1 1 1 1 1 1 1 1 ...
## $ lyfta : int 0 1 0 0 0 0 0 3 0 2 ...
## $ ibm2 : num 57.7 72.8 116.8 96 205.4 ...
## $ fjhaed : int 1 1 1 1 3 1 1 1 2 1 ...
## $ fjbilast : int 0 1 0 0 0 0 0 1 0 0 ...
## $ fjbkar : int 1 1 1 1 1 0 1 0 0 1 ...
## $ fjsturt : int 0 0 1 0 1 1 0 1 1 0 ...
## $ fjklos : int 1 1 1 1 2 1 1 2 1 1 ...
## $ fjeld : int 1 1 1 1 1 1 1 1 1 1 ...
## $ fjherb : int 1 2 3 3 5 1 2 1 3 3 ...
## $ fjstof : int 1 1 1 1 2 1 1 1 2 1 ...
## $ stig10 : num 10 10 10 10 9.9 10 10 10 10 10 ...
## $ bilskurm2 : Factor w/ 642 levels "-70","(null)",..: 3 3 3 3 564 3 134 3 3 3 ...
## $ svalm2 : Factor w/ 723 levels "-0,2","(null)",..: 3 617 664 3 3 3 549 3 480 40 ...
## $ geymm2 : Factor w/ 769 levels "(null)","0","0,1",..: 55 2 2 27 2 14 656 752 2 747 ...
## $ matssvaedi : int 161 72 700 620 290 100 330 510 680 340 ...
## $ undirmatssvaedi: int 0 0 0 0 0 0 0 0 0 0 ...
## $ year : int 2015 2015 2015 2014 2015 2012 2012 2012 2012 2012 ...
## $ month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 2 11 12 1 12 1 1 1 1 ...
data.rkv$byggar<-data.rkv$byggar %>% as.character() %>% as.integer()
data.rkv$bilskurm2<-data.rkv$bilskurm2 %>% as.numeric()
data.rkv$svalm2<-data.rkv$svalm2 %>% as.numeric()
data.rkv$geymm2<-data.rkv$geymm2 %>% as.numeric()
data.rkv$matssvaedi<-data.rkv$matssvaedi %>% factor()
Now we are going to check the range of the variables.
We decide to focus on real estate only and discard hotels and workshops (“Hótelstarfsemi”, “Vinnustofa”) as they may obey to different market logic, and in order to narrow the focus of our investigation. It would be like to compare apples and oranges.
Herbergi and Séreign have been discarded as well since there are very few observations with such label and they are not well defined.
Fjölbýlishús, Íbúðareign and Íbúðarhús have been collapsed under level Íbúðareign since they all describe the same type of house.
The big majority of the observations have 0 as undirmatssvaedi (sub location of property) which does not indicate any specific sublocation. In our opinion this invalidate the whole variable and it is a valid reason to exclude it from the study.
data.rkv %>% summary()
## nuvirdi teg_eign svfn byggar
## Min. : 5900 Íbúðareign :28776 0 :19957 Min. :1841
## 1st Qu.: 25200 Einbýlishús : 3182 1000: 6090 1st Qu.:1963
## Median : 33152 Raðhús : 2253 1100: 645 Median :1983
## Mean : 36889 Parhús : 780 1300: 2483 Mean :1981
## 3rd Qu.: 43969 Ósamþykkt íbúð: 66 1400: 4778 3rd Qu.:2003
## Max. :956146 Íbúðarhús : 47 1604: 1173 Max. :2018
## (Other) : 27 1606: 5 NA's :10
## efstah fjmib lyfta ibm2 fjhaed
## Min. : 1.000 Min. :1.000 Min. :0.000 Min. : 17.7 Min. :1.0
## 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.:0.000 1st Qu.: 74.7 1st Qu.:1.0
## Median : 3.000 Median :1.000 Median :0.000 Median : 95.4 Median :1.0
## Mean : 3.401 Mean :1.006 Mean :0.433 Mean :105.1 Mean :1.2
## 3rd Qu.: 4.000 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:122.0 3rd Qu.:1.0
## Max. :13.000 Max. :4.000 Max. :8.000 Max. :420.1 Max. :4.0
##
## fjbilast fjbkar fjsturt fjklos
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.000
## Median :0.0000 Median :1.0000 Median :1.0000 Median :1.000
## Mean :0.2052 Mean :0.7711 Mean :0.5569 Mean :1.197
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :8.0000 Max. :3.0000 Max. :4.0000 Max. :6.000
##
## fjeld fjherb fjstof stig10
## Min. :0.000 Min. : 0.000 Min. : 0.00 Min. : 4.000
## 1st Qu.:1.000 1st Qu.: 2.000 1st Qu.: 1.00 1st Qu.:10.000
## Median :1.000 Median : 2.000 Median : 1.00 Median :10.000
## Mean :1.005 Mean : 2.498 Mean : 1.24 Mean : 9.972
## 3rd Qu.:1.000 3rd Qu.: 3.000 3rd Qu.: 1.00 3rd Qu.:10.000
## Max. :4.000 Max. :24.000 Max. :14.00 Max. :10.000
##
## bilskurm2 svalm2 geymm2 matssvaedi
## Min. : 1.00 Min. : 1 Min. : 1.0 100 : 2314
## 1st Qu.: 3.00 1st Qu.: 3 1st Qu.: 2.0 600 : 2218
## Median : 3.00 Median :110 Median :124.0 320 : 1829
## Mean : 48.83 Mean :241 Mean :266.2 90 : 1714
## 3rd Qu.: 3.00 3rd Qu.:481 3rd Qu.:586.0 20 : 1669
## Max. :638.00 Max. :723 Max. :768.0 130 : 1466
## (Other):23921
## undirmatssvaedi year month
## Min. : 0.000 Min. :2012 Nov : 3254
## 1st Qu.: 0.000 1st Qu.:2013 Mar : 3194
## Median : 0.000 Median :2015 Oct : 3161
## Mean : 5.688 Mean :2015 May : 3095
## 3rd Qu.: 0.000 3rd Qu.:2016 Dec : 3006
## Max. :107.000 Max. :2018 Jun : 2944
## (Other):16477
data.rkv$teg_eign %>% summary()
## Einbýlishús Fjölbýlishús Gistihús Herbergi Hótelstarfsemi
## 3182 19 0 1 3
## Íbúðareign Íbúðarhús Ósamþykkt íbúð Parhús Raðhús
## 28776 47 66 780 2253
## Séreign Vinnustofa
## 2 2
data.rkv$teg_eign %>% attributes()
## $levels
## [1] "Einbýlishús" "Fjölbýlishús" "Gistihús" "Herbergi"
## [5] "Hótelstarfsemi" "Íbúðareign" "Íbúðarhús" "Ósamþykkt íbúð"
## [9] "Parhús" "Raðhús" "Séreign" "Vinnustofa"
##
## $class
## [1] "factor"
data.rkv$teg_eign<-fct_collapse(data.rkv$teg_eign,
Commercial = c("Hótelstarfsemi", "Vinnustofa"),
NotDefined = c("Herbergi", "Séreign"),
Íbúðareign = c("Fjölbýlishús", "Íbúðareign", "Íbúðarhús")
)
data.rkv<-subset(data.rkv, teg_eign != "Commercial" & teg_eign != "NotDefined")
data.rkv$teg_eign<-droplevels(data.rkv$teg_eign)
# remove levels not present in the dataset
data.rkv$undirmatssvaedi %>% factor() %>% summary()
## 0 1 2 3 4 6 7 8 9 10 11 12 13
## 29671 283 66 199 68 32 19 18 6 6 5 218 91
## 14 15 16 17 18 20 21 22 23 24 25 26 27
## 51 126 65 232 331 116 189 348 72 146 130 76 32
## 28 31 32 33 34 35 36 37 38 39 40 41 42
## 206 21 7 9 10 10 10 34 10 4 33 2 5
## 43 44 45 46 47 48 49 50 51 52 53 54 55
## 11 247 59 231 73 54 157 15 79 103 2 17 17
## 56 57 59 60 61 62 63 64 65 101 102 103 104
## 34 14 101 78 19 198 29 11 17 25 11 15 189
## 105 106 107
## 190 39 131
# the big majority of the observations have 0 as undirmatssvaedi (sub location of property) which does not indicate any specific sublocation
data.rkv<-dplyr::select(data.rkv, - undirmatssvaedi)
levels(data.rkv$svfn)
## [1] "0" "1000" "1100" "1300" "1400" "1604" "1606"
data.rkv$svfn<-fct_recode(data.rkv$svfn,
Reykjavík = "0",
Kópavogur = "1000",
Seltjarnarnes = "1100",
Garðabær = "1300",
Hafnarfjörður = "1400",
Mosfellsbær = "1604",
Kjósarhreppur= "1606"
)
levels(data.rkv$svfn)
## [1] "Reykjavík" "Kópavogur" "Seltjarnarnes" "Garðabær"
## [5] "Hafnarfjörður" "Mosfellsbær" "Kjósarhreppur"
Correlation matrix
data.rkv %>%
ggcorr(label=T, label_size = 2)
As expected, pric (nuvirdi) is positively correlated to the apartment area expressed in square meters (ibm2).
Ibm2 is correlated to the number of rooms and toilets.
Year also is positively correlated with nuvirdi (price) - perhaps unsurprisingly as there was economic growth in each year in 2012-2018 (growing may stimulate demand, though of course house build - a supply factor - could affect price the other direction).
Additional analysis with GGPLOT
ggplot(data.rkv) +
geom_boxplot(mapping=aes(x=teg_eign, y=nuvirdi))
filter(data, nuvirdi==max(nuvirdi)) # suspicious observation, nuvirdi is extremely high.
## kdagur nuvirdi teg_eign svfn byggar efstah fjmib lyfta ibm2 fjhaed
## 1 27.05.2016 956146 Íbúðareign 1604 2016 3 1 1 110.5 1
## fjbilast fjbkar fjsturt fjklos fjeld fjherb fjstof stig10 bilskurm2 svalm2
## 1 1 0 1 1 1 2 1 10 0 19,5
## geymm2 matssvaedi undirmatssvaedi
## 1 4,5 850 0
# This really looks like an error, we exclude the observation. Data instead of data.rkv so we can see the value for the columns excluded as well.
data.rkv<-data.rkv[-which.max(data.rkv$nuvirdi),]
ggplot(data.rkv) +
geom_boxplot(mapping=aes(x=teg_eign, y=nuvirdi)) +
scale_y_continuous(labels = comma, limits=c(0, 200000)) +
labs(x="type of real estate", y="nuvirdi = price")
ggplot(data.rkv) +
geom_histogram(aes(x=nuvirdi)) +
scale_x_continuous(labels = comma, limits=c(0, 100000)) +
labs(x="nuvirdi = price")
# distribution of nuvirdi seems slightly skewed
ggplot(data.rkv) +
geom_histogram(aes(x=log(nuvirdi))) +
scale_x_continuous(labels = comma) +
labs(x="nuvirdi = price")
# distribution of nuvirdi looks more normal now.
ggplot(data.rkv) +
geom_density(aes(x=nuvirdi , col=teg_eign)) +
scale_x_continuous(labels = comma,limits=c(0, 100000)) +
labs(x="nuvirdi = price", col="type of real estate")
This plot above shows that íbuð (appartments) and especially óasmþyktt íbuð tend to be much cheaper than other types Parhús, Raðhús and Einbýlishús.
ggplot(data.rkv) +
geom_density(aes(x=log(nuvirdi) , col=teg_eign)) +
scale_x_continuous(labels = comma) +
labs(x="nuvirdi = price", col="type of real estate")
ggplot(data.rkv, mapping=aes(x=byggar, y=nuvirdi)) +
geom_point(alpha = 1/10, position = "jitter") +
geom_smooth() +
scale_x_continuous(name="building year",
breaks=seq(0,2019,10)) +
scale_y_continuous(labels = comma)
Here we see that there is a relatively flat relationship between price and building year.
data.rkv %>% ggplot() +
geom_boxplot(mapping=aes(x=as.factor(year), y=nuvirdi)) +
labs(x="year", y="nuvirdi = price") +
scale_y_continuous(labels = comma, limits=c(0, 250000))
data.rkv %>% ggplot() +
geom_boxplot(mapping=aes(x=month, y=nuvirdi)) +
labs(y="nuvirdi = price") +
scale_y_continuous(labels = comma, limits=c(0, 250000))
These chart suggest that “year” of purchase may be important determinant of price, though “month” less so.
data.rkv %>% group_by(year,teg_eign) %>%
dplyr::summarise(avg_price=mean(nuvirdi)) %>%
ggplot() +
geom_line(mapping=aes(x=year, y=avg_price, col=teg_eign)) +
labs(col="type of real estate") +
scale_x_continuous(breaks=seq(2000,2019,1))
This suggests that for the most part, prices may increase by year, but that in the final year the increase is less substantial, indeed prices plateau for einbýlishús in 2018 and a decrease for ósamþykkt íbúð in 2018.
ggplot(data.rkv,mapping = aes(x=ibm2,y=nuvirdi)) +
geom_point(col="light green") +
geom_smooth() +
scale_y_continuous(labels = comma) +
labs(x="property area [m^2]",col="type of real estate") +
facet_wrap(~teg_eign)
Here we see that ósamþykkt íbúð tend to be much smaller and cheaper, and that locations Hafnarfjórður and Mosfellsbær generally look cheaper per square meter than other areas.
ggplot(data.rkv,mapping = aes(x=ibm2,y=nuvirdi)) +
geom_point(col="light green") +
geom_smooth() +
scale_y_continuous(labels = comma) +
labs(x="property area [m^2]",col="type of real estate") +
facet_wrap(~svfn)
data.rkv %>% ggplot() +
geom_boxplot(mapping=aes(x=svfn, y=nuvirdi)) +
labs(x="City ID", y="nuvirdi = price") +
scale_y_continuous(labels = comma, limits=c(0, 250000))
The first box plot above shows the difference in prices in cities, while the second shows differences by neighbourhood matsvæði.
This shows that not all the neighbors are alike. This is reinforced by the plot below which compares the average price per square meter between neighbors, showing that some areas have a greater premium than other.
data.rkv %>% ggplot() +
geom_boxplot(mapping=aes(x=matssvaedi, y=nuvirdi)) +
labs(x="Neighbor", y="nuvirdi = price") +
scale_y_continuous(labels = comma, limits=c(0, 250000)) +
theme(axis.text.x = element_text(angle = 90))
However, it does not seem to be convenient to keep all the levels of matssvaedi - it makes the read out of results very lengthy and cumbersome for the reader. For that reason, we create a new variable that bunches neighbourhoods into fourteen proximate groupings.
The plot here below compares the average price per square meter between neighborhoods. Intuitively we can say that some areas are more premium than other.
data.rkv %>% mutate(m2price=nuvirdi/ibm2) %>% group_by(matssvaedi, svfn) %>%
dplyr::summarise(avg_m2price=mean(m2price)) %>%
ggplot() +
geom_col(aes(x=matssvaedi, y=avg_m2price, fill=svfn)) +
labs(x="Neighbor", y="Avg price per m2", fill="City") +
scale_y_continuous(labels = comma) +
theme(axis.text.x = element_text(angle = 90))
newneighborhoods<-read_excel("NewNeighbourhoods.xlsx") %>% data.frame()
newneighborhoods$Matsv<-newneighborhoods$Matsv %>% as.factor()
data.rkv<-left_join(data.rkv,newneighborhoods, by = c("matssvaedi"="Matsv"))
data.rkv %>% dplyr::select(svfn,matssvaedi,City,New.group.Number, New.Group.name) %>% sample_n(10) # run it few time just to check that the match was done correctly.
## svfn matssvaedi City New.group.Number
## 1 Hafnarfjörður 600 Hafnarfjorður 12
## 2 Reykjavík 100 Reykjavik 2
## 3 Reykjavík 200 Reykjavik 6
## 4 Garðabær 550 Garðabær 10
## 5 Reykjavík 80 Reykjavik 2
## 6 Reykjavík 80 Reykjavik 2
## 7 Reykjavík 130 Reykjavik 3
## 8 Hafnarfjörður 620 Hafnarfjorður 12
## 9 Reykjavík 150 Reykjavik 4
## 10 Reykjavík 11 Reykjavik 1
## New.Group.name
## 1 Hafnarfjorður
## 2 Reykjavik Near centre
## 3 Reykjavik City limits
## 4 Garðabær
## 5 Reykjavik Near centre
## 6 Reykjavik Near centre
## 7 Reykjavik Grafarvogur and on
## 8 Hafnarfjorður
## 9 Reykjavik Breiðholt and on
## 10 Reykjavik Central
# we can drop svfn, matssvaedi, city and New.group.Number and keep only New.Group.name
data.rkv<-data.rkv %>% dplyr::select(-svfn,-matssvaedi,-City,-New.group.Number,)
data.rkv$New.Group.name %>% class()
## [1] "character"
data.rkv$New.Group.name<-as.factor(data.rkv$New.Group.name)
levels(data.rkv$New.Group.name)
## [1] "Garðabær" "Garðabær Peninsula"
## [3] "Hafnarfjorður" "Kjosahreppur"
## [5] "Kopavogur Heart of" "Kopavogur Inland"
## [7] "Mosfellsbær" "Reykjavik Breiðholt and on"
## [9] "Reykjavik Central" "Reykjavik City limits"
## [11] "Reykjavik Grafarvogur and on" "Reykjavik Near centre"
## [13] "Reykjavik Near Esja" "Seltjarnarnes"
data.rkv %>% mutate(m2price=nuvirdi/ibm2) %>% group_by(New.Group.name) %>%
dplyr::summarise(avg_m2price=mean(m2price)) %>%
ggplot() +
geom_col(aes(x=New.Group.name, y=avg_m2price, fill=New.Group.name)) +
labs(x="New Neighborhoods", y="Avg m2 price") +
scale_y_continuous(labels = comma) +
theme(axis.text.x = element_text(angle = 90),legend.position = "none")
The exploratory data analysis could go on and on but, since we have decided to use tree based models, we decided that we can move to the next step.
data.rkv<-dplyr::rename(data.rkv, new.area = New.Group.name)
# I'am renaming it simpy because in the tree model it looks better.
data.rkv4tree<-na.omit(data.rkv)
set.seed(4444)
train.index<-sample(1:nrow(data.rkv4tree), nrow(data.rkv4tree)*0.8)
train.set<-data.rkv4tree[train.index,]
test.set<-data.rkv4tree[-train.index,]
tree.reg_rp<-rpart(nuvirdi~. , train.set, method = 'anova')
yhat<-predict(tree.reg_rp, newdata = test.set)
testMSE<-mean((yhat-test.set$nuvirdi)^2)
sqrt(testMSE) # most probably the accuracy using the tree model is lower than other regression model
## [1] 9786.866
# MSE is used as an estimator for sigma^2
rpart.plot(tree.reg_rp, faclen=2)
ibm2 appears on many nodes of the tree, this might indicate that it size of the house is an important indicator when predicting the sale price.
This is in line with the correlation matrix we plotted earlier.
Sales year and location of the real estate (area), is also another important variable when predicting sales price.
Bagging and random forest grow many trees, each tree is grown based on a bootstrapped data frame. We do this in order to reduce the variance of the basic tree model (high variance). So basically we grow n tree each tree gives me a prediction. In order to know the estimated prediction error, for each ith obs, we use all the predictions whose tree was grown without that ith obs. Bootstrapping uses replacement therefore it has been observed that on average bootstrapping sample 2/3 of the observation. This means that for each obs, we can use approx n tree/3 predictions. We average these n tree/3 and we end up having one prediction for each obs. The prediction is then compared to the response in the test data set. Now we can calculate the MSE.
do.trace = T prints the intermediate OOB MSE which help us to set the number of tree that it is sensible to grow.
OOB MSE: it is the predicted error estimated on the out of bag obs.
p<-ncol(data.rkv4tree)-1
m<-sqrt(p) %>% ceiling()
m.val<-c(m, m+1, m+2, m+3)
test.rmse<-rep(NA,length(m.val))
for(i in seq_along(m.val)){
rf.fit<-randomForest(x=train.set[,-1],
y=train.set[,1],
ntree = 150,
mtry = m.val[i],
do.trace = F,
importance = T)
test.pred = predict(rf.fit, newdata = test.set)
test.rmse[i] = mean((test.set$nuvirdi - test.pred)^2) %>% sqrt()
}
ggplot() + geom_col(mapping=aes(x=m.val,y=test.rmse)) +
labs(x="m value", y="RMSE")
best.m<-m.val[which.min(test.rmse)]
# The higher the ntree the lower the variance of the prediction
# rf2$mse: vector of mean square errors: sum of squared residuals divided by n
The plot here above shows how RMSE changes depending on m. is a number specifying the size of the random subset of variables considered at each split.
Apparently larger of the square root of the number of predictors return slightly better results.
varImpPlot(rf.fit, main = "Importance")
rmse.rf<-min(test.rmse)
rmse.rf
## [1] 5577.282
The graph created with varImpPlot() shows the importance of each variables. We get similar results as the simple tree grown earlier: year (year of sale), ibm2 (size), new.area (neighborhood) and byggar are the most important predictor for nuvirdi.
RandomForest is quite expensive computationally speaking, the package Ranger fit a model in a much faster way. This allows us to grow more trees.
Ranger
rf.ranger<-ranger(x=train.set[,-1],
y=train.set[,1],
num.trees = 5000,
mtry = m+3,
importance = "permutation",
verbose = F)
yhat.rng<-predict(rf.ranger, data = test.set)
mse.rf.rng<-mean((yhat.rng$predictions - test.set$nuvirdi)^2)
rmse.ranger<-mse.rf.rng %>% sqrt()
rmse.ranger
## [1] 5858.928
Let’s try to use RandomForest on a subset of variables only, the most important ones. Reducing the variable space should dramatically reduce the computational time, therefore we can increase the number of tree grown.
However this approach doesn’t seem to pay off. MSE is higher than the one we got using all the variables.
Randoom Forest with selected variables
train.set.snk<-dplyr::select(train.set, nuvirdi, year, ibm2, byggar,teg_eign, bilskurm2)
names(train.set.snk)
## [1] "nuvirdi" "year" "ibm2" "byggar" "teg_eign" "bilskurm2"
dim(train.set.snk)
## [1] 28089 6
rf.fit.snk<-randomForest(x=train.set.snk[,-1],
y=train.set.snk[,1],
ntree=2000,
do.trace=F,
importance = T)
yhat.rf.snk<-predict(rf.fit.snk, newdata = test.set)
mse.rf.snk<-mean((yhat.rf.snk - test.set$nuvirdi)^2)
rmse.rf.snk<-mse.rf.snk %>% sqrt()
rmse.rf.snk
## [1] 8109.159
Here with boosting we don’t grow every tree based on a bootstrapped data frame, instead we are growing trees sequentially, in the sense that we fit a tree based on the residuals from the previous tree. We have 3 main parameters to set when using boosting with tree models.
B: number of trees to be grown
d: split for each tree grown. d is often small (1,..5)
\(\lambda\): learning rate, which basically set the weight on the new residual prediction which will update the response prediction. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem. Very small \(\lambda\) can require using a very large value of B in order to achieve good performance.
\[ f(x) ← \hat{f}(x) + λ\hat{f}^b(x)\]
Boosting with Caret
# BOOSTING WITH CARET with tuning
train.ctrl<-trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
search="grid",
verboseIter = F)
myGrid<-expand.grid(n.trees = 150,
interaction.depth = c(1,3,5),
shrinkage = c(0.001, 0.05, 0.15, 0.3),
n.minobsinnode = 10)
set.seed(4444)
gbm.caret.tune<-train(nuvirdi~ .,
data = train.set,
method = "gbm",
#distribution = "gaussian",
#metric=metric,
trControl = train.ctrl,
verbose = F,
tuneGrid=myGrid)
print(gbm.caret.tune)
## Stochastic Gradient Boosting
##
## 28089 samples
## 21 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 22472, 22472, 22472, 22470, 22470, 22472, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth RMSE Rsquared MAE
## 0.001 1 16352.327 0.4422578 11688.169
## 0.001 3 16051.196 0.6009087 11460.462
## 0.001 5 15904.413 0.6527771 11340.201
## 0.050 1 8827.950 0.7625360 5518.930
## 0.050 3 7150.959 0.8336398 4382.220
## 0.050 5 6666.950 0.8536838 4072.058
## 0.150 1 7544.895 0.8124846 4503.224
## 0.150 3 6373.239 0.8647891 3864.279
## 0.150 5 6062.805 0.8775310 3652.632
## 0.300 1 7406.800 0.8172655 4474.819
## 0.300 3 6189.903 0.8721890 3758.934
## 0.300 5 6076.789 0.8767428 3650.214
##
## Tuning parameter 'n.trees' was held constant at a value of 150
## Tuning
## parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 5, shrinkage = 0.15 and n.minobsinnode = 10.
plot(gbm.caret.tune)
From the plot we can see that boosting with 3 and 5 nodes yields similar results.
myGrid.best<-gbm.caret.tune$bestTune
myGrid.best
## n.trees interaction.depth shrinkage n.minobsinnode
## 9 150 5 0.15 10
set.seed(1)
gbm.caret.best<-train(nuvirdi~ .,
data=train.set,
method = "gbm",
distribution="gaussian",
trControl = train.ctrl,
verbose = F,
tuneGrid=myGrid.best)
test.pred<-predict(gbm.caret.best, newdata = test.set)
boostcaret.test.mse<-mean((test.set$nuvirdi - test.pred)^2)
rmse.boost.caret<-boostcaret.test.mse %>% sqrt()
rmse.boost.caret
## [1] 5836.498
summary(gbm.caret.best)
## var
## ibm2 ibm2
## year year
## byggar byggar
## new.areaReykjavik Central new.areaReykjavik Central
## fjklos fjklos
## bilskurm2 bilskurm2
## fjbilast fjbilast
## geymm2 geymm2
## new.areaReykjavik Breiðholt and on new.areaReykjavik Breiðholt and on
## new.areaSeltjarnarnes new.areaSeltjarnarnes
## fjsturt fjsturt
## efstah efstah
## new.areaHafnarfjorður new.areaHafnarfjorður
## teg_eignÍbúðareign teg_eignÍbúðareign
## new.areaReykjavik Near centre new.areaReykjavik Near centre
## fjherb fjherb
## fjhaed fjhaed
## fjstof fjstof
## stig10 stig10
## lyfta lyfta
## new.areaMosfellsbær new.areaMosfellsbær
## svalm2 svalm2
## new.areaReykjavik Near Esja new.areaReykjavik Near Esja
## fjmib fjmib
## new.areaReykjavik City limits new.areaReykjavik City limits
## new.areaGarðabær Peninsula new.areaGarðabær Peninsula
## new.areaReykjavik Grafarvogur and on new.areaReykjavik Grafarvogur and on
## monthDec monthDec
## fjeld fjeld
## monthFeb monthFeb
## new.areaKopavogur Inland new.areaKopavogur Inland
## teg_eignÓsamþykkt íbúð teg_eignÓsamþykkt íbúð
## new.areaKopavogur Heart of new.areaKopavogur Heart of
## monthOct monthOct
## monthNov monthNov
## monthMar monthMar
## teg_eignRaðhús teg_eignRaðhús
## fjbkar fjbkar
## monthSep monthSep
## teg_eignParhús teg_eignParhús
## monthApr monthApr
## monthMay monthMay
## monthJun monthJun
## monthJul monthJul
## monthAug monthAug
## new.areaKjosahreppur new.areaKjosahreppur
## rel.inf
## ibm2 63.09222080
## year 15.09547925
## byggar 4.41394645
## new.areaReykjavik Central 3.58738340
## fjklos 2.64111396
## bilskurm2 2.49439761
## fjbilast 0.99756035
## geymm2 0.92120740
## new.areaReykjavik Breiðholt and on 0.86523284
## new.areaSeltjarnarnes 0.72579048
## fjsturt 0.67401911
## efstah 0.66662825
## new.areaHafnarfjorður 0.66402937
## teg_eignÍbúðareign 0.51578264
## new.areaReykjavik Near centre 0.34117072
## fjherb 0.29027142
## fjhaed 0.26722034
## fjstof 0.19534434
## stig10 0.16637417
## lyfta 0.16328227
## new.areaMosfellsbær 0.14365699
## svalm2 0.13221747
## new.areaReykjavik Near Esja 0.12041416
## fjmib 0.11149864
## new.areaReykjavik City limits 0.09660882
## new.areaGarðabær Peninsula 0.07314629
## new.areaReykjavik Grafarvogur and on 0.06542908
## monthDec 0.06300707
## fjeld 0.05099986
## monthFeb 0.05067215
## new.areaKopavogur Inland 0.04933845
## teg_eignÓsamþykkt íbúð 0.04658538
## new.areaKopavogur Heart of 0.04326940
## monthOct 0.03498325
## monthNov 0.03147157
## monthMar 0.03017410
## teg_eignRaðhús 0.02917988
## fjbkar 0.02680086
## monthSep 0.02209144
## teg_eignParhús 0.00000000
## monthApr 0.00000000
## monthMay 0.00000000
## monthJun 0.00000000
## monthJul 0.00000000
## monthAug 0.00000000
## new.areaKjosahreppur 0.00000000
tree_comp<-rbind(random_forest = rmse.rf,
random_forest_ranger = rmse.ranger,
random_forest_shrunk = rmse.rf.snk,
boosting_caret = rmse.boost.caret) %>% data.frame()
colnames(tree_comp)<-"Estimated_prediction_error"
kable(tree_comp) %>%
kable_styling(bootstrap_options = "striped", full_width = T)
| Estimated_prediction_error | |
|---|---|
| random_forest | 5577.282 |
| random_forest_ranger | 5858.928 |
| random_forest_shrunk | 8109.159 |
| boosting_caret | 5836.498 |
Random Forest seems to be the most accurate model as it returns the lowest MSE/RMSE. It is worth noticing that ranger package, even though is not as accurate as the standard RandomForest function, ranks as the second most accurate method but it handles a “big” variable space in a much better way then the other algorithms.
Boosting might be better than random forest but needs extensive tuning, which corresponds to many possible model configurations.
We conclude that the best model found can predict sale price with a standard error equal to 5577.2815323 [k ISK].