Getting and Cleaning Data

Read the data

The goal of the analysis here below is to build a model that is able to predict the estimated sales price of a real estate in Iceland-Capital area.

Dataset consisting of sale prices of around 40000 real estate sold in Iceland from 2008 - 2018.

Data can be found here too:

https://www.skra.is/library/Samnyttar-skrar-/Fyrirtaeki-stofnanir/Fasteignamat- 2019/gagnasafn_ib_2018.csv

gagnasafn_ib_2018<-read.csv2("gagnasafn_ib_2018.csv", header = T)

variables_desc<-read.csv2("variables.csv",colClasses = "character", header = F)
# I've copied and pasted the variables listed in the pdf from Oli in a notepad file.
# I've named the file variables.csv
# This is how I'll select the variable from the full data set (gagnasafn_ib_2018)
names(gagnasafn_ib_2018)

##  [1] "faerslunumer"    "rfastnum"        "kdagur"          "nuvirdi"        
##  [5] "kaupverd"        "grfast"          "grlaus"          "teg_eign"       
##  [9] "svfn"            "byggd"           "lodpflm"         "abnflm"         
## [13] "nythl"           "adferd"          "efnu"            "byggar"         
## [17] "mbstig"          "ist120"          "efstah"          "haednr"         
## [21] "fjibmhl"         "fjmib"           "lyfta"           "ummal"          
## [25] "haedflm"         "birtm2"          "ibm2"            "ntm2"           
## [29] "fjhaed"          "fjbilsk"         "fjbkar"          "fjsturt"        
## [33] "fjklos"          "fjeld"           "fjherb"          "fjstof"         
## [37] "fjgeym"          "studull"         "stig10"          "ib1m2"          
## [41] "ib2m2"           "ib3m2"           "bilskurm2"       "bilgm2"         
## [45] "svalm2"          "geymm2"          "rism2"           "matssvaedi"     
## [49] "undirmatssvaedi" "fjibhaed"        "ibteg"           "sernotad"       
## [53] "fjolnotad"

head(variables_desc)

##          V1                            V2
## 1 rfastnum                    property id
## 2   kdagur                      salesdate
## 3  nuvirdi                          price
## 4 teg_eign   type of property (see below)
## 5     svfn                   city/town id
## 6   byggar                     year built

I discard variables that are irrelevant for the analysis - as guided by Oli.

NB: my assumption below is that fjbilsk is the fjbilast variable that Oli refers to.

Other discards:

Number of storage rooms should not be important, given we have storage room area variable. Hence discard number of storage rooms (fjgeym).

Property circumference variable (ummal) - I drop this on grounds that property area variable provides all the information that is needed here.

Same reasoning can be applied to variable ntm2 (netto area). The variable ibm2 (property area) should carry similar information.

Property id (rfastnum) is also not a meaningful explanatory variable, hence discarded.

ibteg and teg_eign carry exactly the same info: Type of property. So we discard ibteg

variables<-trimws(variables_desc[,1])
# Remove the useless spaces at the end that will create problem while matching.

variables %in% names(gagnasafn_ib_2018)

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [25]  TRUE  TRUE  TRUE  TRUE

# We assume: fjbilast = fjbilsk

gagnasafn_ib_2018<-dplyr::rename(gagnasafn_ib_2018, fjbilast = fjbilsk)

data<-gagnasafn_ib_2018 %>% dplyr::select(all_of(variables))



data<-dplyr::select(data, -fjgeym, -ummal, -rfastnum, -ntm2, - ibteg)

I discard real estates outside the capital area and I also create a new “year” and “month” variable.

data.rkv<-data %>% filter(svfn<2000)
data.rkv$svfn<-data.rkv$svfn %>% as.factor()

data.rkv$year<-substr(data.rkv$kdagur,7,10) %>% as.integer()

data.rkv$month<-substr(data.rkv$kdagur,4,5) %>% factor()
attributes(data.rkv$month)

## $levels
##  [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12"
## 
## $class
## [1] "factor"

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
levels(data.rkv$month)<-month_levels # rename the levels

dplyr::select(data.rkv, kdagur, year, month) %>% sample_n(6)

##       kdagur year month
## 1 01.01.2015 2015   Jan
## 2 04.10.2013 2013   Oct
## 3 05.05.2017 2017   May
## 4 11.12.2013 2013   Dec
## 5 03.06.2015 2015   Jun
## 6 24.09.2015 2015   Sep

# just to check that I have not messed up the date/month/year

data.rkv<-dplyr::select(data.rkv, -kdagur) # no need to keep kdagur at this point

We now check whether the variables are expressed with the appropriate class and change it if needed.

str(data.rkv) # to check if there is any variable that we need to change the class

## 'data.frame':    35131 obs. of  24 variables:
##  $ nuvirdi        : int  16400 27871 34592 24243 47331 11550 25175 25253 21181 25694 ...
##  $ teg_eign       : Factor w/ 12 levels "Einbýlishús",..: 6 6 6 6 1 6 6 6 6 6 ...
##  $ svfn           : Factor w/ 7 levels "0","1000","1100",..: 1 1 4 5 1 1 2 4 5 2 ...
##  $ byggar         : Factor w/ 149 levels "(null)","1787",..: 104 110 135 122 127 75 119 137 89 134 ...
##  $ efstah         : int  4 10 3 3 2 2 3 6 2 9 ...
##  $ fjmib          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ lyfta          : int  0 1 0 0 0 0 0 3 0 2 ...
##  $ ibm2           : num  57.7 72.8 116.8 96 205.4 ...
##  $ fjhaed         : int  1 1 1 1 3 1 1 1 2 1 ...
##  $ fjbilast       : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ fjbkar         : int  1 1 1 1 1 0 1 0 0 1 ...
##  $ fjsturt        : int  0 0 1 0 1 1 0 1 1 0 ...
##  $ fjklos         : int  1 1 1 1 2 1 1 2 1 1 ...
##  $ fjeld          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ fjherb         : int  1 2 3 3 5 1 2 1 3 3 ...
##  $ fjstof         : int  1 1 1 1 2 1 1 1 2 1 ...
##  $ stig10         : num  10 10 10 10 9.9 10 10 10 10 10 ...
##  $ bilskurm2      : Factor w/ 642 levels "-70","(null)",..: 3 3 3 3 564 3 134 3 3 3 ...
##  $ svalm2         : Factor w/ 723 levels "-0,2","(null)",..: 3 617 664 3 3 3 549 3 480 40 ...
##  $ geymm2         : Factor w/ 769 levels "(null)","0","0,1",..: 55 2 2 27 2 14 656 752 2 747 ...
##  $ matssvaedi     : int  161 72 700 620 290 100 330 510 680 340 ...
##  $ undirmatssvaedi: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ year           : int  2015 2015 2015 2014 2015 2012 2012 2012 2012 2012 ...
##  $ month          : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 2 11 12 1 12 1 1 1 1 ...

data.rkv$byggar<-data.rkv$byggar %>% as.character() %>% as.integer()


data.rkv$bilskurm2<-data.rkv$bilskurm2 %>% as.numeric()

data.rkv$svalm2<-data.rkv$svalm2 %>% as.numeric()
data.rkv$geymm2<-data.rkv$geymm2 %>% as.numeric()

data.rkv$matssvaedi<-data.rkv$matssvaedi %>% factor()

Now we are going to check the range of the variables.

We decide to focus on real estate only and discard hotels and workshops (“Hótelstarfsemi”, “Vinnustofa”) as they may obey to different market logic, and in order to narrow the focus of our investigation. It would be like to compare apples and oranges.

Herbergi and Séreign have been discarded as well since there are very few observations with such label and they are not well defined.

Fjölbýlishús, Íbúðareign and Íbúðarhús have been collapsed under level Íbúðareign since they all describe the same type of house.

The big majority of the observations have 0 as undirmatssvaedi (sub location of property) which does not indicate any specific sublocation. In our opinion this invalidate the whole variable and it is a valid reason to exclude it from the study.

data.rkv %>% summary()

##     nuvirdi                 teg_eign       svfn           byggar    
##  Min.   :  5900   Íbúðareign    :28776   0   :19957   Min.   :1841  
##  1st Qu.: 25200   Einbýlishús   : 3182   1000: 6090   1st Qu.:1963  
##  Median : 33152   Raðhús        : 2253   1100:  645   Median :1983  
##  Mean   : 36889   Parhús        :  780   1300: 2483   Mean   :1981  
##  3rd Qu.: 43969   Ósamþykkt íbúð:   66   1400: 4778   3rd Qu.:2003  
##  Max.   :956146   Íbúðarhús     :   47   1604: 1173   Max.   :2018  
##                   (Other)       :   27   1606:    5   NA's   :10    
##      efstah           fjmib           lyfta            ibm2           fjhaed   
##  Min.   : 1.000   Min.   :1.000   Min.   :0.000   Min.   : 17.7   Min.   :1.0  
##  1st Qu.: 2.000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.: 74.7   1st Qu.:1.0  
##  Median : 3.000   Median :1.000   Median :0.000   Median : 95.4   Median :1.0  
##  Mean   : 3.401   Mean   :1.006   Mean   :0.433   Mean   :105.1   Mean   :1.2  
##  3rd Qu.: 4.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:122.0   3rd Qu.:1.0  
##  Max.   :13.000   Max.   :4.000   Max.   :8.000   Max.   :420.1   Max.   :4.0  
##                                                                                
##     fjbilast          fjbkar          fjsturt           fjklos     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1.000  
##  Median :0.0000   Median :1.0000   Median :1.0000   Median :1.000  
##  Mean   :0.2052   Mean   :0.7711   Mean   :0.5569   Mean   :1.197  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.000  
##  Max.   :8.0000   Max.   :3.0000   Max.   :4.0000   Max.   :6.000  
##                                                                    
##      fjeld           fjherb           fjstof          stig10      
##  Min.   :0.000   Min.   : 0.000   Min.   : 0.00   Min.   : 4.000  
##  1st Qu.:1.000   1st Qu.: 2.000   1st Qu.: 1.00   1st Qu.:10.000  
##  Median :1.000   Median : 2.000   Median : 1.00   Median :10.000  
##  Mean   :1.005   Mean   : 2.498   Mean   : 1.24   Mean   : 9.972  
##  3rd Qu.:1.000   3rd Qu.: 3.000   3rd Qu.: 1.00   3rd Qu.:10.000  
##  Max.   :4.000   Max.   :24.000   Max.   :14.00   Max.   :10.000  
##                                                                   
##    bilskurm2          svalm2        geymm2        matssvaedi   
##  Min.   :  1.00   Min.   :  1   Min.   :  1.0   100    : 2314  
##  1st Qu.:  3.00   1st Qu.:  3   1st Qu.:  2.0   600    : 2218  
##  Median :  3.00   Median :110   Median :124.0   320    : 1829  
##  Mean   : 48.83   Mean   :241   Mean   :266.2   90     : 1714  
##  3rd Qu.:  3.00   3rd Qu.:481   3rd Qu.:586.0   20     : 1669  
##  Max.   :638.00   Max.   :723   Max.   :768.0   130    : 1466  
##                                                 (Other):23921  
##  undirmatssvaedi        year          month      
##  Min.   :  0.000   Min.   :2012   Nov    : 3254  
##  1st Qu.:  0.000   1st Qu.:2013   Mar    : 3194  
##  Median :  0.000   Median :2015   Oct    : 3161  
##  Mean   :  5.688   Mean   :2015   May    : 3095  
##  3rd Qu.:  0.000   3rd Qu.:2016   Dec    : 3006  
##  Max.   :107.000   Max.   :2018   Jun    : 2944  
##                                   (Other):16477

data.rkv$teg_eign %>% summary()

##    Einbýlishús   Fjölbýlishús       Gistihús       Herbergi Hótelstarfsemi 
##           3182             19              0              1              3 
##     Íbúðareign      Íbúðarhús Ósamþykkt íbúð         Parhús         Raðhús 
##          28776             47             66            780           2253 
##        Séreign     Vinnustofa 
##              2              2

data.rkv$teg_eign %>% attributes()

## $levels
##  [1] "Einbýlishús"    "Fjölbýlishús"   "Gistihús"       "Herbergi"      
##  [5] "Hótelstarfsemi" "Íbúðareign"     "Íbúðarhús"      "Ósamþykkt íbúð"
##  [9] "Parhús"         "Raðhús"         "Séreign"        "Vinnustofa"    
## 
## $class
## [1] "factor"

data.rkv$teg_eign<-fct_collapse(data.rkv$teg_eign,
  Commercial = c("Hótelstarfsemi", "Vinnustofa"),
  NotDefined = c("Herbergi", "Séreign"),
  Íbúðareign = c("Fjölbýlishús", "Íbúðareign", "Íbúðarhús")
)

data.rkv<-subset(data.rkv, teg_eign != "Commercial" & teg_eign != "NotDefined")

data.rkv$teg_eign<-droplevels(data.rkv$teg_eign)
# remove levels not present in the dataset

data.rkv$undirmatssvaedi %>% factor() %>% summary()

##     0     1     2     3     4     6     7     8     9    10    11    12    13 
## 29671   283    66   199    68    32    19    18     6     6     5   218    91 
##    14    15    16    17    18    20    21    22    23    24    25    26    27 
##    51   126    65   232   331   116   189   348    72   146   130    76    32 
##    28    31    32    33    34    35    36    37    38    39    40    41    42 
##   206    21     7     9    10    10    10    34    10     4    33     2     5 
##    43    44    45    46    47    48    49    50    51    52    53    54    55 
##    11   247    59   231    73    54   157    15    79   103     2    17    17 
##    56    57    59    60    61    62    63    64    65   101   102   103   104 
##    34    14   101    78    19   198    29    11    17    25    11    15   189 
##   105   106   107 
##   190    39   131

# the big majority of the observations have 0 as undirmatssvaedi (sub location of property) which does not indicate any specific sublocation
data.rkv<-dplyr::select(data.rkv, - undirmatssvaedi)


levels(data.rkv$svfn)

## [1] "0"    "1000" "1100" "1300" "1400" "1604" "1606"

data.rkv$svfn<-fct_recode(data.rkv$svfn,
  Reykjavík = "0",
  Kópavogur = "1000",
  Seltjarnarnes = "1100",
  Garðabær = "1300",
  Hafnarfjörður = "1400",
  Mosfellsbær = "1604",
  Kjósarhreppur= "1606"
)

levels(data.rkv$svfn)

## [1] "Reykjavík"     "Kópavogur"     "Seltjarnarnes" "Garðabær"     
## [5] "Hafnarfjörður" "Mosfellsbær"   "Kjósarhreppur"

Exploratory Data Analysis

Correlation matrix

data.rkv %>%
  ggcorr(label=T, label_size = 2)

As expected, pric (nuvirdi) is positively correlated to the apartment area expressed in square meters (ibm2).

Ibm2 is correlated to the number of rooms and toilets.

Year also is positively correlated with nuvirdi (price) - perhaps unsurprisingly as there was economic growth in each year in 2012-2018 (growing may stimulate demand, though of course house build - a supply factor - could affect price the other direction).

Additional analysis with GGPLOT

ggplot(data.rkv) +
  geom_boxplot(mapping=aes(x=teg_eign, y=nuvirdi))

filter(data, nuvirdi==max(nuvirdi)) # suspicious observation, nuvirdi is extremely high.

##       kdagur nuvirdi   teg_eign svfn byggar efstah fjmib lyfta  ibm2 fjhaed
## 1 27.05.2016  956146 Íbúðareign 1604   2016      3     1     1 110.5      1
##   fjbilast fjbkar fjsturt fjklos fjeld fjherb fjstof stig10 bilskurm2 svalm2
## 1        1      0       1      1     1      2      1     10         0   19,5
##   geymm2 matssvaedi undirmatssvaedi
## 1    4,5        850               0

# This really looks like an error, we exclude the observation. Data instead of data.rkv so we can see the value for the columns excluded as well. 

data.rkv<-data.rkv[-which.max(data.rkv$nuvirdi),]

ggplot(data.rkv) +
  geom_boxplot(mapping=aes(x=teg_eign, y=nuvirdi)) + 
  scale_y_continuous(labels = comma, limits=c(0, 200000)) + 
  labs(x="type of real estate", y="nuvirdi = price")

ggplot(data.rkv) + 
  geom_histogram(aes(x=nuvirdi)) + 
  scale_x_continuous(labels = comma, limits=c(0, 100000)) +
  labs(x="nuvirdi = price")

# distribution of nuvirdi seems slightly skewed

ggplot(data.rkv) + 
  geom_histogram(aes(x=log(nuvirdi))) + 
  scale_x_continuous(labels = comma) +
  labs(x="nuvirdi = price")

# distribution of nuvirdi looks more normal now.

ggplot(data.rkv) + 
  geom_density(aes(x=nuvirdi , col=teg_eign)) + 
  scale_x_continuous(labels = comma,limits=c(0, 100000)) +
  labs(x="nuvirdi = price", col="type of real estate")

This plot above shows that íbuð (appartments) and especially óasmþyktt íbuð tend to be much cheaper than other types Parhús, Raðhús and Einbýlishús.

ggplot(data.rkv) + 
  geom_density(aes(x=log(nuvirdi) , col=teg_eign)) + 
  scale_x_continuous(labels = comma) +
  labs(x="nuvirdi = price", col="type of real estate")

ggplot(data.rkv, mapping=aes(x=byggar, y=nuvirdi)) +
  geom_point(alpha = 1/10, position = "jitter") +
  geom_smooth() +
  scale_x_continuous(name="building year",
                   breaks=seq(0,2019,10)) +
  scale_y_continuous(labels = comma)

Here we see that there is a relatively flat relationship between price and building year.

data.rkv %>% ggplot() +
  geom_boxplot(mapping=aes(x=as.factor(year), y=nuvirdi)) + 
  labs(x="year", y="nuvirdi = price") +
  scale_y_continuous(labels = comma, limits=c(0, 250000))

data.rkv %>% ggplot() +
  geom_boxplot(mapping=aes(x=month, y=nuvirdi)) + 
  labs(y="nuvirdi = price") +
  scale_y_continuous(labels = comma, limits=c(0, 250000))

These chart suggest that “year” of purchase may be important determinant of price, though “month” less so.

data.rkv %>% group_by(year,teg_eign) %>%
  dplyr::summarise(avg_price=mean(nuvirdi)) %>% 
  ggplot() +
  geom_line(mapping=aes(x=year, y=avg_price, col=teg_eign)) +
  labs(col="type of real estate") +
  scale_x_continuous(breaks=seq(2000,2019,1))

This suggests that for the most part, prices may increase by year, but that in the final year the increase is less substantial, indeed prices plateau for einbýlishús in 2018 and a decrease for ósamþykkt íbúð in 2018.

ggplot(data.rkv,mapping = aes(x=ibm2,y=nuvirdi)) +
  geom_point(col="light green") +
  geom_smooth() + 
  scale_y_continuous(labels = comma) + 
  labs(x="property area [m^2]",col="type of real estate") + 
    facet_wrap(~teg_eign)

Here we see that ósamþykkt íbúð tend to be much smaller and cheaper, and that locations Hafnarfjórður and Mosfellsbær generally look cheaper per square meter than other areas.

ggplot(data.rkv,mapping = aes(x=ibm2,y=nuvirdi)) +
  geom_point(col="light green") +
  geom_smooth() + 
  scale_y_continuous(labels = comma) + 
  labs(x="property area [m^2]",col="type of real estate") + 
    facet_wrap(~svfn)

data.rkv %>% ggplot() +
  geom_boxplot(mapping=aes(x=svfn, y=nuvirdi)) + 
  labs(x="City ID", y="nuvirdi = price") +
  scale_y_continuous(labels = comma, limits=c(0, 250000))

The first box plot above shows the difference in prices in cities, while the second shows differences by neighbourhood matsvæði.

This shows that not all the neighbors are alike. This is reinforced by the plot below which compares the average price per square meter between neighbors, showing that some areas have a greater premium than other.

data.rkv %>% ggplot() +
  geom_boxplot(mapping=aes(x=matssvaedi, y=nuvirdi)) + 
  labs(x="Neighbor", y="nuvirdi = price") +
  scale_y_continuous(labels = comma, limits=c(0, 250000)) +
  theme(axis.text.x = element_text(angle = 90))

However, it does not seem to be convenient to keep all the levels of matssvaedi - it makes the read out of results very lengthy and cumbersome for the reader. For that reason, we create a new variable that bunches neighbourhoods into fourteen proximate groupings.

The plot here below compares the average price per square meter between neighborhoods. Intuitively we can say that some areas are more premium than other.

data.rkv %>% mutate(m2price=nuvirdi/ibm2) %>% group_by(matssvaedi, svfn) %>%
  dplyr::summarise(avg_m2price=mean(m2price)) %>% 
  ggplot() +
  geom_col(aes(x=matssvaedi, y=avg_m2price, fill=svfn)) +
  labs(x="Neighbor", y="Avg price per m2", fill="City") +
  scale_y_continuous(labels = comma) +
  theme(axis.text.x = element_text(angle = 90))

newneighborhoods<-read_excel("NewNeighbourhoods.xlsx") %>% data.frame()


newneighborhoods$Matsv<-newneighborhoods$Matsv %>% as.factor()

data.rkv<-left_join(data.rkv,newneighborhoods, by = c("matssvaedi"="Matsv"))
  
data.rkv %>% dplyr::select(svfn,matssvaedi,City,New.group.Number, New.Group.name) %>% sample_n(10) # run it few time just to check that the match was done correctly.

##             svfn matssvaedi          City New.group.Number
## 1  Hafnarfjörður        600 Hafnarfjorður               12
## 2      Reykjavík        100     Reykjavik                2
## 3      Reykjavík        200     Reykjavik                6
## 4       Garðabær        550      Garðabær               10
## 5      Reykjavík         80     Reykjavik                2
## 6      Reykjavík         80     Reykjavik                2
## 7      Reykjavík        130     Reykjavik                3
## 8  Hafnarfjörður        620 Hafnarfjorður               12
## 9      Reykjavík        150     Reykjavik                4
## 10     Reykjavík         11     Reykjavik                1
##                  New.Group.name
## 1                 Hafnarfjorður
## 2         Reykjavik Near centre
## 3         Reykjavik City limits
## 4                      Garðabær
## 5         Reykjavik Near centre
## 6         Reykjavik Near centre
## 7  Reykjavik Grafarvogur and on
## 8                 Hafnarfjorður
## 9    Reykjavik Breiðholt and on
## 10            Reykjavik Central

# we can drop svfn, matssvaedi, city and New.group.Number and keep only New.Group.name


data.rkv<-data.rkv %>% dplyr::select(-svfn,-matssvaedi,-City,-New.group.Number,)
data.rkv$New.Group.name %>% class()

## [1] "character"

data.rkv$New.Group.name<-as.factor(data.rkv$New.Group.name)

levels(data.rkv$New.Group.name)

##  [1] "Garðabær"                     "Garðabær Peninsula"          
##  [3] "Hafnarfjorður"                "Kjosahreppur"                
##  [5] "Kopavogur Heart of"           "Kopavogur Inland"            
##  [7] "Mosfellsbær"                  "Reykjavik Breiðholt and on"  
##  [9] "Reykjavik Central"            "Reykjavik City limits"       
## [11] "Reykjavik Grafarvogur and on" "Reykjavik Near centre"       
## [13] "Reykjavik Near Esja"          "Seltjarnarnes"

data.rkv %>% mutate(m2price=nuvirdi/ibm2) %>% group_by(New.Group.name) %>%
  dplyr::summarise(avg_m2price=mean(m2price)) %>% 
  ggplot() +
  geom_col(aes(x=New.Group.name, y=avg_m2price, fill=New.Group.name)) +
  labs(x="New Neighborhoods", y="Avg m2 price") +
  scale_y_continuous(labels = comma) +
  theme(axis.text.x = element_text(angle = 90),legend.position = "none")

The exploratory data analysis could go on and on but, since we have decided to use tree based models, we decided that we can move to the next step.

Tree

data.rkv<-dplyr::rename(data.rkv, new.area = New.Group.name)
# I'am renaming it simpy because in the tree model it looks better.
data.rkv4tree<-na.omit(data.rkv)

set.seed(4444)
train.index<-sample(1:nrow(data.rkv4tree), nrow(data.rkv4tree)*0.8)
train.set<-data.rkv4tree[train.index,]
test.set<-data.rkv4tree[-train.index,]

tree.reg_rp<-rpart(nuvirdi~. , train.set, method = 'anova')

yhat<-predict(tree.reg_rp, newdata = test.set)

testMSE<-mean((yhat-test.set$nuvirdi)^2)
sqrt(testMSE) # most probably the accuracy using the tree model is lower than other regression model

## [1] 9786.866

# MSE is used as an estimator for sigma^2

rpart.plot(tree.reg_rp, faclen=2)

ibm2 appears on many nodes of the tree, this might indicate that it size of the house is an important indicator when predicting the sale price.

This is in line with the correlation matrix we plotted earlier.

Sales year and location of the real estate (area), is also another important variable when predicting sales price.

Random Forest

Bagging and random forest grow many trees, each tree is grown based on a bootstrapped data frame. We do this in order to reduce the variance of the basic tree model (high variance). So basically we grow n tree each tree gives me a prediction. In order to know the estimated prediction error, for each ith obs, we use all the predictions whose tree was grown without that ith obs. Bootstrapping uses replacement therefore it has been observed that on average bootstrapping sample 2/3 of the observation. This means that for each obs, we can use approx n tree/3 predictions. We average these n tree/3 and we end up having one prediction for each obs. The prediction is then compared to the response in the test data set. Now we can calculate the MSE.

do.trace = T prints the intermediate OOB MSE which help us to set the number of tree that it is sensible to grow.

OOB MSE: it is the predicted error estimated on the out of bag obs.

p<-ncol(data.rkv4tree)-1
m<-sqrt(p) %>% ceiling()
m.val<-c(m, m+1, m+2, m+3)

test.rmse<-rep(NA,length(m.val))

for(i in seq_along(m.val)){
rf.fit<-randomForest(x=train.set[,-1],
                     y=train.set[,1],
                     ntree = 150,
                     mtry = m.val[i],
                     do.trace = F,
                     importance = T)

test.pred = predict(rf.fit, newdata = test.set)
test.rmse[i] = mean((test.set$nuvirdi - test.pred)^2) %>% sqrt()
}

ggplot() + geom_col(mapping=aes(x=m.val,y=test.rmse)) + 
  labs(x="m value", y="RMSE")

best.m<-m.val[which.min(test.rmse)]

# The higher the ntree the lower the variance of the prediction

# rf2$mse: vector of mean square errors: sum of squared residuals divided by n

The plot here above shows how RMSE changes depending on m. is a number specifying the size of the random subset of variables considered at each split.

Apparently larger of the square root of the number of predictors return slightly better results.

varImpPlot(rf.fit, main = "Importance")

rmse.rf<-min(test.rmse)
rmse.rf

## [1] 5577.282

The graph created with varImpPlot() shows the importance of each variables. We get similar results as the simple tree grown earlier: year (year of sale), ibm2 (size), new.area (neighborhood) and byggar are the most important predictor for nuvirdi.

RandomForest is quite expensive computationally speaking, the package Ranger fit a model in a much faster way. This allows us to grow more trees.

Ranger

rf.ranger<-ranger(x=train.set[,-1],
                  y=train.set[,1],
                  num.trees = 5000,
                  mtry = m+3,
                  importance = "permutation",
                  verbose = F)


yhat.rng<-predict(rf.ranger, data = test.set)
mse.rf.rng<-mean((yhat.rng$predictions - test.set$nuvirdi)^2)
rmse.ranger<-mse.rf.rng %>% sqrt()
rmse.ranger

## [1] 5858.928

Let’s try to use RandomForest on a subset of variables only, the most important ones. Reducing the variable space should dramatically reduce the computational time, therefore we can increase the number of tree grown.

However this approach doesn’t seem to pay off. MSE is higher than the one we got using all the variables.

Randoom Forest with selected variables

train.set.snk<-dplyr::select(train.set, nuvirdi, year, ibm2, byggar,teg_eign, bilskurm2)

names(train.set.snk)

## [1] "nuvirdi"   "year"      "ibm2"      "byggar"    "teg_eign"  "bilskurm2"

dim(train.set.snk)

## [1] 28089     6

rf.fit.snk<-randomForest(x=train.set.snk[,-1],
                     y=train.set.snk[,1],
                     ntree=2000,
                     do.trace=F,
                     importance = T)

yhat.rf.snk<-predict(rf.fit.snk, newdata = test.set)
mse.rf.snk<-mean((yhat.rf.snk - test.set$nuvirdi)^2)
rmse.rf.snk<-mse.rf.snk %>% sqrt()
rmse.rf.snk

## [1] 8109.159

Boosting

Here with boosting we don’t grow every tree based on a bootstrapped data frame, instead we are growing trees sequentially, in the sense that we fit a tree based on the residuals from the previous tree. We have 3 main parameters to set when using boosting with tree models.

B: number of trees to be grown

d: split for each tree grown. d is often small (1,..5)

\(\lambda\): learning rate, which basically set the weight on the new residual prediction which will update the response prediction. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001, and the right choice can depend on the problem. Very small \(\lambda\) can require using a very large value of B in order to achieve good performance.

\[ f(x) ← \hat{f}(x) + λ\hat{f}^b(x)\]

Boosting with Caret

# BOOSTING WITH CARET with tuning
train.ctrl<-trainControl(method = "repeatedcv",
                         number = 5,
                         repeats = 3,
                         search="grid",
                         verboseIter = F)

myGrid<-expand.grid(n.trees = 150,
                    interaction.depth = c(1,3,5),
                    shrinkage = c(0.001, 0.05, 0.15, 0.3),
                    n.minobsinnode = 10)
set.seed(4444)
gbm.caret.tune<-train(nuvirdi~ .,
                      data = train.set,
                      method = "gbm",
                      #distribution = "gaussian",
                      #metric=metric,
                      trControl = train.ctrl,
                      verbose = F,
                      tuneGrid=myGrid)

print(gbm.caret.tune)

## Stochastic Gradient Boosting 
## 
## 28089 samples
##    21 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 22472, 22472, 22472, 22470, 22470, 22472, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  RMSE       Rsquared   MAE      
##   0.001      1                  16352.327  0.4422578  11688.169
##   0.001      3                  16051.196  0.6009087  11460.462
##   0.001      5                  15904.413  0.6527771  11340.201
##   0.050      1                   8827.950  0.7625360   5518.930
##   0.050      3                   7150.959  0.8336398   4382.220
##   0.050      5                   6666.950  0.8536838   4072.058
##   0.150      1                   7544.895  0.8124846   4503.224
##   0.150      3                   6373.239  0.8647891   3864.279
##   0.150      5                   6062.805  0.8775310   3652.632
##   0.300      1                   7406.800  0.8172655   4474.819
##   0.300      3                   6189.903  0.8721890   3758.934
##   0.300      5                   6076.789  0.8767428   3650.214
## 
## Tuning parameter 'n.trees' was held constant at a value of 150
## Tuning
##  parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  5, shrinkage = 0.15 and n.minobsinnode = 10.

plot(gbm.caret.tune)

From the plot we can see that boosting with 3 and 5 nodes yields similar results.

myGrid.best<-gbm.caret.tune$bestTune
myGrid.best

##   n.trees interaction.depth shrinkage n.minobsinnode
## 9     150                 5      0.15             10

set.seed(1)
gbm.caret.best<-train(nuvirdi~ .,
                      data=train.set,
                      method = "gbm",
                      distribution="gaussian",
                      trControl = train.ctrl,
                      verbose = F,
                      tuneGrid=myGrid.best)


test.pred<-predict(gbm.caret.best, newdata = test.set)
boostcaret.test.mse<-mean((test.set$nuvirdi - test.pred)^2)
rmse.boost.caret<-boostcaret.test.mse %>% sqrt()

rmse.boost.caret

## [1] 5836.498

summary(gbm.caret.best)

##                                                                       var
## ibm2                                                                 ibm2
## year                                                                 year
## byggar                                                             byggar
## new.areaReykjavik Central                       new.areaReykjavik Central
## fjklos                                                             fjklos
## bilskurm2                                                       bilskurm2
## fjbilast                                                         fjbilast
## geymm2                                                             geymm2
## new.areaReykjavik Breiðholt and on     new.areaReykjavik Breiðholt and on
## new.areaSeltjarnarnes                               new.areaSeltjarnarnes
## fjsturt                                                           fjsturt
## efstah                                                             efstah
## new.areaHafnarfjorður                               new.areaHafnarfjorður
## teg_eignÍbúðareign                                     teg_eignÍbúðareign
## new.areaReykjavik Near centre               new.areaReykjavik Near centre
## fjherb                                                             fjherb
## fjhaed                                                             fjhaed
## fjstof                                                             fjstof
## stig10                                                             stig10
## lyfta                                                               lyfta
## new.areaMosfellsbær                                   new.areaMosfellsbær
## svalm2                                                             svalm2
## new.areaReykjavik Near Esja                   new.areaReykjavik Near Esja
## fjmib                                                               fjmib
## new.areaReykjavik City limits               new.areaReykjavik City limits
## new.areaGarðabær Peninsula                     new.areaGarðabær Peninsula
## new.areaReykjavik Grafarvogur and on new.areaReykjavik Grafarvogur and on
## monthDec                                                         monthDec
## fjeld                                                               fjeld
## monthFeb                                                         monthFeb
## new.areaKopavogur Inland                         new.areaKopavogur Inland
## teg_eignÓsamþykkt íbúð                             teg_eignÓsamþykkt íbúð
## new.areaKopavogur Heart of                     new.areaKopavogur Heart of
## monthOct                                                         monthOct
## monthNov                                                         monthNov
## monthMar                                                         monthMar
## teg_eignRaðhús                                             teg_eignRaðhús
## fjbkar                                                             fjbkar
## monthSep                                                         monthSep
## teg_eignParhús                                             teg_eignParhús
## monthApr                                                         monthApr
## monthMay                                                         monthMay
## monthJun                                                         monthJun
## monthJul                                                         monthJul
## monthAug                                                         monthAug
## new.areaKjosahreppur                                 new.areaKjosahreppur
##                                          rel.inf
## ibm2                                 63.09222080
## year                                 15.09547925
## byggar                                4.41394645
## new.areaReykjavik Central             3.58738340
## fjklos                                2.64111396
## bilskurm2                             2.49439761
## fjbilast                              0.99756035
## geymm2                                0.92120740
## new.areaReykjavik Breiðholt and on    0.86523284
## new.areaSeltjarnarnes                 0.72579048
## fjsturt                               0.67401911
## efstah                                0.66662825
## new.areaHafnarfjorður                 0.66402937
## teg_eignÍbúðareign                    0.51578264
## new.areaReykjavik Near centre         0.34117072
## fjherb                                0.29027142
## fjhaed                                0.26722034
## fjstof                                0.19534434
## stig10                                0.16637417
## lyfta                                 0.16328227
## new.areaMosfellsbær                   0.14365699
## svalm2                                0.13221747
## new.areaReykjavik Near Esja           0.12041416
## fjmib                                 0.11149864
## new.areaReykjavik City limits         0.09660882
## new.areaGarðabær Peninsula            0.07314629
## new.areaReykjavik Grafarvogur and on  0.06542908
## monthDec                              0.06300707
## fjeld                                 0.05099986
## monthFeb                              0.05067215
## new.areaKopavogur Inland              0.04933845
## teg_eignÓsamþykkt íbúð                0.04658538
## new.areaKopavogur Heart of            0.04326940
## monthOct                              0.03498325
## monthNov                              0.03147157
## monthMar                              0.03017410
## teg_eignRaðhús                        0.02917988
## fjbkar                                0.02680086
## monthSep                              0.02209144
## teg_eignParhús                        0.00000000
## monthApr                              0.00000000
## monthMay                              0.00000000
## monthJun                              0.00000000
## monthJul                              0.00000000
## monthAug                              0.00000000
## new.areaKjosahreppur                  0.00000000

Comparing the models

tree_comp<-rbind(random_forest = rmse.rf,
           random_forest_ranger = rmse.ranger,
           random_forest_shrunk = rmse.rf.snk,
           boosting_caret = rmse.boost.caret) %>% data.frame()

colnames(tree_comp)<-"Estimated_prediction_error"
  
kable(tree_comp) %>% 
  kable_styling(bootstrap_options = "striped", full_width = T)

	Estimated_prediction_error
random_forest	5577.282
random_forest_ranger	5858.928
random_forest_shrunk	8109.159
boosting_caret	5836.498

Random Forest seems to be the most accurate model as it returns the lowest MSE/RMSE. It is worth noticing that ranger package, even though is not as accurate as the standard RandomForest function, ranks as the second most accurate method but it handles a “big” variable space in a much better way then the other algorithms.

Boosting might be better than random forest but needs extensive tuning, which corresponds to many possible model configurations.

We conclude that the best model found can predict sale price with a standard error equal to 5577.2815323 [k ISK].

Final project - Applied Data Analysis

Andrea

08/04/2020