Packages

Throughout our analysis, we make use of the following packages.

if (!require("SmartEDA")) install.packages("SmartEDA"); library("SmartEDA")
if (!require("corrplot")) install.packages("corrplot"); library("corrplot")
if (!require("dplyr")) install.packages("dplyr"); library("dplyr")
if (!require("tidyr")) install.packages("tidyr"); library("tidyr")
if (!require("knitr")) install.packages("knitr"); library("knitr")
if (!require("ggplot2")) install.packages("ggplot2"); library("ggplot2")
if (!require("leaflet")) install.packages("leaflet"); library("leaflet")
if (!require("shiny")) install.packages("shiny"); library("shiny")
if (!require("shinythemes")) install.packages("shinythemes"); library("shinythemes")
if (!require("stargazer")) install.packages("stargazer"); library("stargazer")
if (!require("rpart")) install.packages("rpart"); library("rpart")
if (!require("partykit")) install.packages("partykit"); library("partykit")
if (!require("ranger")) install.packages("ranger"); library("ranger")
if (!require("colorspace")) install.packages("colorspace"); library("colorspace")

1. Business Understanding

We defined our research question as the following: “How can we accurately predict the sales price of properties in Melbourne using available housing features? Which model provides the most accurate predictions?”

This is highly relevant and powerful information. To make things more tangible, we imagined ourselves as the founders of “InnovEstate Investors” - a real estate business located in Melbourne. We are not just an ordinary real estate business: our company employs statistical methods to gain a competitive advantage. But how do we do this?

We use historical data on properties sold in Melbourne. To predict the price, we train our models to derive a fair market value for properties entering the Melbourne real estate market, based on a given set of features about each property. How many rooms does the property have? What is its distance to the city center? Given these two and some further pieces of information (see below), we are able to give an estimate of a fair market price for the new property.

In our investment decisions, the estimates of the models are among the most relevant inputs. When a new property enters the market, we simply enter its features into our models and compare the predicted price to the sales price of the property. In this way, we can not only optimize our new purchases but also monitor our existing portfolio: we use the models to anticipate the price development of our properties and base selling decisions on that.

Let us have a look at the process that we employed to find appropriate models for this use case.

2. Data Understanding

During our search for data on the Melbourne housing market, we came across the data set “Melbourne Housing Snapshot”, available at https://www.kaggle.com/datasets/dansbecker/melbourne-housing-snapshot.

Here is a short description for each variable.

Suburb: The suburb where the property is located.
Address: The street address of the property.
Rooms: The number of rooms in the property.
Type: The type of property (e.g., house, townhouse, unit), distinguished by codes like ‘h’ for house and ‘u’ for unit.
Price: The sale price of the property in dollars.
Method: The sale method, with codes like ‘S’ for sold and ‘PI’ for passed in at auction, detailing how the property was sold.
SellerG: The real estate agent or group responsible for selling the property.
Date: The date on which the property was sold.
Distance: The property’s distance from the Central Business District (CBD) in kilometers.
Postcode: The postal code where the property is located.
Bedroom2: The number of bedrooms in the property, as scraped from a different source.
Bathroom: The number of bathrooms in the property.
Car: The number of car spaces available at the property.
Landsize: The size of the land on which the property is built, measured in square meters.
BuildingArea: The building size in square meters.
YearBuilt: The year in which the property was built.
CouncilArea: The local government area (council) that governs the property.
Lattitude: The geographical latitude of the property.
Longtitude: The geographical longitude of the property.
Regionname: The general region where the property is located, such as ‘Western Metropolitan’ or ‘Eastern Metropolitan’.

We upload the required data set about the Melbourne housing market.

housing <- read.csv("melb_data.csv", stringsAsFactors=TRUE)

 

We can now start to analyze our data set. In particular, we are interested in the data type of our variables.

dim(housing)
## [1] 13580    21
head(housing)
str(housing)
## 'data.frame':    13580 obs. of  21 variables:
##  $ Suburb       : Factor w/ 314 levels "Abbotsford","Aberfeldie",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Address      : Factor w/ 13378 levels "1 Adelle Ct",..: 12795 5944 9815 9005 10590 2196 2143 13336 11083 1091 ...
##  $ Rooms        : int  2 2 3 3 4 2 3 2 1 2 ...
##  $ Type         : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 1 1 1 3 1 ...
##  $ Price        : num  1480000 1035000 1465000 850000 1600000 ...
##  $ Method       : Factor w/ 5 levels "PI","S","SA",..: 2 2 4 1 5 2 2 2 2 2 ...
##  $ SellerG      : Factor w/ 268 levels "@Realty","Abercromby's",..: 24 24 24 24 165 114 165 165 24 24 ...
##  $ Date         : Factor w/ 58 levels "1/07/2017","10/09/2016",..: 46 48 49 49 50 53 53 57 57 57 ...
##  $ Distance     : num  2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ Postcode     : num  3067 3067 3067 3067 3067 ...
##  $ Bedroom2     : num  2 2 3 3 3 2 4 2 1 3 ...
##  $ Bathroom     : num  1 1 2 2 1 1 2 1 1 1 ...
##  $ Car          : num  1 0 0 1 2 0 0 2 1 2 ...
##  $ Landsize     : num  202 156 134 94 120 181 245 256 0 220 ...
##  $ BuildingArea : num  NA 79 150 NA 142 NA 210 107 NA 75 ...
##  $ YearBuilt    : num  NA 1900 1900 NA 2014 ...
##  $ CouncilArea  : Factor w/ 34 levels "","Banyule","Bayside",..: 33 33 33 33 33 33 33 33 33 33 ...
##  $ Lattitude    : num  -37.8 -37.8 -37.8 -37.8 -37.8 ...
##  $ Longtitude   : num  145 145 145 145 145 ...
##  $ Regionname   : Factor w/ 8 levels "Eastern Metropolitan",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Propertycount: num  4019 4019 4019 4019 4019 ...
summary(housing)
##             Suburb                  Address          Rooms        Type    
##  Reservoir     :  359   1/1 Clarendon St:    3   Min.   : 1.000   h:9449  
##  Richmond      :  260   13 Robinson St  :    3   1st Qu.: 2.000   t:1114  
##  Bentleigh East:  249   14 Arthur St    :    3   Median : 3.000   u:3017  
##  Preston       :  239   2 Bruce St      :    3   Mean   : 2.938           
##  Brunswick     :  222   28 Blair St     :    3   3rd Qu.: 3.000           
##  Essendon      :  220   36 Aberfeldie St:    3   Max.   :10.000           
##  (Other)       :12031   (Other)         :13562                            
##      Price         Method             SellerG             Date      
##  Min.   :  85000   PI:1564   Nelson       :1565   27/05/2017:  473  
##  1st Qu.: 650000   S :9022   Jellis       :1316   3/06/2017 :  395  
##  Median : 903000   SA:  92   hockingstuart:1167   12/08/2017:  387  
##  Mean   :1075684   SP:1703   Barry        :1011   17/06/2017:  374  
##  3rd Qu.:1330000   VB:1199   Ray          : 701   27/11/2016:  362  
##  Max.   :9000000             Marshall     : 659   29/07/2017:  341  
##                              (Other)      :7161   (Other)   :11248  
##     Distance        Postcode       Bedroom2         Bathroom    
##  Min.   : 0.00   Min.   :3000   Min.   : 0.000   Min.   :0.000  
##  1st Qu.: 6.10   1st Qu.:3044   1st Qu.: 2.000   1st Qu.:1.000  
##  Median : 9.20   Median :3084   Median : 3.000   Median :1.000  
##  Mean   :10.14   Mean   :3105   Mean   : 2.915   Mean   :1.534  
##  3rd Qu.:13.00   3rd Qu.:3148   3rd Qu.: 3.000   3rd Qu.:2.000  
##  Max.   :48.10   Max.   :3977   Max.   :20.000   Max.   :8.000  
##                                                                 
##       Car           Landsize         BuildingArea     YearBuilt   
##  Min.   : 0.00   Min.   :     0.0   Min.   :    0   Min.   :1196  
##  1st Qu.: 1.00   1st Qu.:   177.0   1st Qu.:   93   1st Qu.:1940  
##  Median : 2.00   Median :   440.0   Median :  126   Median :1970  
##  Mean   : 1.61   Mean   :   558.4   Mean   :  152   Mean   :1965  
##  3rd Qu.: 2.00   3rd Qu.:   651.0   3rd Qu.:  174   3rd Qu.:1999  
##  Max.   :10.00   Max.   :433014.0   Max.   :44515   Max.   :2018  
##  NA's   :62                         NA's   :6450    NA's   :5375  
##         CouncilArea     Lattitude        Longtitude   
##               :1369   Min.   :-38.18   Min.   :144.4  
##  Moreland     :1163   1st Qu.:-37.86   1st Qu.:144.9  
##  Boroondara   :1160   Median :-37.80   Median :145.0  
##  Moonee Valley: 997   Mean   :-37.81   Mean   :145.0  
##  Darebin      : 934   3rd Qu.:-37.76   3rd Qu.:145.1  
##  Glen Eira    : 848   Max.   :-37.41   Max.   :145.5  
##  (Other)      :7109                                   
##                       Regionname   Propertycount  
##  Southern Metropolitan     :4695   Min.   :  249  
##  Northern Metropolitan     :3890   1st Qu.: 4380  
##  Western Metropolitan      :2948   Median : 6555  
##  Eastern Metropolitan      :1471   Mean   : 7454  
##  South-Eastern Metropolitan: 450   3rd Qu.:10331  
##  Eastern Victoria          :  53   Max.   :21650  
##  (Other)                   :  73

 

2.1 Data Cleaning

We abbreviate Regionname, as shorter labels are beneficial for the visualizations that follow.

# Original region names
region_names <- c("Northern Metropolitan", "Western Metropolitan", 
                  "Southern Metropolitan", "Eastern Metropolitan", 
                  "South-Eastern Metropolitan", "Northern Victoria", 
                  "Eastern Victoria", "Western Victoria")

# Assign codes to each region
region_codes <- c("nM", "wM", "sM", "eM", "seM", "nV", "eV", "wV")

# Create a named vector for mapping
region_map <- setNames(region_codes, region_names)

# Replace the Regionname in the housing dataframe
housing$Regionname <- region_map[housing$Regionname]

# Show the unique values of Regionname after replacement
unique(housing$Regionname)
## [1] "sM"  "eV"  "nV"  "nM"  "seM" "wM"  "eM"  "wV"

 

The column Date was read in as a factor and can be transformed into a variable of type Date.

housing$Date <- as.Date(housing$Date, format = "%d/%m/%Y")
range(housing$Date)
## [1] "2016-01-28" "2017-09-23"
str(housing$Date)
##  Date[1:13580], format: "2016-12-03" "2016-02-04" "2017-03-04" "2017-03-04" "2016-06-04" ...

 

We change the postcodes into factors because postcodes are categorical, not numerical. This conversion ensures that the model treats them as distinct categories without any implied numerical relationships, preventing misinterpretation and facilitating accurate analysis.

housing$Postcode <- as.factor(housing$Postcode)

 

Next, we want to find out whether our data set contains missing values.

anyNA(housing)
## [1] TRUE
sum(is.na(housing))
## [1] 11887

 

Since we generally observe a large number of missing values, we would like to have a closer look at these NAs in order to find out how to deal with this issue in our subsequent analysis.

# First look with the SmartEDA package
ExpData(housing, type = 2)

# Percentage of observations with at least one NA
na_rows_count <- sum(!complete.cases(housing))

cat(round(na_rows_count / nrow(housing), 3) * 100,
    "% of observations have at least one NA. \n")
## 49.7 % of observations have at least one NA.

 

Looking at the housing data set also reveals some unexpected 0-values. We want to understand how many there are, and we continue our analysis of the NAs in the housing set.

# Number of NAs per variable
colSums(is.na(housing))
##        Suburb       Address         Rooms          Type         Price 
##             0             0             0             0             0 
##        Method       SellerG          Date      Distance      Postcode 
##             0             0             0             0             0 
##      Bedroom2      Bathroom           Car      Landsize  BuildingArea 
##             0             0            62             0          6450 
##     YearBuilt   CouncilArea     Lattitude    Longtitude    Regionname 
##          5375             0             0             0             0 
## Propertycount 
##             0
# Looking at the housing data, there are some unexpected 0-values
colSums(housing ==0)
##        Suburb       Address         Rooms          Type         Price 
##             0             0             0             0             0 
##        Method       SellerG          Date      Distance      Postcode 
##             0             0             0             6             0 
##      Bedroom2      Bathroom           Car      Landsize  BuildingArea 
##            16            34            NA          1939            NA 
##     YearBuilt   CouncilArea     Lattitude    Longtitude    Regionname 
##            NA             0             0             0             0 
## Propertycount 
##             0
length(which( (housing %>% select(-Car,-Bedroom2, -Distance ) == 0 ) == TRUE ))
## [1] 1990

For the variable Car the number of NAs is rather low and negligible. In this case, we could just drop the NAs. However, for the variables BuildingArea and YearBuilt we observe a significant number of NAs. Is there some kind of pattern?

Before dropping the NAs, we have to consider whether the entries are missing at random, or whether another solution, for example substituting the mean value, would make more sense for the goal of the project.

We see a significant number of unexpected 0-values that correspond to no value registered for certain entries. At this step, we disregard 0-values for the Car variable, as a 0 simply indicates that there are no parking spaces available. The same holds for Bedroom2 and Distance, where we can expect properties with no separate bedrooms (like a studio flat) or properties situated right in the CBD.

 

NAs Pattern

To investigate whether there is some pattern to the NAs, we start with a first look at the distribution of the central variable of our project, Price, comparing observations with NAs to observations where all entries are present.

df_nas <- housing %>%
  select(Price, BuildingArea, YearBuilt)

# Identify rows where BuildingArea or YearBuilt is NA
nas_rows <- is.na(df_nas$BuildingArea) | is.na(df_nas$YearBuilt)

# Extract prices for those rows
price_nas <- df_nas$Price[nas_rows]

# Extract indices of NA rows
n <- which(nas_rows)

# Prices for rows without NAs
price_not_nas <- df_nas$Price[-n]


hist(df_nas$Price, main="Price distribution",
     xlab="Price",freq= FALSE, col= "blue")
abline(v=mean(df_nas$Price), col="red")

par(mfrow= c(1,2))

hist(price_not_nas, main="Price distribution (not nas)",
     xlab="Price",freq= FALSE, col= "lightblue")
abline(v=mean(price_not_nas), col="red")
hist(price_nas, main="Price distribution (nas)",
     xlab="Price",freq= FALSE, col= "lightblue", xlim = c(0, 8*10^6))
abline(v=mean(price_nas), col="red")

# Closer look at the distribution
summary(price_not_nas)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  131000  630000  892250 1078946 1335000 9000000
summary(price_nas)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   85000  665000  916000 1072356 1325000 5800000

We can see that the two distributions look very similar.
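If desired, this visual impression could be complemented by a formal comparison of the two price distributions, for instance a two-sample Kolmogorov-Smirnov test. A minimal sketch using the price_nas and price_not_nas vectors created above:

# Optional formal comparison of the two price distributions; note that with
# samples this large, even small differences may be flagged as significant.
ks.test(price_nas, price_not_nas)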

 

We can proceed to visually compare the distribution of the variables within the cleaned data set - the one where we removed all observations containing NAs - to that of the variables within the original, uncleaned data set.

par(mfrow = c(5,3))

unclean_housing <- housing
housing <- na.omit(unclean_housing)
na_housing <- unclean_housing[rowSums(is.na(unclean_housing)) > 0, ]


for (i in 1:ncol(housing)) {
  if (is.numeric(housing[, i]) && is.numeric(na_housing[, i])) {
    boxplot(list(Clean=housing[, i], Unclean=na_housing[, i]),
            main=paste("Boxplot for Column", colnames(housing)[i]),
            col= c("lightblue", "orange"),
            outline = FALSE)
    legend("topright", legend=c("Clean", "Unclean"), fill=c("lightblue",
                                                            "orange"))
  }
}

We observe that the distributions of the variables are very similar when we compare complete rows to rows containing at least one NA. In the boxplots, we excluded outliers to better show that the distributions are relatively robust. Therefore, we proceed with the clean data only, as it still fulfills our size requirements.
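A quick size check supports this (a minimal sketch; given the na.omit() step above, this should report the 6813 complete observations also visible in the str() output further below):

# Number of complete observations remaining after na.omit()
nrow(housing)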

anyNA(housing)
## [1] FALSE

 

Are there still unexpected 0-values?

# Looking at the housing data, there are still some unexpected 0-values
length(which( (housing %>% select(-Car) == 0 ) == TRUE ))
## [1] 1041
colSums(housing %>% select(-Car) == 0)
##        Suburb       Address         Rooms          Type         Price 
##             0             0             0             0             0 
##        Method       SellerG          Date      Distance      Postcode 
##             0             0             0             4             0 
##      Bedroom2      Bathroom      Landsize  BuildingArea     YearBuilt 
##             5             0          1015            17             0 
##   CouncilArea     Lattitude    Longtitude    Regionname Propertycount 
##             0             0             0             0             0
# We drop rows where BuildingArea is 0: there is no reason for it to be 0,
# and only a small number of observations are affected

housing<- housing %>% 
  filter(BuildingArea !=0)

The 0-values in the variables Distance and Bedroom2 seem plausible, e.g. if we consider one-room apartments and properties located directly in the CBD.

 

Do 0-values for Landsize make sense? Since no further information is given in the metadata, we assume that Landsize refers to the land around the building. For some properties, there is no private or shared land. We therefore expect properties of type unit to have the most 0-values. Let’s look at the different property types with 0-values for the variable Landsize and check our assumption.

#Closer look at land size 0s value per type 
land_size <- housing %>%
  select(Type, Landsize) %>%
  filter(Landsize == 0) %>%
  group_by(Type) %>%
  summarize(n=n()) %>%
  arrange(desc(n))

land_size 

The metadata states “h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse”, so we expected units to have the most 0-values in Landsize. The output supports our assumption, and we therefore consider 0-values for the variable Landsize plausible.

 

We are now presented with a cleaned final version of our data set that can be used as an adequate representation of the Melbourne Housing Market.

str(housing)
## 'data.frame':    6813 obs. of  21 variables:
##  $ Suburb       : Factor w/ 314 levels "Abbotsford","Aberfeldie",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Address      : Factor w/ 13378 levels "1 Adelle Ct",..: 5944 9815 10590 2143 13336 1091 9030 2126 3288 9212 ...
##  $ Rooms        : int  2 3 4 3 2 2 3 2 2 3 ...
##  $ Type         : Factor w/ 3 levels "h","t","u": 1 1 1 1 1 1 1 3 1 1 ...
##  $ Price        : num  1035000 1465000 1600000 1876000 1636000 ...
##  $ Method       : Factor w/ 5 levels "PI","S","SA",..: 2 4 5 2 2 2 5 2 2 2 ...
##  $ SellerG      : Factor w/ 268 levels "@Realty","Abercromby's",..: 24 24 165 165 165 24 165 24 114 114 ...
##  $ Date         : Date, format: "2016-02-04" "2017-03-04" ...
##  $ Distance     : num  2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 2.5 ...
##  $ Postcode     : Factor w/ 198 levels "3000","3002",..: 54 54 54 54 54 54 54 54 54 54 ...
##  $ Bedroom2     : num  2 3 3 4 2 3 3 2 2 3 ...
##  $ Bathroom     : num  1 2 1 2 1 1 2 2 1 2 ...
##  $ Car          : num  0 0 2 0 2 2 2 1 2 1 ...
##  $ Landsize     : num  156 134 120 245 256 220 214 0 238 113 ...
##  $ BuildingArea : num  79 150 142 210 107 75 190 94 97 110 ...
##  $ YearBuilt    : num  1900 1900 2014 1910 1890 ...
##  $ CouncilArea  : Factor w/ 34 levels "","Banyule","Bayside",..: 33 33 33 33 33 33 33 33 33 33 ...
##  $ Lattitude    : num  -37.8 -37.8 -37.8 -37.8 -37.8 ...
##  $ Longtitude   : num  145 145 145 145 145 ...
##  $ Regionname   : chr  "sM" "sM" "sM" "sM" ...
##  $ Propertycount: num  4019 4019 4019 4019 4019 ...
##  - attr(*, "na.action")= 'omit' Named int [1:6750] 1 4 6 9 11 14 15 19 22 23 ...
##   ..- attr(*, "names")= chr [1:6750] "1" "4" "6" "9" ...

 

2.2 Data Visualization

We proceed by creating further visualizations to better understand the housing data set.

Let’s start by having a closer look at the number of distinct values per variable. This will play a crucial role later.

ExpData(housing, type = 2) %>%
  select(Variable_Name, Variable_Type, No_of_distinct_values)

Type

par(mfrow=c(1,2))

barplot(table(housing$Type), col= c(4,2,5), main="Type")
legend("topright", legend=c("h", "t", "u"), fill=c(4,2,5))

boxplot(housing$Price ~ housing$Type,
        col=c(4,2,5), xlab="Type", ylab="Price")
legend("topright", legend=c("h", "t", "u"), fill=c(4,2,5))

We can see the different number of properties per Type and the different relative price distribution.

Region Name

par(mfrow=c(1,2))


barplot(table(housing$Regionname), col=rainbow(20)[13:20], 
        main="Distribution of Housing by Region", 
        ylab="Count", xlab="Region Code",
        las=2)


boxplot(housing$Price ~ housing$Regionname,
        col=rainbow(25)[13:20], 
        xlab="Region Code", ylab="Price",
        main="Housing Prices by Region",
        las=2)

We see that there is a considerable difference in the number of properties per Regionname. Moreover, even when considering only the two regions with the highest number of properties, we can see how the price distribution changes.

Building Area

par(mfrow=c(1,2))

hist(housing$BuildingArea, main="Building area distribution",
     xlab="Area",freq= FALSE, col= "lightblue", breaks = 30)
abline(v=mean(housing$BuildingArea), col="blue")

plot(housing$BuildingArea,housing$Price, col="darkred",
     xlab="Area", ylab="Price",
     main="Scatter plot (Price~Area)")


We see that with increasing BuildingArea the price tends to increase.

 

Spatial visualization

Using leaflet, we create an interactive spatial visualization (since we have longitude and latitude) and position each property on a map.

More information on the package can be found at https://r-charts.com/spatial/interactive-maps-leaflet/.

# Create a df for spatial visualization
spatial_housing<- housing %>%
  select(Price, Type, Lattitude, Longtitude)

# Visualize 
leaflet() %>%
  addTiles() %>%
  setView(lng = 144, lat = -38, zoom = 8) %>%
   addMarkers(data = data.frame(lng = spatial_housing$Longtitude, 
                                lat = spatial_housing$Lattitude ),
              # Add clustering to better handle large number of data and improve
              # visualization 
              clusterOptions = markerClusterOptions()) 

(Click on the map)

 

We now create a simple interactive dashboard with the help of shiny. This is closely connected to our hypothetical business case: non-technical people will have to consult the data, and a simple dashboard allows them to filter it easily and access the visualizations.

Our dashboard allows users to filter for a certain price range and to visualize the location and type distribution within that price range.

#Define UI
ui <- fluidPage(
  theme=shinytheme("flatly"),
  titlePanel("Interactive  Dashboard"),
  sidebarLayout(
    
    sidebarPanel(
      
      #Add a slider input for price 
      sliderInput("price_range", "Price Range:",
                  min = min(spatial_housing$Price), 
                  max = max(spatial_housing$Price),
                  value = c(min(spatial_housing$Price), max(spatial_housing$Price)),
                  step = 10000)
    ),
    
    mainPanel(
      leafletOutput("map"),
       plotOutput("barplot"))
    
  )
)

#Define server 
server <- function(input, output) {
  output$map <- renderLeaflet({
    
    #Filter data 
    filtered_data <- spatial_housing[
      spatial_housing$Price >= input$price_range[1] &
      spatial_housing$Price <= input$price_range[2],
    ]
    # Previously created map, now with the filtered data
    leaflet() %>%
      addTiles() %>%
      setView(lng = 144, lat = -38, zoom = 6) %>%
 addMarkers(data = data.frame(lng = filtered_data$Longtitude, lat = filtered_data$Lattitude ),clusterOptions = markerClusterOptions())
  })
  
    #Add a bar plot with the type
  output$barplot <- renderPlot({
    
        #Filter data 
    filtered_data <- spatial_housing[
      spatial_housing$Price >= input$price_range[1] &
      spatial_housing$Price <= input$price_range[2],
    ]
    
    # Bar plot with ggplot
    ggplot(filtered_data, aes(x = Type, fill = Type)) +
      geom_bar() +
      geom_text(aes(label = stat(count)), stat = 'count', size = 8,
                position = position_stack(0.5)) +
      labs(title = "", x = "Type", y = "Count") +
      scale_fill_manual(values = c("t" = 4, "h" = 2, "u" = 5),
                        name = "Legend",
                        labels = c("h" = "House", "t" = "Townhouse", "u" = "Unit")) +
      theme(axis.text.x = element_text(size = 15),
            plot.title = element_text(size = 30))
    
  })
  
}

#Run the application to access the dashboard and host it locally 
shinyApp(ui,server)
Shiny applications not supported in static R Markdown documents

Run the code in the Rmd file for hosting the dashboard locally and access it through the browser.

3. Modeling

Since the aim of our analysis is to predict prices, we are faced with a regression problem. To find an approach that accurately models prices, we fit three different models to our data: a multiple linear regression model, a regression tree and a random forest. After fitting our models, we assess their performance on unseen data via cross-validation, i.e. by repeatedly splitting the available data into training and test sets. This helps us select the model that we then apply to answer our business question in a final step.

 

Preparation

Before fitting our models, we have a closer look at the variables of type factor. Since, in a linear regression model, categorical variables are represented by an individual dummy variable for each level, variables with too many levels are not useful.

ExpData(housing,type=2) %>% filter(Variable_Type == "factor") %>%
  select(Variable_Name, Variable_Type, No_of_distinct_values)
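As a brief illustration of this dummy encoding, consider how lm() expands the Type variable internally via model.matrix() (a minimal sketch):

# Each non-baseline factor level becomes its own 0/1 dummy column; a factor
# like Suburb with 314 levels would therefore add 313 columns to the design matrix.
head(model.matrix(~ Type, data = housing))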

The variable Address is unique to each observation and therefore should be excluded from the prediction. Furthermore, the variables Suburb, SellerG, Postcode and CouncilArea seem to have too many levels to be useful predictors in a regression. We therefore create a subset of our data without these variables.

Additionally, we believe that CouncilArea, Postcode and Suburb are already reflected in the variable Regionname, which has fewer unique values than the three aforementioned variables. With the same reasoning in mind, we also decide to drop the variables Longtitude and Lattitude.

 

regression_subset <- subset(housing, select = -c(Address, SellerG, Suburb, CouncilArea, Lattitude,
                                                 Longtitude, Postcode))

 

3.1 Model 1: Multiple Linear Regression

We can now fit a linear regression model to our reduced data set.

model1 <- lm(Price~., data=regression_subset)
summary(model1)
## 
## Call:
## lm(formula = Price ~ ., data = regression_subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4318088  -213164   -36329   157782  8095480 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.348e+06  6.256e+05   5.352 8.99e-08 ***
## Rooms          1.281e+05  1.765e+04   7.254 4.50e-13 ***
## Typet         -1.701e+05  1.920e+04  -8.858  < 2e-16 ***
## Typeu         -3.505e+05  1.653e+04 -21.208  < 2e-16 ***
## MethodS        7.280e+04  1.543e+04   4.717 2.44e-06 ***
## MethodSA      -3.166e+04  6.312e+04  -0.502 0.615971    
## MethodSP       4.342e+04  1.933e+04   2.247 0.024690 *  
## MethodVB       1.779e+04  2.166e+04   0.821 0.411451    
## Date           2.035e+02  3.215e+01   6.328 2.64e-10 ***
## Distance      -4.180e+04  1.243e+03 -33.617  < 2e-16 ***
## Bedroom2      -5.546e+03  1.746e+04  -0.318 0.750757    
## Bathroom       1.979e+05  9.646e+03  20.516  < 2e-16 ***
## Car            5.316e+04  5.915e+03   8.988  < 2e-16 ***
## Landsize       1.824e+01  5.446e+00   3.349 0.000816 ***
## BuildingArea   1.613e+03  7.235e+01  22.288  < 2e-16 ***
## YearBuilt     -3.086e+03  1.630e+02 -18.932  < 2e-16 ***
## RegionnameeV  -4.847e+05  8.449e+04  -5.737 1.00e-08 ***
## RegionnamenM  -2.128e+05  8.390e+04  -2.536 0.011236 *  
## RegionnamenV   2.730e+04  8.466e+04   0.323 0.747067    
## RegionnameseM  3.486e+04  8.464e+04   0.412 0.680444    
## RegionnamesM  -4.294e+05  8.527e+04  -5.035 4.90e-07 ***
## RegionnamewM   1.118e+05  1.099e+05   1.018 0.308686    
## RegionnamewV  -1.923e+05  1.199e+05  -1.604 0.108832    
## Propertycount -6.197e-02  1.206e+00  -0.051 0.959011    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 402500 on 6789 degrees of freedom
## Multiple R-squared:  0.643,  Adjusted R-squared:  0.6418 
## F-statistic: 531.7 on 23 and 6789 DF,  p-value: < 2.2e-16

 

Still, we observe that a rather large number of variables is used to predict the price and that a few coefficients are insignificant, i.e. their corresponding p-values exceed our chosen significance level of \(\alpha=0.05\). Despite the rather large number of variables that our model considers, our R-squared only amounts to 0.643. To improve our model, we perform a stepwise variable selection using the step() function. By doing so, we would like to ensure that our model only includes variables that contribute unique information.

 

model1_step <- step(model1, direction="backward")
## Start:  AIC=175872
## Price ~ Rooms + Type + Method + Date + Distance + Bedroom2 + 
##     Bathroom + Car + Landsize + BuildingArea + YearBuilt + Regionname + 
##     Propertycount
## 
##                 Df  Sum of Sq        RSS    AIC
## - Propertycount  1 4.2786e+08 1.0996e+15 175870
## - Bedroom2       1 1.6343e+10 1.0996e+15 175870
## <none>                        1.0996e+15 175872
## - Landsize       1 1.8162e+12 1.1014e+15 175881
## - Method         4 5.0043e+12 1.1046e+15 175895
## - Date           1 6.4864e+12 1.1061e+15 175910
## - Rooms          1 8.5226e+12 1.1081e+15 175923
## - Car            1 1.3084e+13 1.1127e+15 175951
## - YearBuilt      1 5.8054e+13 1.1577e+15 176221
## - Bathroom       1 6.8177e+13 1.1678e+15 176280
## - Type           2 7.3022e+13 1.1726e+15 176306
## - BuildingArea   1 8.0459e+13 1.1801e+15 176351
## - Distance       1 1.8304e+14 1.2826e+15 176919
## - Regionname     7 3.0070e+14 1.4003e+15 177505
## 
## Step:  AIC=175870
## Price ~ Rooms + Type + Method + Date + Distance + Bedroom2 + 
##     Bathroom + Car + Landsize + BuildingArea + YearBuilt + Regionname
## 
##                Df  Sum of Sq        RSS    AIC
## - Bedroom2      1 1.6293e+10 1.0996e+15 175868
## <none>                       1.0996e+15 175870
## - Landsize      1 1.8163e+12 1.1014e+15 175879
## - Method        4 5.0077e+12 1.1046e+15 175893
## - Date          1 6.4896e+12 1.1061e+15 175908
## - Rooms         1 8.5229e+12 1.1081e+15 175921
## - Car           1 1.3086e+13 1.1127e+15 175949
## - YearBuilt     1 5.8054e+13 1.1577e+15 176219
## - Bathroom      1 6.8177e+13 1.1678e+15 176278
## - Type          2 7.3311e+13 1.1729e+15 176306
## - BuildingArea  1 8.0461e+13 1.1801e+15 176349
## - Distance      1 1.8389e+14 1.2835e+15 176922
## - Regionname    7 3.0088e+14 1.4005e+15 177504
## 
## Step:  AIC=175868.1
## Price ~ Rooms + Type + Method + Date + Distance + Bathroom + 
##     Car + Landsize + BuildingArea + YearBuilt + Regionname
## 
##                Df  Sum of Sq        RSS    AIC
## <none>                       1.0996e+15 175868
## - Landsize      1 1.8189e+12 1.1014e+15 175877
## - Method        4 5.0126e+12 1.1046e+15 175891
## - Date          1 6.4907e+12 1.1061e+15 175906
## - Car           1 1.3073e+13 1.1127e+15 175947
## - Rooms         1 3.5863e+13 1.1355e+15 176085
## - YearBuilt     1 5.8039e+13 1.1577e+15 176217
## - Bathroom      1 6.9100e+13 1.1687e+15 176281
## - Type          2 7.3404e+13 1.1730e+15 176304
## - BuildingArea  1 8.0445e+13 1.1801e+15 176347
## - Distance      1 1.8420e+14 1.2838e+15 176921
## - Regionname    7 3.0115e+14 1.4008e+15 177503
summary(model1_step)
## 
## Call:
## lm(formula = Price ~ Rooms + Type + Method + Date + Distance + 
##     Bathroom + Car + Landsize + BuildingArea + YearBuilt + Regionname, 
##     data = regression_subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4318201  -213097   -36403   157946  8095245 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.365e+06  6.233e+05   5.398 6.98e-08 ***
## Rooms          1.231e+05  8.272e+03  14.882  < 2e-16 ***
## Typet         -1.699e+05  1.919e+04  -8.854  < 2e-16 ***
## Typeu         -3.503e+05  1.647e+04 -21.268  < 2e-16 ***
## MethodS        7.285e+04  1.542e+04   4.723 2.37e-06 ***
## MethodSA      -3.177e+04  6.309e+04  -0.504 0.614600    
## MethodSP       4.341e+04  1.931e+04   2.248 0.024610 *  
## MethodVB       1.783e+04  2.165e+04   0.824 0.410049    
## Date           2.024e+02  3.197e+01   6.331 2.59e-10 ***
## Distance      -4.182e+04  1.240e+03 -33.728  < 2e-16 ***
## Bathroom       1.975e+05  9.560e+03  20.658  < 2e-16 ***
## Car            5.306e+04  5.905e+03   8.985  < 2e-16 ***
## Landsize       1.825e+01  5.445e+00   3.352 0.000808 ***
## BuildingArea   1.612e+03  7.233e+01  22.289  < 2e-16 ***
## YearBuilt     -3.085e+03  1.630e+02 -18.932  < 2e-16 ***
## RegionnameeV  -4.851e+05  8.439e+04  -5.749 9.38e-09 ***
## RegionnamenM  -2.133e+05  8.382e+04  -2.544 0.010972 *  
## RegionnamenV   2.682e+04  8.445e+04   0.318 0.750805    
## RegionnameseM  3.457e+04  8.456e+04   0.409 0.682674    
## RegionnamesM  -4.301e+05  8.485e+04  -5.068 4.12e-07 ***
## RegionnamewM   1.116e+05  1.097e+05   1.017 0.309104    
## RegionnamewV  -1.926e+05  1.199e+05  -1.606 0.108334    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 402400 on 6791 degrees of freedom
## Multiple R-squared:  0.643,  Adjusted R-squared:  0.6419 
## F-statistic: 582.5 on 21 and 6791 DF,  p-value: < 2.2e-16

The step() function suggests the following predictor variables to be used in our linear regression: Rooms, Type, Method, Date, Distance, Bathroom, Car, Landsize, BuildingArea, YearBuilt and Regionname. By including only these variables, AIC is minimized.
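For reference, for a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\), the Akaike Information Criterion is \(AIC = 2k - 2\ln(\hat{L})\); lower values indicate a better trade-off between goodness of fit and model complexity, which is why backward selection drops a variable whenever doing so lowers the AIC.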

 

Linear regression is a parametric approach, i.e. it assumes a certain functional form. Before interpreting the coefficients of our model, we therefore perform a simple residual diagnostic to ensure that our underlying model assumptions are fulfilled. To do so, we plot our fitted values against our residuals and produce a normal qq-plot of the residuals.

par(mfrow=c(1,2))
plot(fitted(model1_step), residuals(model1_step), col="steelblue", cex=0.5,
     main="Residual Plot", ylab="Residuals")
abline(h=0, col="coral")
qqnorm(residuals(model1_step), col="steelblue", cex=0.8)
qqline(residuals(model1_step), col="coral")

Unfortunately, residuals exhibit a trumpet shape when plotted against their fitted values, suggesting heteroscedastic error terms. Our qq-plot furthermore suggests that our errors do not follow a normal distribution. We conclude that our model assumptions are not sufficiently fulfilled and that our variables need further transformation.

 

We therefore fit a log-level model of the form \(\log Y = \beta_0 + \beta_1 X_1 + \dots + \beta_j X_j + \epsilon\), where \(\mathbb{E}(\epsilon|X)=0\).

log_level <- lm(formula = log(Price) ~ Rooms + Type + Method + Date +Distance +
    Bathroom + Car + Landsize + BuildingArea + YearBuilt + Regionname, 
    data = regression_subset)
summary(log_level)
## 
## Call:
## lm(formula = log(Price) ~ Rooms + Type + Method + Date + Distance + 
##     Bathroom + Car + Landsize + BuildingArea + YearBuilt + Regionname, 
##     data = regression_subset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.86610 -0.16780 -0.00301  0.16452  2.35010 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.525e+01  4.196e-01  36.356  < 2e-16 ***
## Rooms          1.375e-01  5.568e-03  24.697  < 2e-16 ***
## Typet         -1.293e-01  1.292e-02 -10.011  < 2e-16 ***
## Typeu         -4.519e-01  1.109e-02 -40.764  < 2e-16 ***
## MethodS        9.830e-02  1.038e-02   9.468  < 2e-16 ***
## MethodSA       3.947e-02  4.247e-02   0.929  0.35275    
## MethodSP       5.743e-02  1.300e-02   4.418 1.01e-05 ***
## MethodVB      -4.409e-03  1.457e-02  -0.303  0.76221    
## Date           1.690e-04  2.152e-05   7.854 4.65e-15 ***
## Distance      -3.877e-02  8.345e-04 -46.461  < 2e-16 ***
## Bathroom       1.212e-01  6.435e-03  18.835  < 2e-16 ***
## Car            3.913e-02  3.975e-03   9.845  < 2e-16 ***
## Landsize       1.094e-05  3.665e-06   2.986  0.00284 ** 
## BuildingArea   9.917e-04  4.868e-05  20.370  < 2e-16 ***
## YearBuilt     -2.366e-03  1.097e-04 -21.568  < 2e-16 ***
## RegionnameeV  -3.560e-01  5.680e-02  -6.268 3.88e-10 ***
## RegionnamenM  -4.682e-02  5.642e-02  -0.830  0.40658    
## RegionnamenV   6.454e-02  5.684e-02   1.135  0.25621    
## RegionnameseM  1.415e-01  5.692e-02   2.486  0.01293 *  
## RegionnamesM  -3.128e-01  5.711e-02  -5.476 4.50e-08 ***
## RegionnamewM   2.175e-01  7.383e-02   2.946  0.00323 ** 
## RegionnamewV  -3.758e-01  8.072e-02  -4.655 3.29e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2708 on 6791 degrees of freedom
## Multiple R-squared:  0.751,  Adjusted R-squared:  0.7503 
## F-statistic: 975.6 on 21 and 6791 DF,  p-value: < 2.2e-16

With our new model specification, we were able to increase the adjusted \(R^2\) to 0.7503, i.e. our model is able to explain 75.03% of the variation in log(Price).

 

For our log-level model, we again perform a residual diagnostic.

par(mfrow=c(1,2))
plot(fitted(log_level), residuals(log_level), col="steelblue", cex=0.5,
     main="Residual Plot", ylab="Residuals")
abline(h=0, col="coral")
qqnorm(residuals(log_level), col="steelblue", cex=0.8)
qqline(residuals(log_level), col="coral")

We now observe that the residuals are not only randomly scattered around zero but also show roughly constant variance as the fitted values increase. Our qq-plot suggests that the residuals now come closer to a normal distribution. The two plots suggest that, with the log-level model, our assumptions of homoscedasticity, i.e. constant variance, and of errors with conditional mean zero are fulfilled. This allows us to assume unbiasedness, i.e. our estimates are on average equal to the true population parameters.
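If desired, this visual check could be complemented by a formal test for heteroscedasticity, such as the Breusch-Pagan test. A minimal sketch, assuming the lmtest package (not among the packages loaded above) is available:

# Breusch-Pagan test: under the null hypothesis, the errors are homoscedastic
if (!require("lmtest")) install.packages("lmtest"); library("lmtest")
bptest(log_level)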

 

By looking more closely at the estimates for the coefficients of our variables, we are able to assess which variables are considered to have a positive effect and which variables show a negative impact on price according to our linear regression model.

coefficients <- coef(log_level)

# Separate variables with positive and negative coefficients
positive_vars <- names(coefficients[coefficients > 0])
negative_vars <- names(coefficients[coefficients < 0])

positive_df <- data.frame( Coefficient = coefficients[positive_vars])
negative_df <- data.frame( Coefficient = coefficients[negative_vars])

 

All other factors held fixed, an increase in the following variables is associated with an increase in the average property price.

knitr::kable(positive_df, digits = 2, caption = "Positive Coefficients")
Positive Coefficients
Coefficient
(Intercept) 15.25
Rooms 0.14
MethodS 0.10
MethodSA 0.04
MethodSP 0.06
Date 0.00
Bathroom 0.12
Car 0.04
Landsize 0.00
BuildingArea 0.00
RegionnamenV 0.06
RegionnameseM 0.14
RegionnamewM 0.22

In contrast, with all other factors held fixed, increasing the following variables decreases our estimated price.

knitr::kable(negative_df, digits = 2, caption = "Negative Coefficients")
Negative Coefficients
Coefficient
Typet -0.13
Typeu -0.45
MethodVB 0.00
Distance -0.04
YearBuilt 0.00
RegionnameeV -0.36
RegionnamenM -0.05
RegionnamesM -0.31
RegionnamewV -0.38

 

Since our model now assumes a linear relationship between the independent variables and the log of the dependent variable, we have to adjust our interpretation of the coefficients accordingly. In particular, a one-unit increase in \(X_j\) increases log(Price) by \(\beta_j\) and Price by \((e^{\beta_j}-1)\cdot 100\%\). For example, if our property has one additional Room, we would expect the Price to increase by \((e^{0.14}-1)\cdot 100 \approx 15.027\%\).
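This back-transformation can also be computed directly from the fitted model (a small sketch using the log_level object from above):

# Approximate percentage effect on Price of one additional room,
# all other variables held fixed
(exp(coef(log_level)["Rooms"]) - 1) * 100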

In our opinion, the coefficients suggested by our linear regression model seem very intuitive. Increasing the number of Rooms, Bathrooms etc. is associated with a price increase, whereas prices generally decrease as the Distance to the Central Business District increases.

 

For our categorical variables, coefficients need to be interpreted in comparison to the baseline. We therefore have a quick glance at the coefficients for the different categories of Regionname and Type.
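The baseline is simply the first level R assigns to the variable (a short check; note that Regionname is a character column after our recoding, so lm() uses its alphabetically first value):

# Baseline categories used by lm() for the two categorical predictors
levels(regression_subset$Type)[1]             # "h" (house)
sort(unique(regression_subset$Regionname))[1] # "eM" (Eastern Metropolitan)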

barplot(coef(log_level)[16:22], col=rainbow_hcl(30)[10:16],
        cex.names=0.35,
        names=c("Eastern Victoria", "Northern Metropolitan", "Northern Victoria",
                "South-Eastern Metropolitan", "Southern Metropolitan",
                "Western Metropolitan", "Western Victoria"),
        las=2,
        main="Coefficients for Regionname as compared to the baseline 'Eastern Metropolitan'",
        cex.main=0.7)

We needed to introduce dummy variables for the regions as well as for the type of real estate. The output suggests that, compared to our baseline region Eastern Metropolitan, real estate objects in Northern Victoria, South-Eastern Metropolitan and Western Metropolitan can be sold at a higher price, although the effect for Northern Victoria is small and not statistically significant. We can also see that real estate objects in Eastern Victoria, Southern Metropolitan and Western Victoria are worth considerably less than those in Eastern Metropolitan.

 

barplot(coef(log_level)[3:4], col=rainbow_hcl(25)[17:20],
        cex.names=0.9, names=c("townhouse", "unit"),
        main="Coefficients for Type as compared to the baseline 'house'")

Regarding the different housing types, it is quite obvious that townhouses and units are estimated to be less expensive than our baseline category house.

 

3.2 Model 2: Single Pruned Tree

As a second approach, we would like to fit a regression tree to our housing data. Since regression trees are generally very easy to interpret, they offer a convenient way to present our findings to non-technical stakeholders at “InnovEstate Investors”.

Similar to the linear regression model, we exclude in advance certain variables that would otherwise be too specific for a regression tree. Pruning the tree then takes care of selecting the appropriate variables from our manually created subset.

The variables that we choose to exclude for the computation of our tree are the same that we also excluded for our linear regression. Out of the remaining variables, we compute a full tree and then prune it.

full_tree <- rpart(Price ~ ., data = regression_subset, control = list(cp = 0)) # We compute the whole tree

# printcp(full_tree)
par(mfrow = c(1,2))
rpart::plotcp(full_tree) 

# We see that the plot approaches the mean error between 20 and 30 so we change the limits of our x-axis to confirm this observation 
rpart::plotcp(full_tree, xlim=c(0, 40))

This plot suggests that a good choice of cp for pruning would be the one corresponding to roughly 25 splits, as this is the leftmost value at which the mean relative error starts to flatten out. However, due to the limited space for plotting such an extensive tree, and in line with our goal of keeping the tree simple and easily interpretable, we opt for a smaller value. We therefore choose 9 splits as the final value, as the plot already begins to level off at this point.

 

# To keep the tree small and interpretable, we prune using the cp value
# from the cp table that corresponds to 9 splits
prunedtree <- prune(full_tree, cp = full_tree$cptable[which(full_tree$cptable[, "nsplit"] == 9), "CP"])
plot(as.party(prunedtree))

From the obtained tree we can infer that Regionname, Distance, BuildingArea and YearBuilt are considered decisive criteria for splitting the tree and predicting prices. However, it is important to keep two things in mind: first, that we subsetted the data to limit the overall number of variables, and second, that trees are built in a top-down, greedy manner, i.e. they split the data at each node irrespective of how this split might affect later splits. By restricting the number of splits to 9, we further constrained the flexibility of the tree. Additionally, one should keep in mind that there is a certain correlation between Regionname and Distance, i.e. the distance to the Central Business District.

 

Our tree suggests the following:

In general, the price seems to heavily depend on the BuildingArea, with an upward trend to higher prices as the BuildingArea increases. The price also appears to be higher as the distance to the Central Business District decreases. Both of these aspects align with our expectations.

Specifically, we can see that houses in the Southern Metropolitan region with a building area of at least 266.265 square meters exhibit the highest price range. If, additionally, the distance to the Central Business District is taken into account, one can see that houses closer to the CBD, namely closer than 7.45 kilometers, are more expensive. However, one must consider that only 57 observations belong to this leaf node, which is not a lot given that we have 6813 observations overall.

Most of our observations, namely 1435, fall into the category of houses that were built after 1949 and have a building area of less than 94.5 square meters. These are also the least expensive houses, as the distribution of our leftmost terminal node suggests.

Surprisingly, when our tree splits our data according to the region, it differentiates between real estate objects in Northern Victoria and objects in all other regions. This is a bit different from what our linear regression model suggested earlier. According to the linear regression model the region Northern Victoria has a positive impact on the housing price but the effect of the Western Metropolitan area was much stronger. Therefore we would have expected that our tree would have differentiated between real estate objects in Western Metropolitan and objects in all other regions.

 

3.3 Model 3: Random Forest

The third model we fit to the data is the random forest.

randomforest <- ranger( Price ~ ., data = regression_subset, importance="permutation")
print(randomforest)
## Ranger result
## 
## Call:
##  ranger(Price ~ ., data = regression_subset, importance = "permutation") 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      6813 
## Number of independent variables:  13 
## Mtry:                             3 
## Target node size:                 5 
## Variable importance mode:         permutation 
## Splitrule:                        variance 
## OOB prediction error (MSE):       96431317954 
## R squared (OOB):                  0.7867528

Our random forest consists of 500 trees. The out-of-bag (OOB) R-squared is computed on observations that were not used to build a given tree and thus estimates performance on unseen data. The output suggests that our model is able to explain roughly 78.7% of the variation in housing prices.
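To express the OOB error on the scale of the prices themselves, we can take the square root of the reported MSE (a small sketch using the prediction.error field of the ranger object):

# OOB root mean squared error in dollars (square root of the OOB MSE above)
sqrt(randomforest$prediction.error)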

 

We plot the importance scores to find out which variables are considered to be most important in the model.

plot(as.table(importance(randomforest)), ylab="", las=2, cex.axis=0.7, main= "Importance Scores")

The plot suggests that our random forest considers the most important variables for predicting housing prices to be BuildingArea and Distance, followed by Regionname, YearBuilt and Rooms.

This suggests that apart from the size of the building, the location of the house which is related to the variables Distance, namely the distance to the Central Business District, and Regionname is most important.

 

Comparison to the tree

With this conclusion, the random forest agrees with our single pruned tree on the most important variables for predicting prices. Due to its different modeling approach, however, the random forest is able to put additional emphasis on variables that were not considered in the tree. In particular, it ranks Rooms among the five most important variables, which the single pruned tree, built on fewer variables, does not reflect.

 

We investigate one of the most important variables in the random forest’s predictions, namely Distance, using a partial dependence plot.

# Grid of 20 Distance values, equally spaced on the log scale
grd <- 10^seq(min(log10(regression_subset$Distance[regression_subset$Distance > 0])),
              max(log10(regression_subset$Distance)), length.out = 20)

# Random sample of 300 observations to keep the computation manageable
nd <- regression_subset[sample.int(nrow(regression_subset), 300), ]

# For each grid value, set Distance to that value and predict prices
prs <- lapply(grd, function(val) {
  nd$Distance <- val
  predict(randomforest, data = nd, type = "response")$predictions
})

# One grey line per sampled observation, the mean prediction in dark red
matplot(grd, t(do.call("cbind", prs)), type = "l", col = rgb(.1, .1, .1, .1),
        lty = 1, log = "x", xlab = "Distance to CBD", ylab = "Housing Price")
price_predictions <- rowMeans(t(do.call("cbind", prs)))
lines(grd, price_predictions, col = "darkred", lwd = 2)

The partial dependence plot shows, for a sample of observations, the predicted housing price as Distance varies while all other variables are kept fixed. The plot confirms that predicted housing prices fall as the distance to the Central Business District increases. The red line indicates the mean prediction for a given distance to the CBD.

4. Model Evaluation

Now that we have computed three different models for price prediction, we would like to select the one that best predicts the prices of new properties. We therefore compare and evaluate our three models.

To evaluate our models, we would like to see how they perform on new, unseen data. To do so, we perform a ten-fold cross-validation: each observation is randomly assigned to one of 10 folds. We then loop through the folds, training the models on all observations outside the current fold and testing them on all observations within it. This yields a mean squared error per model in each iteration, which we then summarize in boxplots.

n <- nrow(regression_subset)

fold <- 10

folds <- sample(rep(1:fold, ceiling(n/fold)), n)

MSE <- list()
for (tfold in seq_len(fold)) {
  train_idx <- which(folds != tfold)
  test_idx <- which(folds == tfold)
  
  randomforest <- ranger(Price ~ ., data = regression_subset[train_idx, ])
  
  tree <- prune(rpart(Price ~ ., data = regression_subset[train_idx, ], control = list(cp = 0)),
                cp = full_tree$cptable[which(full_tree$cptable[, "nsplit"] == 9), "CP"])
  # Prune with the same cp (corresponding to 9 splits) as the single tree above
  
  lrm <- lm(log(Price) ~ Rooms + Type + Method + Date + Distance + 
    Bathroom + Car + Landsize + BuildingArea + YearBuilt + Regionname, 
    data = regression_subset[train_idx, ])
    # Choose the variables selected by step()
  
  # Make predictions on the test data
  p_rf <- predict(randomforest, data = regression_subset[test_idx, ])$predictions
  p_tr <- predict(tree, newdata = regression_subset[test_idx, ])
  p_lr <- exp(predict(lrm, newdata = regression_subset[test_idx, ])) # take exp() to get the Price
  # rather than log(Price)
  
  MSE[[tfold]] <- unlist(lapply(
    list(random_forest = p_rf, tree = p_tr, linear_regression = p_lr),
    function(predicted) {
      
      # Calculate mean squared errors
      mean((regression_subset[test_idx, ]$Price - predicted)^2)
    }
  ))
}

 

We can now plot the Mean Squared Errors.

boxplot(do.call("rbind", MSE), main="Model Evaluation", col=c("forestgreen","steelblue","coral"),ylab="Mean Squared Errors", ylim=c(0, 2.5*10^11))

Interpretation:

Our boxplots suggest that the random forest clearly outperforms both the single pruned tree and the linear regression model. The tree seems too simple to capture the complexity of Melbourne’s housing market and therefore performs worst. Generally, linear regression is well suited for inference, i.e. for understanding how our predictors affect the target variable; its predictions, however, are not as accurate as those of a more flexible model such as our random forest.
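To summarize the cross-validation results numerically, one could average the fold-wise errors and convert them to RMSEs in dollars (a small sketch using the MSE list from above):

# Mean MSE per model across the ten folds, and the corresponding RMSE
mse_matrix <- do.call("rbind", MSE)
colMeans(mse_matrix)
sqrt(colMeans(mse_matrix))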

 

5. Conclusion and application

We are very happy with the outcome of the project. For us at “InnovEstate Investors”, this model constitutes a powerful tool that helps us improve the performance of our real estate portfolio.

Here is how we apply our model in action. Four properties that just entered the Melbourne real estate market are stored in the csv file melb_projects.csv.

After loading the data we predict the value of the properties with our most accurate model.

melb_projects <- read.csv("melb_projects.csv", stringsAsFactors=TRUE)
head(melb_projects)
# We fit our random forest once more on the full cleaned data set
randomforest <- ranger( Price ~ ., data = regression_subset, importance="permutation")


predictions <- predict(randomforest, data =melb_projects)
predicted_prices <- predictions$predictions

print_data <- data.frame(Address = melb_projects$Address,
                         Distance = melb_projects$Distance,
                         BuildingArea = melb_projects$BuildingArea,
                         Regionname = melb_projects$Regionname,
                         Type = melb_projects$Type,
                         Landsize = melb_projects$Landsize,
                         Predicted_Price = predicted_prices)


print_data <- print_data[order(print_data$Predicted_Price), ]
print(print_data)
##                Address Distance BuildingArea            Regionname Type
## 3      701 Park Street      5.2           82 Northern Metropolitan    u
## 2    7 Bangs Street St      4.6           93 Southern Metropolitan    u
## 1 198B Blackshaws Road     11.1          124  Western Metropolitan    h
## 4 75 Wellington Street      3.0          107 Northern Metropolitan    u
##   Landsize Predicted_Price
## 3        0        525549.4
## 2        0        765977.8
## 1      150        828754.1
## 4        0        830468.1

The sales price for 701 Park Street is AUD 550,000. As our predicted fair value is below that, we decide not to buy the property. The sales price of 198B Blackshaws Road, on the other hand, is AUD 690,600. As our predicted price is above that, we decide to buy the property and expect its value to increase.
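The decision rule just described is easy to automate (a hypothetical sketch; the sales prices are taken from the text above and hard-coded here):

# Hypothetical decision rule: buy when our predicted fair value exceeds
# the observed sales price (prices as stated in the text above)
sales_prices <- c("701 Park Street" = 550000, "198B Blackshaws Road" = 690600)
idx <- match(names(sales_prices), print_data$Address)
decision <- ifelse(print_data$Predicted_Price[idx] > sales_prices, "buy", "pass")
data.frame(Address = names(sales_prices), Decision = decision)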