In this project, we will use Decision Trees and Random Forests to predict the price of a home in Melbourne, Australia.

The metric used to assess the fit of the models is R-squared, which measures the proportion of variability in the dependent variable (price) that can be explained by the independent variables.
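
As a reminder of how this metric is computed, here is a minimal sketch of R-squared as 1 - SSE/SST using made-up numbers; note that caret's R2() helper used later on the test sets defaults to a closely related quantity, the squared correlation between predictions and observations.

# Minimal sketch of R-squared: 1 - SSE/SST (y and y_hat are hypothetical values)
y     <- c(300000, 450000, 520000, 610000)   # observed prices (made up)
y_hat <- c(320000, 440000, 500000, 650000)   # predicted prices (made up)
1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)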

The dataset can be found on Kaggle.

Setup

Load Packages

library(ggplot2)
library(dplyr)
library(caret)
library(ranger)
library(rpart)

Load dataset

housingPrices <- read.csv("Melbourne_housing_FULL.csv")
head(housingPrices)
##       Suburb            Address Rooms Type   Price Method SellerG
## 1 Abbotsford      68 Studley St     2    h      NA     SS  Jellis
## 2 Abbotsford       85 Turner St     2    h 1480000      S  Biggin
## 3 Abbotsford    25 Bloomburg St     2    h 1035000      S  Biggin
## 4 Abbotsford 18/659 Victoria St     3    u      NA     VB  Rounds
## 5 Abbotsford       5 Charles St     3    h 1465000     SP  Biggin
## 6 Abbotsford   40 Federation La     3    h  850000     PI  Biggin
##        Date Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea
## 1 3/09/2016      2.5     3067        2        1   1      126           NA
## 2 3/12/2016      2.5     3067        2        1   1      202           NA
## 3 4/02/2016      2.5     3067        2        1   0      156           79
## 4 4/02/2016      2.5     3067        3        2   1        0           NA
## 5 4/03/2017      2.5     3067        3        2   0      134          150
## 6 4/03/2017      2.5     3067        3        2   1       94           NA
##   YearBuilt        CouncilArea Lattitude Longtitude            Regionname
## 1        NA Yarra City Council  -37.8014   144.9958 Northern Metropolitan
## 2        NA Yarra City Council  -37.7996   144.9984 Northern Metropolitan
## 3      1900 Yarra City Council  -37.8079   144.9934 Northern Metropolitan
## 4        NA Yarra City Council  -37.8114   145.0116 Northern Metropolitan
## 5      1900 Yarra City Council  -37.8093   144.9944 Northern Metropolitan
## 6        NA Yarra City Council  -37.7969   144.9969 Northern Metropolitan
##   Propertycount
## 1          4019
## 2          4019
## 3          4019
## 4          4019
## 5          4019
## 6          4019

Clean dataset

Remove rows with NA

housingPrices <- na.omit(housingPrices)

Remove identifying columns

housingPrices <- housingPrices %>%
  select(-c(Suburb, Address, Method, SellerG, Date, Postcode, Lattitude, 
            Longtitude, Propertycount))

EDA

# Correlation matrix of the numeric columns
numerical <- housingPrices %>% select_if(is.numeric)
cor(numerical)
##                     Rooms      Price    Distance    Bedroom2    Bathroom
## Rooms         1.000000000  0.4750737 -0.09530502  0.96446451  0.62407018
## Price         0.475073693  1.0000000  0.11778468  0.46088037  0.46350104
## Distance     -0.095305024  0.1177847  1.00000000 -0.10233076 -0.05940999
## Bedroom2      0.964464508  0.4608804 -0.10233076  1.00000000  0.62649267
## Bathroom      0.624070179  0.4635010 -0.05940999  0.62649267  1.00000000
## Car           0.401422793  0.2094641 -0.10421393  0.40556972  0.31096223
## Landsize      0.101158400  0.0583748 -0.01860843  0.10103520  0.07593872
## BuildingArea  0.606738137  0.5072844 -0.03706485  0.59529870  0.55385498
## YearBuilt     0.006934521 -0.3136639 -0.18901468  0.01631045  0.19291424
##                     Car    Landsize BuildingArea    YearBuilt
## Rooms         0.4014228  0.10115840   0.60673814  0.006934521
## Price         0.2094641  0.05837480   0.50728445 -0.313663866
## Distance     -0.1042139 -0.01860843  -0.03706485 -0.189014677
## Bedroom2      0.4055697  0.10103520   0.59529870  0.016310451
## Bathroom      0.3109622  0.07593872   0.55385498  0.192914236
## Car           1.0000000  0.12349803   0.31759330  0.139254682
## Landsize      0.1234980  1.00000000   0.08322897  0.037753043
## BuildingArea  0.3175933  0.08322897   1.00000000  0.059936350
## YearBuilt     0.1392547  0.03775304   0.05993635  1.000000000

From the correlation matrix, several pairs involving Rooms, Bedroom2, Bathroom, and BuildingArea show high degrees of correlation. In particular, Bedroom2 and Rooms have the highest correlation at 0.96. Because of this, I have removed Bedroom2 from the model.
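
Rather than scanning the matrix by eye, caret's findCorrelation() can flag one member of each highly correlated pair; a minimal sketch, where the 0.9 cutoff is an arbitrary choice for illustration:

# Sketch: flag predictors involved in pairs with |correlation| above the cutoff
findCorrelation(cor(numerical), cutoff = 0.9, names = TRUE)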

ggplot(housingPrices, aes(Type, Price)) +
  geom_boxplot(outlier.colour = "black") + 
  scale_x_discrete(labels = c('Houses','Townhouses','Units')) +
  scale_y_continuous(breaks=seq(0,10000000,1250000)) +
  xlab("Type of Home") +
  ylab("Price") +
  ggtitle("Price Distribution of Home Type")

housingPrices %>%
  filter(YearBuilt != 1196) %>%
  select(Price, Type, YearBuilt) %>%
  group_by(YearBuilt, Type) %>%
  summarise(Mean = mean(Price)) %>%
  ggplot(aes(x=YearBuilt, y=Mean, color=Type)) +
  geom_smooth(method = 'loess') +
  geom_line() +
  scale_color_discrete(name = "Type of Home", 
                       labels=c("House", "Townhouse", "Unit")) +
  xlab("Year") +
  ylab("Price") +
  ggtitle("Average Price of Homes")

Across build years, the average price of houses generally declines, with prices rising again for houses built around 2000. Average prices of townhouses and units have stayed relatively stable.

ggplot(housingPrices, aes(Regionname, Price)) +
  geom_boxplot(outlier.colour = "black") +
  scale_y_continuous(breaks=seq(0,10000000,1250000)) +
  theme(legend.position = "none") +
  xlab("Region") +
  ylab("Price") +
  ggtitle("Region Price Distributions") +
  coord_flip()

Homes in the Southern Metropolitan region have the highest prices. This makes sense, since this region is the most central region of Melbourne.

housingPrices %>%
  filter(YearBuilt != 1196) %>%
  ggplot(aes(x=YearBuilt, y=Price, color=Regionname)) +
  scale_x_continuous(breaks=seq(1800,2020,25)) +
  geom_smooth(method = 'loess') +
  xlab("Year") +
  ylab("Price") +
  ggtitle("Region Prices") +
  labs(color = "Region")

We can also see when each region was developed, as some regions contain much newer housing stock than others. Homes in the southern regions show the steepest rise in price for properties built after 1975, while the other regions show more modest increases.

housingPrices %>%
  group_by(CouncilArea) %>%
  summarise(Count = n()) %>%
  ggplot(aes(reorder(CouncilArea, Count), Count)) +
  geom_bar(stat = 'identity') +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
  coord_flip() +
  xlab("Council Area") +
  ylab("Count") +
  ggtitle("Council Area Frequencies")

There are many council areas, ranging from Moorabool Shire with a single home to Boroondara City with over 800 homes. To reduce the number of levels for this variable, I will collapse levels with 103 or fewer observations into “Other”.

Feature Engineering

Create BuildingAge feature

housingPrices$BuildingAge <- 2020 - housingPrices$YearBuilt
housingPrices$BuildingAge <- as.integer(housingPrices$BuildingAge)

Remove outliers

housingPrices <- housingPrices %>% 
  filter(BuildingAge != 824, Price != 9000000, BuildingArea < 3000)

Reducing category levels

temp <- housingPrices %>%
  select(CouncilArea) %>%
  group_by(CouncilArea) %>%
  count(CouncilArea) %>%
  arrange(desc(n)) %>%
  filter(n <= 103)

housingPrices$CouncilArea <- as.character(housingPrices$CouncilArea)

housingPrices <- housingPrices %>%
  mutate(CouncilArea = replace(CouncilArea, CouncilArea %in% temp$CouncilArea,
                               "Other"))

housingPrices$CouncilArea <- as.factor(housingPrices$CouncilArea)

Creating dummy variables

dummy_obj <- dummyVars(~ Type + Regionname + CouncilArea, data = housingPrices, 
                       sep = ".", levelsOnly = TRUE)
dummies <- predict(dummy_obj, housingPrices)

housingPrices <- cbind(housingPrices, dummies)
housingPrices <- housingPrices %>%
  select(-c(Type, Regionname, CouncilArea))
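
As a quick sanity check on the encoding (purely illustrative), the new indicator columns can be inspected; with levelsOnly = TRUE the columns are named by factor level.

# Sketch: inspect the indicator columns produced by dummyVars
ncol(dummies)
head(colnames(dummies))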

Train test split

A logarithmic transformation is applied to the response variable (Price) to reduce its right skew and make its relationship with the predictors more linear.

set.seed(444)

data <- housingPrices %>%
  select(-c(Bedroom2, YearBuilt))
data$Price <- log(data$Price)
colnames(data) <- make.names(colnames(data))

train_ind <- createDataPartition(y = data$Price, p = 0.8, list = FALSE)

training <- data[train_ind,]
testing <- data[-train_ind,]
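
As a quick check on the transformation, a sketch like the following compares the raw and log-transformed price distributions (plots not shown here):

# Sketch: compare the right-skewed raw prices with the log-transformed response
ggplot(housingPrices, aes(Price)) + geom_histogram(bins = 50) + ggtitle("Raw Price")
ggplot(data, aes(Price)) + geom_histogram(bins = 50) + ggtitle("log(Price)")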

Decision Tree

dt_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
fitted_dt <- train(Price~., data = training,
                   trControl = dt_ctrl,
                   tuneLength = 10,
                   method = "rpart",
                   metric = "Rsquared")
fitted_dt
## CART 
## 
## 7109 samples
##   42 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 6398, 6398, 6398, 6398, 6399, 6398, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE       Rsquared   MAE      
##   0.01369936  0.3353080  0.6054433  0.2611066
##   0.01370926  0.3356166  0.6047248  0.2613788
##   0.01485158  0.3480106  0.5752050  0.2721267
##   0.01512019  0.3501060  0.5700850  0.2739008
##   0.01717110  0.3589371  0.5480675  0.2816320
##   0.03018552  0.3679187  0.5249618  0.2896105
##   0.04792287  0.3832321  0.4846719  0.3041181
##   0.10816361  0.4094428  0.4111957  0.3251637
##   0.11991666  0.4544823  0.2749336  0.3643463
##   0.23875988  0.5069134  0.2140251  0.4094569
## 
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01369936.
plot(varImp(fitted_dt))

Variables such as the council area dummies are not as important to the decision tree model.

plot(fitted_dt$finalModel)
text(fitted_dt$finalModel)

This plot displays the splits made by the decision tree.
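
For a more readable rendering of the same tree, the rpart.plot package can be used on the final model; this is only a sketch and assumes the package is installed (it is not loaded above).

# Sketch: nicer tree rendering, assuming rpart.plot is installed
# install.packages("rpart.plot")
rpart.plot::rpart.plot(fitted_dt$finalModel)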

predict_dt <- predict(fitted_dt, testing)
R2(predict_dt, testing$Price)
## [1] 0.6246983

The training and testing R-squared values are relatively close, suggesting a reasonably well-fitted model that is not overfitting.
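
One way to make that comparison explicit is to pull the best resampled R-squared from the caret object and set it next to the test value; a small sketch:

# Sketch: best cross-validated Rsquared next to the held-out test Rsquared
c(train = max(fitted_dt$results$Rsquared),
  test  = R2(predict_dt, testing$Price))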

Random Forest

rf_ctrl <- trainControl(method = "cv", number = 5)
fitted_rf <- train(Price~., data = training,
                   trControl = rf_ctrl,
                   tuneLength = 10,
                   method = "ranger",
                   importance = "permutation",
                   metric = "Rsquared",
                   verbose = TRUE)
fitted_rf
## Random Forest 
## 
## 7109 samples
##   42 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 5688, 5687, 5688, 5687, 5686 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE       Rsquared   MAE      
##    2    variance    0.2923992  0.8066375  0.2273218
##    2    extratrees  0.3437727  0.7339959  0.2695868
##    6    variance    0.2071892  0.8573710  0.1560067
##    6    extratrees  0.2361321  0.8176019  0.1797260
##   10    variance    0.1967010  0.8668429  0.1470099
##   10    extratrees  0.2089560  0.8507330  0.1577454
##   15    variance    0.1951739  0.8678133  0.1455961
##   15    extratrees  0.1987813  0.8630721  0.1493470
##   19    variance    0.1953982  0.8671074  0.1456743
##   19    extratrees  0.1963984  0.8659704  0.1472066
##   24    variance    0.1964517  0.8654543  0.1461907
##   24    extratrees  0.1954361  0.8670459  0.1464573
##   28    variance    0.1973864  0.8640132  0.1467639
##   28    extratrees  0.1947382  0.8679373  0.1458742
##   33    variance    0.1990132  0.8616162  0.1477506
##   33    extratrees  0.1946094  0.8679752  0.1457245
##   37    variance    0.2007517  0.8590796  0.1487021
##   37    extratrees  0.1947529  0.8677406  0.1457078
##   42    variance    0.2043512  0.8537970  0.1510615
##   42    extratrees  0.1951966  0.8670938  0.1458868
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 33, splitrule =
##  extratrees and min.node.size = 5.
print(fitted_rf$finalModel)
## Ranger result
## 
## Call:
##  ranger::ranger(dependent.variable.name = ".outcome", data = x,      mtry = min(param$mtry, ncol(x)), min.node.size = param$min.node.size,      splitrule = as.character(param$splitrule), write.forest = TRUE,      probability = classProbs, ...) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      7109 
## Number of independent variables:  42 
## Mtry:                             33 
## Target node size:                 5 
## Variable importance mode:         permutation 
## Splitrule:                        extratrees 
## Number of random splits:          1 
## OOB prediction error (MSE):       0.03694332 
## R squared (OOB):                  0.8703749
plot(varImp(fitted_rf))

Similar to the decision tree, council area is not as important to the model.

plot(fitted_rf)

With the variance splitting rule, R-squared begins to fall once more than about 15 predictors are randomly selected at each split. With the extratrees splitting rule, R-squared does not decrease after 15 predictors, but the gain from each additional predictor tapers off. This suggests that around 15 predictors is sufficient for the model.
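
If one wanted to lock in that smaller mtry rather than tune over the full grid, a sketch like the following (not run here) passes a single parameter combination through tuneGrid:

# Sketch (not run): fix mtry at 15 with the extratrees splitrule instead of tuning
small_grid <- expand.grid(mtry = 15, splitrule = "extratrees", min.node.size = 5)
fitted_rf_small <- train(Price ~ ., data = training,
                         trControl = rf_ctrl,
                         tuneGrid = small_grid,
                         method = "ranger",
                         metric = "Rsquared")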

predict_rf <- predict(fitted_rf, testing)
R2(predict_rf, testing$Price)
## [1] 0.8605818

The R-squared values on the training and test sets are relatively similar. Of the two models, the random forest produces the better regression model for this data.
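
Because the response was modeled on the log scale, predictions can be mapped back to dollar prices with exp(); a small sketch of the test-set error in dollars:

# Sketch: back-transform log-scale predictions and compute mean absolute error in dollars
mean(abs(exp(predict_rf) - exp(testing$Price)))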

Given the results of the two models, removing council area as a variable would likely reduce the R-squared values of both models somewhat, but would also reduce their complexity.
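
A hedged sketch of how that could be checked for the random forest: drop the council-area dummy columns and refit, then compare R-squared. The name pattern below is an assumption about how dummyVars and make.names labelled those columns (e.g. "Yarra.City.Council", plus the collapsed "Other" level), so it should be verified against colnames(training) first.

# Sketch (not run): drop the council-area dummies and refit to check the claim
no_council <- training %>% select(-matches("Council$|^Other$"))
fitted_rf_nc <- train(Price ~ ., data = no_council,
                      trControl = rf_ctrl,
                      tuneLength = 5,
                      method = "ranger",
                      metric = "Rsquared")
max(fitted_rf_nc$results$Rsquared)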