In this project, we will use Decision Trees and Random Forests to predict the price of a home in Melbourne, Australia.
The metric used to determine good-fit of the model is Rsquared, which determines the proportion of variability in the dependent variables that can be explained by the independent variable.
The dataset can be found on Kaggle.
library(ggplot2)
library(dplyr)
library(caret)
library(ranger)
library(rpart)
housingPrices <- read.csv("Melbourne_housing_FULL.csv")
head(housingPrices)
## Suburb Address Rooms Type Price Method SellerG
## 1 Abbotsford 68 Studley St 2 h NA SS Jellis
## 2 Abbotsford 85 Turner St 2 h 1480000 S Biggin
## 3 Abbotsford 25 Bloomburg St 2 h 1035000 S Biggin
## 4 Abbotsford 18/659 Victoria St 3 u NA VB Rounds
## 5 Abbotsford 5 Charles St 3 h 1465000 SP Biggin
## 6 Abbotsford 40 Federation La 3 h 850000 PI Biggin
## Date Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea
## 1 3/09/2016 2.5 3067 2 1 1 126 NA
## 2 3/12/2016 2.5 3067 2 1 1 202 NA
## 3 4/02/2016 2.5 3067 2 1 0 156 79
## 4 4/02/2016 2.5 3067 3 2 1 0 NA
## 5 4/03/2017 2.5 3067 3 2 0 134 150
## 6 4/03/2017 2.5 3067 3 2 1 94 NA
## YearBuilt CouncilArea Lattitude Longtitude Regionname
## 1 NA Yarra City Council -37.8014 144.9958 Northern Metropolitan
## 2 NA Yarra City Council -37.7996 144.9984 Northern Metropolitan
## 3 1900 Yarra City Council -37.8079 144.9934 Northern Metropolitan
## 4 NA Yarra City Council -37.8114 145.0116 Northern Metropolitan
## 5 1900 Yarra City Council -37.8093 144.9944 Northern Metropolitan
## 6 NA Yarra City Council -37.7969 144.9969 Northern Metropolitan
## Propertycount
## 1 4019
## 2 4019
## 3 4019
## 4 4019
## 5 4019
## 6 4019
housingPrices <- na.omit(housingPrices)
housingPrices <- housingPrices %>%
select(-c(Suburb, Address, Method, SellerG, Date, Postcode, Lattitude,
Longtitude, Propertycount))
cor(numerical)
## Rooms Price Distance Bedroom2 Bathroom
## Rooms 1.000000000 0.4750737 -0.09530502 0.96446451 0.62407018
## Price 0.475073693 1.0000000 0.11778468 0.46088037 0.46350104
## Distance -0.095305024 0.1177847 1.00000000 -0.10233076 -0.05940999
## Bedroom2 0.964464508 0.4608804 -0.10233076 1.00000000 0.62649267
## Bathroom 0.624070179 0.4635010 -0.05940999 0.62649267 1.00000000
## Car 0.401422793 0.2094641 -0.10421393 0.40556972 0.31096223
## Landsize 0.101158400 0.0583748 -0.01860843 0.10103520 0.07593872
## BuildingArea 0.606738137 0.5072844 -0.03706485 0.59529870 0.55385498
## YearBuilt 0.006934521 -0.3136639 -0.18901468 0.01631045 0.19291424
## Car Landsize BuildingArea YearBuilt
## Rooms 0.4014228 0.10115840 0.60673814 0.006934521
## Price 0.2094641 0.05837480 0.50728445 -0.313663866
## Distance -0.1042139 -0.01860843 -0.03706485 -0.189014677
## Bedroom2 0.4055697 0.10103520 0.59529870 0.016310451
## Bathroom 0.3109622 0.07593872 0.55385498 0.192914236
## Car 1.0000000 0.12349803 0.31759330 0.139254682
## Landsize 0.1234980 1.00000000 0.08322897 0.037753043
## BuildingArea 0.3175933 0.08322897 1.00000000 0.059936350
## YearBuilt 0.1392547 0.03775304 0.05993635 1.000000000
From the correlation matrix pairs, pairs involving BuildingArea such as Rooms, Bedrooms2, and Bathrooms have high degrees of correlation. In particular Bedroom2 and Rooms have the highest degree of correlation at 0.96446451. In this case, I have removed Bedroom2 from the model.
ggplot(housingPrices, aes(Type, Price)) +
geom_boxplot(outlier.colour = "black") +
scale_x_discrete(labels = c('Houses','Townhouses','Units')) +
scale_y_continuous(breaks=seq(0,10000000,1250000)) +
xlab("Type of Home") +
ylab("Price") +
ggtitle("Price Distribution of Home Type")
housingPrices %>%
filter(YearBuilt != 1196) %>%
select(Price, Type, YearBuilt) %>%
group_by(YearBuilt, Type) %>%
summarise(Mean = mean(Price)) %>%
ggplot(aes(x=YearBuilt, y=Mean, color=Type)) +
geom_smooth(method = 'loess') +
geom_line() +
scale_color_discrete(name = "Type of Home",
labels=c("House", "Townhouse", "Unit")) +
xlab("Year") +
ylab("Price") +
ggtitle("Average Price of Homes")
Over the years there has been a decrease in average prices of houses, with houses increasing in price around 2000. Average prices of townhouses and units have stayed relatively stable.
ggplot(housingPrices, aes(Regionname, Price)) +
geom_boxplot(outlier.colour = "black") +
scale_y_continuous(breaks=seq(0,10000000,1250000)) +
theme(legend.position = "none") +
xlab("Region") +
ylab("Price") +
ggtitle("Region Price Distributions") +
coord_flip()
Homes in the Southern Metropolitan region has the highest average price. This makes sense since this region is the center-most region in Melbourne.
housingPrices %>%
filter(YearBuilt != 1196) %>%
ggplot(aes(x=YearBuilt, y=Price, color=Regionname)) +
scale_x_continuous(breaks=seq(1800,2020,25)) +
geom_smooth(method = 'loess') +
xlab("Year") +
ylab("Price") +
ggtitle("Region Prices") +
labs(color = "Region")
We can observe the years of region development, with some regions being new or older than others. Homes in the southern regions have the steepest-rise in price post-1975, while other regions have had a relative increase in prices.
housingPrices %>%
group_by(CouncilArea) %>%
summarise(Count = n()) %>%
ggplot(aes(reorder(CouncilArea, Count), Count)) +
geom_bar(stat = 'identity') +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
coord_flip() +
xlab("Council Area") +
ylab("Count") +
ggtitle("Council Area Frequencies")
There are many council areas with Moorabool Shire having 1 home and Boroondara City having over 800 homes. To reduce the number of levels for this variable I will convert levels with fewer than 103 counts to “Other”.
housingPrices$BuildingAge <- 2020 - housingPrices$YearBuilt
housingPrices$BuildingAge <- as.integer(housingPrices$BuildingAge)
housingPrices <- housingPrices %>%
filter(BuildingAge != 824, Price != 9000000, BuildingArea < 3000)
temp <- housingPrices %>%
select(CouncilArea) %>%
group_by(CouncilArea) %>%
count(CouncilArea) %>%
arrange(desc(n)) %>%
filter(n <= 103)
housingPrices$CouncilArea <- as.character(housingPrices$CouncilArea)
housingPrices <- housingPrices %>%
mutate(CouncilArea = replace(CouncilArea, CouncilArea %in% temp[1]$CouncilArea,
"Other"))
housingPrices$CouncilArea <- as.factor((housingPrices$CouncilArea))
dummy_obj <- dummyVars(~ Type + Regionname + CouncilArea, data = housingPrices,
sep = ".", levelsOnly = TRUE)
dummies <- predict(dummy_obj, housingPrices)
housingPrices <- cbind(housingPrices, dummies)
housingPrices <- housingPrices %>%
select(-c(Type, Regionname, CouncilArea))
Used logarithmic transformation of response variable to linearize relationship between dependent variables.
set.seed(444)
data <- housingPrices %>%
select(-c(Bedroom2, YearBuilt))
data$Price <- log(data$Price)
colnames(data) <- make.names(colnames(data))
train_ind <- createDataPartition(y = data$Price, p = 0.8, list = FALSE)
training <- data[train_ind,]
testing <- data[-train_ind,]
dt_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
fitted_dt <- train(Price~., data = training,
trControl = dt_ctrl,
tuneLength = 10,
method = "rpart",
metric = "Rsquared")
fitted_dt
## CART
##
## 7109 samples
## 42 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 6398, 6398, 6398, 6398, 6399, 6398, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01369936 0.3353080 0.6054433 0.2611066
## 0.01370926 0.3356166 0.6047248 0.2613788
## 0.01485158 0.3480106 0.5752050 0.2721267
## 0.01512019 0.3501060 0.5700850 0.2739008
## 0.01717110 0.3589371 0.5480675 0.2816320
## 0.03018552 0.3679187 0.5249618 0.2896105
## 0.04792287 0.3832321 0.4846719 0.3041181
## 0.10816361 0.4094428 0.4111957 0.3251637
## 0.11991666 0.4544823 0.2749336 0.3643463
## 0.23875988 0.5069134 0.2140251 0.4094569
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01369936.
plot(varImp(fitted_dt))
Variables such as council area are not as important to the decision tree model.
plot(fitted_dt$finalModel)
text(fitted_dt$finalModel)
Displays the splits the decision tree made.
predict_dt <- predict(fitted_dt, testing)
R2(predict_dt, testing$Price)
## [1] 0.6246983
Both training and testing Rsquared values are relatively close resulting in a good fitted model.
rf_ctrl <- trainControl(method = "cv", number = 5)
fitted_rf <- train(Price~., data = training,
trControl = rf_ctrl,
tuneLength = 10,
method = "ranger",
importance = "permutation",
metric = "Rsquared",
verbose = TRUE)
fitted_rf
## Random Forest
##
## 7109 samples
## 42 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 5688, 5687, 5688, 5687, 5686
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 0.2923992 0.8066375 0.2273218
## 2 extratrees 0.3437727 0.7339959 0.2695868
## 6 variance 0.2071892 0.8573710 0.1560067
## 6 extratrees 0.2361321 0.8176019 0.1797260
## 10 variance 0.1967010 0.8668429 0.1470099
## 10 extratrees 0.2089560 0.8507330 0.1577454
## 15 variance 0.1951739 0.8678133 0.1455961
## 15 extratrees 0.1987813 0.8630721 0.1493470
## 19 variance 0.1953982 0.8671074 0.1456743
## 19 extratrees 0.1963984 0.8659704 0.1472066
## 24 variance 0.1964517 0.8654543 0.1461907
## 24 extratrees 0.1954361 0.8670459 0.1464573
## 28 variance 0.1973864 0.8640132 0.1467639
## 28 extratrees 0.1947382 0.8679373 0.1458742
## 33 variance 0.1990132 0.8616162 0.1477506
## 33 extratrees 0.1946094 0.8679752 0.1457245
## 37 variance 0.2007517 0.8590796 0.1487021
## 37 extratrees 0.1947529 0.8677406 0.1457078
## 42 variance 0.2043512 0.8537970 0.1510615
## 42 extratrees 0.1951966 0.8670938 0.1458868
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## Rsquared was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 33, splitrule =
## extratrees and min.node.size = 5.
print(fitted_rf$finalModel)
## Ranger result
##
## Call:
## ranger::ranger(dependent.variable.name = ".outcome", data = x, mtry = min(param$mtry, ncol(x)), min.node.size = param$min.node.size, splitrule = as.character(param$splitrule), write.forest = TRUE, probability = classProbs, ...)
##
## Type: Regression
## Number of trees: 500
## Sample size: 7109
## Number of independent variables: 42
## Mtry: 33
## Target node size: 5
## Variable importance mode: permutation
## Splitrule: extratrees
## Number of random splits: 1
## OOB prediction error (MSE): 0.03694332
## R squared (OOB): 0.8703749
plot(varImp(fitted_rf))
Similar to the decision tree, council area is not as importance to the model.
plot(fitted_rf)
At 15 selected predictors the model metric, Rsquared, begins to fall with additional predictors using the variance splitting rule. Though the extratrees splitting rule doees not decrease after 15 predictors, the gain from each additional predictor falls off. This suggests that 15 predictors is sufficient for the model.
predict_rf <- predict(fitted_rf, testing)
R2(predict_rf, testing$Price)
## [1] 0.8605818
The Rsquared values from the training and test models are relatively similar. Choosing from both models, random forests produces the better model for regression given the data.
Given the results of the two models, removing council area as a variable would reduce the Rsquared values of both models but reduce the complexity of the models.