1 Overview

The housing price data set was downloaded from Kaggle and contains 34,857 observations and 21 variables. This report prepares it for machine learning models that predict property price.

A regression tree (a decision tree whose dependent variable is continuous) will be used for modeling. Afterwards, a Gradient Boosting Machine (GBM) will be used to improve predictive performance by combining an optimal number of trees.

This RMarkdown file records the data preparation process.

2 Problem Detection

  • Assumptions Checking
      1. Outliers: Regression trees are relatively robust to outliers, but Landsize, BuildingArea and YearBuilt have extreme outliers that should be checked.
      2. Distribution and Shape: Regression trees are robust to shapes and distributions, but we will nonetheless look at the distributions of key variables.
      3. Missings: Regression trees are robust to missing values, but we will check whether imputing some values accurately could add information.
  • Variable Types to be adjusted:
      1. Postcode: Numeric -> Character
      2. Date: Character -> Date
  • Variable Values to be checked/adjusted
      1. YearBuilt has a maximum value of 2106, possibly due to a typo; the values will be checked.
      2. Type: Values of house type can be made more readable (e.g., h -> House)

3 Variable type and naming

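library(dplyr)      # mutate(), recode(), arrange(), select(), slice_head()
library(lubridate)  # dmy(), month(), year()

houses_raw <- houses_raw %>%
  # Change house type to a factor, with more readable value names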
  mutate(Type=as.factor(recode(Type,
                          "h"="House",
                          "u"="Unit",
                          "t"="Townhouse")),
         #Change date from character to analyzable date
         Date=dmy(Date),
         #Add a year variable and a month variable for analysis
         #Month should be factor rather than numeric
         Month=month(Date, label = TRUE, abbr = TRUE),
         Year=year(Date),
         #Change Postcode to be character instead of numeric
         Postcode=as.character(Postcode))

# There is one house with an abnormal YearBuilt value of 2106, which is likely a typo (2016?).
# It has no value for Price, so it won't be included in the model anyway.
houses_raw %>% 
  arrange(-YearBuilt) %>% 
  select(Address,Rooms, Type, Price, YearBuilt) %>% 
  slice_head(n=5)
## # A tibble: 5 x 5
##   Address       Rooms Type    Price YearBuilt
##   <chr>         <dbl> <fct>   <dbl>     <dbl>
## 1 3 Maringa St      4 House      NA      2106
## 2 1 Wyuna Ct        3 House 1100000      2019
## 3 8 Thomas St       2 House 1310000      2018
## 4 22 Derreck Av     4 House 1310000      2018
## 5 40 Warwick Rd     3 House  890000      2018

4 Missings

First of all, houses with a missing Price will not be included in the analysis. After dropping 7,610 houses with no available price, the dataset has 27,247 observations. The following plot checks missing values among these observations.
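The filtering step and a per-variable count of missing values could be sketched as follows. This is a minimal illustration in base R on a made-up toy data frame; the names `toy_raw` and `toy_houses` and all values are assumptions, not the report's actual code or data.

```r
# Toy stand-in for the raw data; values are made up for illustration
toy_raw <- data.frame(
  Price    = c(1100000, NA, 890000, NA, 1310000),
  Car      = c(2, 1, NA, 0, 1),
  Landsize = c(650, NA, 120, 300, NA)
)

# Keep only the observations with a recorded Price
toy_houses <- toy_raw[!is.na(toy_raw$Price), ]

# Number of rows dropped and rows remaining
n_dropped <- nrow(toy_raw) - nrow(toy_houses)
n_kept    <- nrow(toy_houses)

# Count of missing values per variable among the remaining rows
missing_counts <- colSums(is.na(toy_houses))
print(missing_counts)
```

Counting missing values only after dropping rows without Price keeps the summary focused on the observations that can actually enter the model.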

Tree models use only a single variable at a time for any split, so missing values are less of a concern. Still, to feed as much useful information as possible into the model, let’s see whether any imputation can be done. One possibility is that NA in some variables actually means zero; this is plausible for Car, Bathroom and Bedroom2. Upon checking, however, all three already contain explicit zero values (see the minimums below), so NA is unlikely to stand for zero, and no imputation is attempted in this project.

## 
##             Bathroom   Bedroom2     Car
## --------- ---------- ---------- -------
##       Min       0.00       0.00    0.00
##       Max       9.00      20.00   18.00
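A check of this kind could look like the sketch below, which computes the minimum and maximum of each variable and tests whether zero is already recorded explicitly. The data frame `toy_houses` and its values are made up for illustration.

```r
# Toy stand-in for the three count variables; values are made up
toy_houses <- data.frame(
  Bathroom = c(1, 0, 2, NA, 9),
  Bedroom2 = c(3, 0, NA, 20, 2),
  Car      = c(0, 2, NA, 18, 1)
)

# Minimum and maximum of each variable, ignoring missing values
ranges <- sapply(toy_houses, range, na.rm = TRUE)
rownames(ranges) <- c("Min", "Max")
print(ranges)

# TRUE when zero is already recorded explicitly for a variable,
# which suggests that NA does not simply stand for zero
zero_recorded <- sapply(toy_houses, function(x) any(x == 0, na.rm = TRUE))
print(zero_recorded)
```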

5 Distribution

Unlike linear regression, regression trees have no assumptions about the shape and the distribution of the data.

Price is right-skewed, which is not particularly problematic since we are using a decision tree method. It is worth discussing with real estate domain experts whether high-value properties should be treated separately in valuation. For example, the following shows the 145 properties with values higher than 4,000,000, which account for 0.53% of all the properties in the dataset. They are all included in the analysis for now. A dashboard like this can also be used to filter property values in exploratory analysis.
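The count and share of high-value properties can be computed in a couple of lines. The sketch below uses a made-up vector of prices; only the 4,000,000 threshold comes from the text.

```r
# Toy prices, made up for illustration (right-skewed)
price <- c(450000, 620000, 700000, 880000, 950000,
           1100000, 1300000, 1600000, 4500000, 9000000)

threshold <- 4000000  # cut-off used in the text

# Number and percentage of properties above the threshold
n_high     <- sum(price > threshold)
share_high <- 100 * mean(price > threshold)

# A mean well above the median is a quick sign of right skew
print(c(mean = mean(price), median = median(price)))
```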

6 Outlier Detection

A regression tree forms partitions by minimizing the within-partition sum of squared errors (the ANOVA criterion), which tends to produce partitions with relatively equal variance. Outliers can therefore affect the splits, but their effect is limited to the relevant partitions. Regression trees are thus relatively robust to outliers, though it is still helpful to check for extreme ones.
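To make the splitting criterion concrete, here is a minimal sketch of how a single split could be chosen on one predictor by minimizing the total within-node sum of squared errors. The helper names `sse` and `best_split` and the toy data are hypothetical; packages such as rpart search over all predictors and add stopping rules and pruning on top of this idea.

```r
# Sum of squared errors around the mean of a node
sse <- function(y) sum((y - mean(y))^2)

# Find the cut point on x that minimises the total within-node SSE
best_split <- function(x, y) {
  xs <- sort(unique(x))
  # Candidate cut points: midpoints between consecutive distinct x values
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  total <- sapply(cuts, function(cut) sse(y[x < cut]) + sse(y[x >= cut]))
  list(cut = cuts[which.min(total)], sse = min(total))
}

# Toy data, made up for illustration: price jumps once rooms reach 4
rooms <- c(1, 2, 2, 3, 3, 4, 4, 5)
price <- c(300, 350, 340, 420, 400, 900, 950, 1000)

res <- best_split(rooms, price)
print(res)
```

Because each node's error is measured around its own mean, one extreme value mainly inflates the error of the node it falls into, which is the sense in which its effect stays local.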

Statistical tests for outliers, such as Grubbs’s test, Dixon’s test and Rosner’s test, require the data to be normally distributed. Therefore, in this project, boxplots and percentiles will be used to identify outliers.
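The two approaches can be compared on a small example. The sketch below applies the boxplot (IQR) rule and a high-percentile rule to a made-up vector of land sizes; the 0.95 percentile is chosen only because the toy vector is tiny and is not the cut-off used in this report.

```r
# Toy land sizes, made up for illustration
landsize <- c(120, 300, 450, 520, 600, 650, 700, 820, 1500, 40000)

# Boxplot (IQR) rule: flag values above Q3 + 1.5 * IQR
q        <- quantile(landsize, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
iqr_cut  <- q[2] + 1.5 * (q[2] - q[1])
iqr_flag <- landsize[landsize > iqr_cut]

# Percentile rule: flag only values at or above a high percentile
pct_cut  <- quantile(landsize, 0.95, na.rm = TRUE, names = FALSE)
pct_flag <- landsize[landsize >= pct_cut]

print(iqr_flag)
print(pct_flag)
```

On this toy vector the IQR rule flags more values than the percentile rule, which illustrates why a percentile cut-off is the more conservative choice when only extreme values should be dropped.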

6.1 Landsize

A boxplot uses the IQR criterion to identify outliers. But as discussed, because the regression tree method is relatively robust to outliers, let’s drop only the extreme ones, if there are any.

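# Share of observations treated as extreme; can be changed if necessary
Landsize_cut <- 0.0001

# Upper bound: the (1 - Landsize_cut) quantile of Landsize
upper_bound <- quantile(houses$Landsize, 1 - Landsize_cut, na.rm = TRUE)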

#Outliers identified
outliers <- houses %>% 
  filter(Landsize>=upper_bound) %>% 
  select(Suburb, Address, Rooms, Type, Price, Date, Landsize, BuildingArea)

# Continue without these two outliers. The is.na() condition is needed;
# otherwise every observation with a missing Landsize would be dropped too.
houses <- houses %>% 
  filter(Landsize < upper_bound | is.na(Landsize)) 

If we set the criterion to the properties with the highest 0.01% of Landsize, 2 outliers are identified and dropped from the analysis. Please see the following table.

6.2 Building Area

Unsurprisingly, the extreme outliers in BuildingArea correspond to the extreme outliers in Landsize, and were already excluded.

6.3 YearBuilt

Here are the 10 houses with the earliest YearBuilt according to the data. The first house, with a YearBuilt of 1196, is an outlier and very likely a typo. After checking the property history, that value was changed to 1960.

## # A tibble: 10 x 4
##    Address               Suburb            Price YearBuilt
##    <chr>                 <chr>             <dbl>     <dbl>
##  1 5 Armstrong St        Mount Waverley  1200000      1196
##  2 146 Pigdon St         Carlton North    720000      1820
##  3 2/79 Oxford St        Collingwood      855000      1830
##  4 11 Henry St           Fitzroy          677000      1850
##  5 602/220 Commercial Rd Prahran          841000      1850
##  6 22a Stanley St        Richmond        1600000      1850
##  7 51/167 Fitzroy St     St Kilda        1600000      1850
##  8 52 Nicholson St       Fitzroy         3310000      1854
##  9 352 Moray St          South Melbourne 2260000      1856
## 10 147 Bank St           South Melbourne 2200000      1857
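A targeted correction like this one could be written as in the sketch below. It uses a three-row toy data frame built from the first rows of the table above; the name `toy_houses` is hypothetical, and on the full dataset the condition would probably also need to match on Suburb so that only the intended property is changed.

```r
# Toy stand-in built from the first three rows of the table above
toy_houses <- data.frame(
  Address   = c("5 Armstrong St", "146 Pigdon St", "2/79 Oxford St"),
  YearBuilt = c(1196, 1820, 1830),
  stringsAsFactors = FALSE
)

# Identify the single implausible record; the is.na() guard keeps
# rows with a missing YearBuilt from producing NA in the index
is_typo <- toy_houses$Address == "5 Armstrong St" &
  !is.na(toy_houses$YearBuilt) & toy_houses$YearBuilt == 1196

# Replace the value for that property only
toy_houses$YearBuilt[is_typo] <- 1960
print(toy_houses)
```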

7 Exploratory plots

Finally, let’s look at some exploratory plots to check the distributions of, and correlations between, some key factors. I also find it helpful to put exploratory summary plots on a Power BI or Tableau dashboard, and then use the filter tool there to get a quick understanding of the data across relevant sub-groups. Here is an example based on this dataset.
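As a numeric companion to the plots, pairwise correlations between key variables can be computed while tolerating missing values. The sketch below uses a made-up toy data frame; the choice of Price, Rooms and Landsize as the key variables is an assumption for illustration.

```r
# Toy data with some missing values; made up for illustration
toy_houses <- data.frame(
  Price    = c(500, 650, 900, 1200, NA, 1500),
  Rooms    = c(2, 2, 3, 4, 3, 5),
  Landsize = c(200, NA, 450, 600, 500, 800)
)

# Each correlation uses all rows where both variables are observed
cor_matrix <- cor(toy_houses, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```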

8 Next Step: Decision Tree and GBM