This housing price data set was downloaded from Kaggle and contains 34,857 observations and 21 variables. It is prepared here for machine learning models that predict property price.
A regression tree (a decision tree whose dependent variable is continuous) will be used for modeling. Afterwards, a Gradient Boosting Machine (GBM) will be adopted to improve predictive performance by combining an optimal number of trees.
This RMarkdown file records the data preparation process.
#Load the packages used below (dplyr for the pipeline, lubridate for dates)
library(dplyr)
library(lubridate)
houses_raw <- houses_raw %>%
  mutate(
    #Change house type to a factor with more readable value names
    Type = as.factor(recode(Type,
                            "h" = "House",
                            "u" = "Unit",
                            "t" = "Townhouse")),
    #Convert Date from character (day-month-year order) to an analyzable date
    Date = dmy(Date),
    #Add a month variable and a year variable for analysis
    #Month should be a factor rather than numeric
    Month = month(Date, label = TRUE, abbr = TRUE),
    Year = year(Date),
    #Store Postcode as character instead of numeric
    Postcode = as.character(Postcode)
  )
#One house has an abnormal YearBuilt value of 2106, which is likely a typo (2016?), but its Price is missing, so it will not be included in the model anyway
houses_raw %>%
  arrange(desc(YearBuilt)) %>%
  select(Address, Rooms, Type, Price, YearBuilt) %>%
  slice_head(n = 5)
## # A tibble: 5 x 5
##   Address       Rooms Type    Price YearBuilt
##   <chr>         <dbl> <fct>   <dbl>     <dbl>
## 1 3 Maringa St      4 House      NA      2106
## 2 1 Wyuna Ct        3 House 1100000      2019
## 3 8 Thomas St       2 House 1310000      2018
## 4 22 Derreck Av     4 House 1310000      2018
## 5 40 Warwick Rd     3 House  890000      2018
First of all, houses with a missing Price will not be included in the analysis. After dropping 7,610 houses with no available price, the dataset now has 27,247 observations. The following plot checks for missing values among these observations.
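As a minimal sketch of this step, the snippet below drops rows without a price and then computes the share of missing values per column. The toy data frame and its values are hypothetical stand-ins for the Kaggle file, and the names `n_dropped` and `missing_share` are introduced here for illustration only.

```r
# Toy stand-in for the raw data (hypothetical values, not the Kaggle file)
houses_raw <- data.frame(
  Price    = c(1000000, NA, 800000, NA, 1200000),
  Landsize = c(200, 350, NA, 500, NA),
  Car      = c(1, NA, 2, 0, 1)
)

# Keep only rows with an observed Price
houses <- houses_raw[!is.na(houses_raw$Price), ]
n_dropped <- nrow(houses_raw) - nrow(houses)

# Share of missing values in each remaining column
missing_share <- colMeans(is.na(houses))

print(n_dropped)
print(round(missing_share, 2))
```

The same per-column shares could be plotted instead of printed to get a visual check of missingness.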
Tree models use only a single variable at a time for any split, so missing values are less of a concern. Still, to feed as much useful information as possible into the model, let’s see whether any imputation can be done. One possibility is that NA in some variables actually means zero; this is plausible for Car, Bathroom and Bedroom2. Upon checking, however, all three variables already contain explicit zero values (see the minimums below), so NA cannot be assumed to mean zero, and no imputation is attempted in this project.
##
##        Bathroom   Bedroom2     Car
## ----  ---------  ---------  ------
## Min        0.00       0.00    0.00
## Max        9.00      20.00   18.00
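The check summarized above can be sketched as follows: if a variable already records explicit zeros, NA is unlikely to be a stand-in for zero. The data frame here is a hypothetical toy example, and `has_zero` and `ranges` are illustrative names, not objects from the original analysis.

```r
# Toy data (hypothetical values) with both explicit zeros and NAs
houses <- data.frame(
  Bathroom = c(1, 0, NA, 2),
  Bedroom2 = c(3, NA, 0, 2),
  Car      = c(0, 1, NA, 2)
)
vars <- c("Bathroom", "Bedroom2", "Car")

# Does each variable already contain an explicit zero?
has_zero <- sapply(houses[vars], function(v) any(v == 0, na.rm = TRUE))

# Minimum and maximum per variable, ignoring missing values
ranges <- sapply(houses[vars], range, na.rm = TRUE)
rownames(ranges) <- c("Min", "Max")

print(has_zero)
print(ranges)
```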
Unlike linear regression, regression trees make no assumptions about the shape or distribution of the data.
Price is right-skewed, which is not particularly problematic since we are using a decision tree method. It is worth discussing with real estate domain experts whether high-value properties should be treated separately in valuation. For example, the following shows the 145 properties with values higher than 4,000,000, which account for 0.53% of all the properties in the dataset. They are all included in the analysis for now. A dashboard like this can also be used to filter property values in exploratory analysis.
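A minimal sketch of how such a share can be computed is given below. The price vector is hypothetical; in the real analysis it would be the Price column of the prepared data. The last line only re-checks the arithmetic of the figures quoted above (145 out of 27,247 properties).

```r
# Hypothetical prices; the real analysis would use the Price column
price <- c(650000, 890000, 1310000, 4200000, 720000, NA, 5100000, 1100000)
threshold <- 4000000

# Count and percentage of properties above the threshold, ignoring NAs
n_high <- sum(price > threshold, na.rm = TRUE)
pct_high <- 100 * n_high / sum(!is.na(price))

# Arithmetic check of the quoted figures: 145 of 27,247 properties
pct_quoted <- round(100 * 145 / 27247, 2)

print(c(n_high = n_high, pct_high = round(pct_high, 2), pct_quoted = pct_quoted))
```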
A regression tree forms partitions by minimizing the within-partition sum of squares (the ANOVA criterion), which tends to produce partitions with relatively equal variance. Outliers can therefore affect the tree, but their effect is limited to the partitions that contain them. As a result, regression trees are relatively robust to outliers, though it is still helpful to check for extreme ones.
Statistical tests for outliers, such as Grubbs’s test, Dixon’s test and Rosner’s test, require the data to be normally distributed. Therefore, in this project, boxplots and percentiles will be used to identify outliers.
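To illustrate how an outlier interacts with this criterion, here is a small sketch of a single split on one predictor, chosen by minimizing the total within-partition sum of squares. The data are hypothetical, and the helper `sse` is defined here for illustration; a real tree-fitting package applies the same idea recursively across all predictors.

```r
# Sum of squared deviations from the group mean
sse <- function(y) sum((y - mean(y))^2)

# Toy data (hypothetical): y steps up after x = 5, with one extreme value at x = 10
x <- 1:10
y <- c(10, 11, 9, 10, 12, 30, 31, 29, 30, 200)

# Candidate thresholds: midpoints between consecutive sorted x values
thresholds <- head(sort(x), -1) + diff(sort(x)) / 2

# Total within-partition sum of squares for each candidate split
split_sse <- sapply(thresholds, function(t) sse(y[x < t]) + sse(y[x >= t]))

best_threshold <- thresholds[which.min(split_sse)]
print(best_threshold)
```

In this toy case the best first split isolates the extreme observation in its own partition, so the remaining observations can then be split without being influenced by it.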
A boxplot uses the IQR criterion to identify outliers. But as discussed, because the regression tree method is relatively robust to outliers, let’s only drop extreme ones, if there are any.
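For reference, the usual IQR rule flags values more than 1.5 times the IQR beyond the quartiles. The sketch below applies it to a hypothetical vector of land sizes; the multiplier 1.5 is the conventional default, and the quartiles from `quantile()` can differ slightly from the hinges that `boxplot()` draws.

```r
# Hypothetical land sizes with one extreme value
x <- c(100, 120, 130, 150, 160, 170, 180, 200, 5000)

# Quartiles and interquartile range
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)
```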
#Share of properties in the upper tail of Landsize treated as extreme; can be changed if necessary
Landsize_cut <- 0.0001
upper_bound <- quantile(houses$Landsize, 1 - Landsize_cut, na.rm = TRUE)
#Outliers identified
outliers <- houses %>%
  filter(Landsize >= upper_bound) %>%
  select(Suburb, Address, Rooms, Type, Price, Date, Landsize, BuildingArea)
#Continue without these two outliers; the second condition is needed, or else every observation with a missing Landsize would be dropped
houses <- houses %>%
  filter(Landsize < upper_bound | is.na(Landsize))
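The reason for the second filter condition can be shown on a toy example. A comparison with NA evaluates to NA, and row filtering drops such rows. The sketch below uses base R `subset()`, which treats NA conditions the same way `dplyr::filter()` does; the data and the `upper_bound` value are hypothetical.

```r
# Toy data (hypothetical): one missing and one extreme Landsize
houses <- data.frame(Landsize = c(100, 200, NA, 1000000))
upper_bound <- 100000  # hypothetical cut-off for illustration

# Without the NA condition, the row with missing Landsize is silently lost
naive <- subset(houses, Landsize < upper_bound)

# With the NA condition, only the extreme value is removed
kept <- subset(houses, Landsize < upper_bound | is.na(Landsize))

print(nrow(naive))
print(nrow(kept))
```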
If we set the criterion to properties in the top 0.01% of Landsize, 2 outliers are identified and dropped from the analysis. Please see the following table.
Unsurprisingly, the extreme outliers in BuildingArea correspond to the extreme outliers in Landsize and have already been excluded.
Here are the 10 houses with the earliest YearBuilt according to the data. The first house, with a YearBuilt of 1196, is an outlier and very likely a typo. After checking the property history, that value is changed to 1960.
## # A tibble: 10 x 4
##    Address               Suburb            Price YearBuilt
##    <chr>                 <chr>             <dbl>     <dbl>
##  1 5 Armstrong St        Mount Waverley  1200000      1196
##  2 146 Pigdon St         Carlton North    720000      1820
##  3 2/79 Oxford St        Collingwood      855000      1830
##  4 11 Henry St           Fitzroy          677000      1850
##  5 602/220 Commercial Rd Prahran          841000      1850
##  6 22a Stanley St        Richmond        1600000      1850
##  7 51/167 Fitzroy St     St Kilda        1600000      1850
##  8 52 Nicholson St       Fitzroy         3310000      1854
##  9 352 Moray St          South Melbourne 2260000      1856
## 10 147 Bank St           South Melbourne 2200000      1857
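One way to apply the correction described above is sketched here on a toy data frame. Only the address, the erroneous value 1196 and the corrected value 1960 come from the text; the third row (`1 Example Rd`) and the name `typo_rows` are hypothetical and added for illustration.

```r
# Toy data; the third row is a hypothetical record with a missing YearBuilt
houses <- data.frame(
  Address   = c("5 Armstrong St", "146 Pigdon St", "1 Example Rd"),
  YearBuilt = c(1196, 1820, NA)
)

# which() ignores NA comparisons, so rows with missing YearBuilt are left untouched
typo_rows <- which(houses$Address == "5 Armstrong St" & houses$YearBuilt == 1196)
houses$YearBuilt[typo_rows] <- 1960

print(houses)
```

Matching on both the address and the erroneous value keeps the fix from touching any other record that shares the same street address.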
Finally, let’s look at some exploratory plots to check the distributions of, and correlations between, some key factors. I also find it helpful to put exploratory summary plots on a Power BI or Tableau dashboard and then use the filter tool there to get a quick understanding of the data across relevant sub-groups. Here is an example based on this dataset.