Introduction

This kernel’s goal is to establish an advanced regression model to predict housing price in Melbourne, Australia. The data is a snapshot in September 2017 of Melbourne housing clearance data created by Tony Pino. A well-tuned model can help discover a reasonably priced property.

Data Exploration

Data Structure

melb_house <- read_csv("melb_data.csv", na = "")
melb_house$Date <- as.Date(melb_house$Date, format = "%d/%m/%Y") # format date column
glimpse(melb_house)
Rows: 13,580
Columns: 21
$ Suburb        <chr> "Abbotsford", "Abbotsford", "Abbotsford", "Abbotsfo…
$ Address       <chr> "85 Turner St", "25 Bloomburg St", "5 Charles St", …
$ Rooms         <dbl> 2, 2, 3, 3, 4, 2, 3, 2, 1, 2, 2, 3, 2, 2, 1, 2, 3, …
$ Type          <chr> "h", "h", "h", "h", "h", "h", "h", "h", "u", "h", "…
$ Price         <dbl> 1480000, 1035000, 1465000, 850000, 1600000, 941000,…
$ Method        <chr> "S", "S", "SP", "PI", "VB", "S", "S", "S", "S", "S"…
$ SellerG       <chr> "Biggin", "Biggin", "Biggin", "Biggin", "Nelson", "…
$ Date          <date> 2016-12-03, 2016-02-04, 2017-03-04, 2017-03-04, 20…
$ Distance      <dbl> 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2…
$ Postcode      <dbl> 3067, 3067, 3067, 3067, 3067, 3067, 3067, 3067, 306…
$ Bedroom2      <dbl> 2, 2, 3, 3, 3, 2, 4, 2, 1, 3, 2, 3, 2, 2, 1, 2, 3, …
$ Bathroom      <dbl> 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, …
$ Car           <dbl> 1, 0, 0, 1, 2, 0, 0, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, …
$ Landsize      <dbl> 202, 156, 134, 94, 120, 181, 245, 256, 0, 220, 0, 2…
$ BuildingArea  <dbl> NA, 79, 150, NA, 142, NA, 210, 107, NA, 75, NA, 190…
$ YearBuilt     <dbl> NA, 1900, 1900, NA, 2014, NA, 1910, 1890, NA, 1900,…
$ CouncilArea   <chr> "Yarra", "Yarra", "Yarra", "Yarra", "Yarra", "Yarra…
$ Lattitude     <dbl> -37.7996, -37.8079, -37.8093, -37.7969, -37.8072, -…
$ Longtitude    <dbl> 144.9984, 144.9934, 144.9944, 144.9969, 144.9941, 1…
$ Regionname    <chr> "Northern Metropolitan", "Northern Metropolitan", "…
$ Propertycount <dbl> 4019, 4019, 4019, 4019, 4019, 4019, 4019, 4019, 401…
print(paste("The start date is", min(melb_house$Date)))
[1] "The start date is 2016-01-28"
print(paste("The end date is", max(melb_house$Date)))
[1] "The end date is 2017-09-23"

The dataset contains 13580 properties with 21 attributes. It covers the period from 28/01/2016 to 23/09/2017.

Target Variable

As we want to predict the housing price, Price will be the target variable.

summary(melb_house$Price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  85000  650000  903000 1075684 1330000 9000000 
ggplot(melb_house, aes(Price)) +
  geom_histogram(fill = "deepskyblue3", binwidth = 100000)

The sale prices are right skewed, with some outliers in the highly priced range.

Feature Selection

Handle Missing Data

First we plot all the variables that have missing values.

plot_missing(melb_house, missing_only = TRUE)

There are 4 variables with missing values. The BuildingArea has almost half of the values are NAs, therefore it will be removed in the analysis. For the variable YearBuilt, there are 39.58% NAs. Because this variable may imply the house’s condition, we will replace the NAs with Not Available and try binning the others into groups. Lets take a look at the quantile of those values:

quantile(melb_house$YearBuilt, na.rm = TRUE)
  0%  25%  50%  75% 100% 
1196 1940 1970 1999 2018 

There seems to be really old properties dating back to the year 1196. Looking at the quantile we propose 4 approximately similar groups of house based on age: Older than 80 years, 50-80 years, 20-50 years, Less than 20 years.

melb_house_prep <- melb_house %>% 
  mutate(Age = if_else(is.na(YearBuilt), "Not Available"
           , if_else(YearBuilt < 1937, "Older than 80 years" 
                       , if_else(YearBuilt < 1967, "50-80 years"
                               , if_else(YearBuilt < 1997, "20-50 years"
                                       , "Less than 20 years"))))
         )
ggplot(melb_house_prep, aes(Age)) +
  geom_bar(fill = "deepskyblue3")

For the Car variable, there are only 0.46% NAs where the number of car slots are not informed. We will consider them as no parking slots.

melb_house_prep$Car[is.na(melb_house_prep$Car)] <- 0
ggplot(melb_house_prep, aes(Car)) +
  geom_histogram(fill = "deepskyblue3", binwidth = 0.5) +
  scale_x_continuous(breaks = seq(0, 10, 1))

For CouncilArea, there are other variables that refer to location, such as Suburb and Postcode. Let us consider their correlation using chi-squared test and Cramer’s V value:

melb_house$Suburb <- factor(melb_house$Suburb)
melb_house$Postcode <- factor(melb_house$Postcode)
df <- melb_house %>% select(Suburb, Postcode, CouncilArea)
df$CouncilArea <- df$CouncilArea %>% replace_na("Not Available")
df$CouncilArea <- factor(df$CouncilArea)

# function to get chi square p value and Cramers V
f <- function(x,y) {
  tbl <- df %>% select(x,y) %>% table()
  chisq_pval <- round(chisq.test(tbl)$p.value, 4)
  cramV <- round(cramersV(tbl), 4) 
  data.frame(x, y, chisq_pval, cramV) }

# create unique combinations of column names
# sorting will help getting a better plot (upper triangular)
df_comb <- data.frame(t(combn(sort(names(df)), 2)), stringsAsFactors = F)

# apply function to each variable combination
df_res <- map2_df(df_comb$X1, df_comb$X2, f)
df_res %>%
  ggplot(aes(x, y, fill = chisq_pval))+
  geom_tile()+
  geom_text(aes(x,y,label=cramV))+
  scale_fill_gradient(low="red", high="yellow")+
  theme_classic()

There are strong correlations between each of the 3 categorical variables based on the high Cramer’s V and the p-value of chi-squared test close to 0. Therefore, we will only use Suburb in the prediction model and drop the other 2 variables as there will be little information loss. For the same reason, Address, Lattitude and Longtitude can also be dropped as they only contain unique information about location.

Transformation of Date variable

The date that the house is sold could be of some values, even though it is not really informative with just the exact date. We will separate the variable Date into MonthYear indicating the month and the year it is in.

melb_house_prep <- melb_house_prep %>% 
  mutate(MonthYear = format(as.Date(Date), "%Y-%b"))
melb_house_prep$Suburb <- factor(melb_house_prep$Suburb)
melb_house_prep$Type <- factor(melb_house_prep$Type)
melb_house_prep$Method <- factor(melb_house_prep$Method)
melb_house_prep$SellerG <- factor(melb_house_prep$SellerG)
melb_house_prep$Regionname <- factor(melb_house_prep$Regionname)
melb_house_prep$Age <- factor(melb_house_prep$Age)
# Order month name
monthlevel <- paste0(rep(2016:2017, each = 12), "-", month.abb)
melb_house_prep$MonthYear <- factor(melb_house_prep$MonthYear
                                    , levels = monthlevel)

Correlation of continuous variables

Next we will check the correlation plot. There are 7 continuous predictor variables remained. Let us check if they are correlated with Price sold.

# Reduce the variables in the dataset
melb_house_prep <- melb_house_prep %>% 
  select(Price, Suburb, Rooms, Type, Method, SellerG, Distance, Bedroom2
         , Bathroom, Car, Landsize, Regionname, Propertycount, Age, MonthYear)
# Index vector numeric variables
numericVars <- which(sapply(melb_house_prep, is.numeric))
numericVarNames <- names(numericVars) #saving names vector for use later on
# Correlations of all numeric variables
all_numVar <- melb_house_prep[, numericVars]
cor_numVar <- cor(all_numVar, use = "pairwise.complete.obs")
corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt")

The number of rooms, bedrooms and bathrooms seems to be correlated with the price of the house. Rooms and Bedrooms are also highly correlated, so we can drop Bedrooms in the modelling process. We will also omit PropertyCount because it has an one on one representation of each Suburb.

melb_house_prep <- melb_house_prep %>%
  select(-Bedroom2, -Propertycount)

Data Pre-processing

After transforming and reducing the number of predictors:

newnumericVars <- melb_house_prep %>% 
  select(where(is.numeric))
nonnumericVars <- melb_house_prep %>% 
  select(!where(is.numeric))
cat('There are', length(newnumericVars) - 1, 'continuous predictor variables'
    , 'and', length(nonnumericVars), 'categorical predictor variables.')
There are 5 continuous predictor variables and 7 categorical predictor variables.

Continuous variables

plot_histogram(newnumericVars, ncol = 3L)

The continous variables need to be normalized to alleviate skewness. The histogram shows some abnormal pattern in Landsize. A better look at its boxplot shows an extreme outlier. We shall remove that instance for better model prediction.

ggplot(melb_house_prep, aes(Landsize)) +
  geom_boxplot()+
  coord_flip()

melb_house_prep <- melb_house_prep[-which(melb_house_prep$Landsize == max(melb_house_prep$Landsize)), ]

Categorical variables

Now looking at the categorical variables, Suburb has 314 categories and SellerG has 268 categories, which will need binning.

# Plot the categorical variables
plot_bar(nonnumericVars, ncol = 3L)
2 columns ignored with more than 50 categories.
Suburb: 314 categories
SellerG: 268 categories

The contingency table for MonthYear shows that there are very few observations in 2016-Jan and 2016-Feb, and there were even no house sale in 2016-Mar and 2017-Jan (there were no problem in the contingency table for other categorical variables).

table(melb_house_prep$MonthYear)

2016-Jan 2016-Feb 2016-Mar 2016-Apr 2016-May 2016-Jun 2016-Jul 2016-Aug 
       2       26        0      318      899      732      446      716 
2016-Sep 2016-Oct 2016-Nov 2016-Dec 2017-Jan 2017-Feb 2017-Mar 2017-Apr 
     925      551     1114      607        0      417      678      631 
2017-May 2017-Jun 2017-Jul 2017-Aug 2017-Sep 2017-Oct 2017-Nov 2017-Dec 
    1130     1098     1536      833      920        0        0        0 

To prevent overfitting, we will bin 2016-Jan and 2016-Feb together.

melb_house_prep <- melb_house_prep %>% 
  mutate(MonthYearNew = ifelse(MonthYear %in% c("2016-Jan", "2016-Feb"),
                               "2016-JanFeb", as.character(MonthYear))) %>% 
  select(-MonthYear)

Sampling and data partitioning

We split the data to begin the modelling process by performing stratified sampling to keep the similar target distribution. The training set will keep 70% of the observations and the test set keeps 30%.

# Stratified sampling with the rsample package
set.seed(1235)
split_strat  <- initial_split(melb_house_prep, prop = 0.7
                              , strata = "Price")
melb_train  <- training(split_strat)
melb_test   <- testing(split_strat)

Feature engineering

Examining the categorical variables, there are no near-zero variance variables.

# Check for near-zero variances
nearZeroVar(melb_train, saveMetrics = TRUE) %>% 
  tibble::rownames_to_column() %>% 
  filter(nzv)

In the next step we execute normalization to minimize the skewness of numerical variables. We use log transform for the target variable Price and Yeo-Johnson transformation for the predictors since the values are not strictly positive. We also standardize the numerical predictors so that they have the same magnitude in the models. Then, as said above, we will reduce the number of categories in Suburb and SellerG. All levels that are observed in less than 0.5% of the training sample will be collapsed into an Other category. We have to carefully choose a small percentage because of possible information loss. Lastly, we use dummy encoding to transform the categorical variables into numeric representations. The feature engineering steps in summary are: 1. Normalize numeric features. 2. Standardize numeric predictor features. 3. Binning Suburb and SellerG. 4. Dummy encode categorical features.

# Create recipe for pre-processing
blueprint_all <- recipe(Price ~ ., data = melb_train) %>%
  step_log(all_outcomes()) %>% 
  step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
  step_center(all_numeric(), -all_outcomes()) %>% 
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_other(Suburb, threshold = 0.005, other = "Other") %>% 
  step_other(SellerG, threshold = 0.005, other = "Other") %>% 
  step_dummy(all_nominal())
prepare <- prep(blueprint_all, training = melb_train)
baked_train <- bake(prepare, new_data = melb_train) # df after preprocess
#plot_bar(baked_train) # check the bar plot

Modelling

We are going to test 3 types of regression models on the data: a simple linear regression model, a regularized model and a light gradient boosting model. 10-fold cross-validation sampling is used for evaluation.

# Define a cross validation method
tr_control <- trainControl(method = "cv"
                           , number = 10
                           , savePredictions = TRUE)

Linear model and Regularized model

Using the caret package, the linear model and the regularized model can be pre-processed and trained in the same fashion.

set.seed(1235)  # for reproducibility
cv_lm <- train(
  blueprint_all
  , data = melb_train
  , method = "lm"
  , trControl = tr_control
)

The regularized model, or elastic net, has 2 hyperparameters alpha and lambda. We set tunelength = 10 to do a grid search of the 10 default values of each parameter.

set.seed(1235)
cv_en <- train(
  blueprint_all
  , data = melb_train
  , method = "glmnet"
  , trControl = tr_control
  , tuneLength = 10
)

Based on the plot of the performance measures, there are no significant differences between the linear model and the regularized model. In this case, the linear model might be a better choice because of its interpretability.

resamps <- resamples(list(
  LM = cv_lm
  , GLMnet = cv_en
))
summary(resamps)

Call:
summary.resamples(object = resamps)

Models: LM, GLMnet 
Number of resamples: 10 

MAE 
            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
LM     0.1821066 0.1872720 0.1905543 0.1901902 0.1937587 0.1979388    0
GLMnet 0.1821618 0.1874446 0.1906834 0.1902476 0.1938839 0.1979971    0

RMSE 
            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
LM     0.2345404 0.2452400 0.2470586 0.2498114 0.2544200 0.2652404    0
GLMnet 0.2345297 0.2454189 0.2470686 0.2498076 0.2543727 0.2650484    0

Rsquared 
            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
LM     0.7439817 0.7734255 0.7778464 0.7780264 0.7880779 0.8061021    0
GLMnet 0.7443194 0.7732747 0.7776512 0.7780218 0.7878595 0.8061277    0
trellis.par.set(caretTheme())
Warning: Note: The default device has been opened to honour attempt to modify trellis settings
bwplot(resamps)

The RMSE is after log transformation of the target variable. Reversing it to normal scale can help with practical reasoning.

rmset <- function(model){
  y = log(melb_train$Price)
  y.pred = predict(model, melb_train)
  y <- exp(y)
  y.pred <- exp(y.pred)
  return(sqrt(mean((y - y.pred)^2)))
}

paste("The RMSE of the linear model is:", rmset(cv_lm))
[1] "The RMSE of the linear model is: 346096.995987272"

Light gradient boosting machine

Gradient Boosting Machine is a modern machine learning technique that makes use of an ensemble of weak learners, typically decision trees. Light GBM (by Microsoft) is a high-performance framework that offers significantly fast training speed and accuracy thanks to leaf-wise growth and other features. To fit the model, first we need to convert the training data into a matrix that is supported by lightgbm package.


# Convert into lgb dataset
new_melb <- lgb.convert_with_rules(melb_train %>% select(-Price))
train_label <- melb_train[, "Price"]
train_matrix <- lgb.Dataset(data = as.matrix(new_melb$data)
                            , label = as.matrix(train_label))

There are a variety of hyperparameters in lightgbm. We will perform tuning for 4 important hyperparameters: 1. num_leaves: the maximum number of leaves that each learners has. 2. max_depth: the maximum depth of each trained tree. 3. bagging_fraction: the percentage of rows used per iteration. 4. feature_fraction: the percentage of random subset of features per iteration.

Hyperparameter tuning will be carried out by Bayesian Optimization, which uses the information from prior model evaluations to guide future parameter searches. In this method, the prior response distribution is iteratively updated based on our best guess of where the best parameters are. The package used is ParBayesianOptimization.

# Define scoring function for BayesOpt
scoringFunction <- function(num_leaves, max_depth
                            , bagging_fraction, feature_fraction) {
  
  Pars <- list(
    objective = "regression"
    , metric = "rmse"
    , num_iterations = 1000
    , early_stopping_rounds = 50
    , learning_rate = 0.01
    , num_leaves = num_leaves
    , max_depth = max_depth
    , bagging_fraction = bagging_fraction
    , bagging_freq = 1
    , feature_fraction = feature_fraction
  )
  
  lgbcv <- lgb.cv(
    params = Pars
    , data = train_matrix
    , nfold = 5L
    , stratified = FALSE
  )
  
  return(list(Score = -1*lgbcv$best_score
              , nrounds = lgbcv$best_iter
  )
  )
}

# Define the search boundaries
bounds <- list(
  num_leaves = c(5L, 30L)
  , max_depth = c(2L, 5L)
  , bagging_fraction = c(0.4, 1)
  , feature_fraction = c(0.4, 1)
)

We define a scoring function to optimize. Then, we set the search boundaries for each hyperparameter. BayesOpt then try find parameters that maximize the scoring function based on some initial scoring.

start.time <- Sys.time()
set.seed(1235)

optObj <- bayesOpt(
  FUN = scoringFunction
  , bounds = bounds
  , initPoints = 5
  , iters.n = 20
  , iters.k = 1
)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

The results of 20 iterations are as below.

optObj$scoreSummary

The best tuned hyperparameters are num_leaves = 21, max_depth = 5, bagging_fraction = 0.741 and feature_fraction = 0.658.

getBestPars(optObj)
$num_leaves
[1] 21

$max_depth
[1] 5

$bagging_fraction
[1] 0.740829

$feature_fraction
[1] 0.6581308

The RMSE of the Light GBM model is better than the Linear model. For prediction purpose we proceed with the Light GBM model with the best hyperparameters, although we can choose the linear model if the goal is interpretability.

-1*max(optObj$scoreSummary$Score)
[1] 301154.3

Final model performance

We use the chosen hyperparameters to train a Light GBM model with higher iterations.


# Hyperparameter
params <- list(
  objective = "regression"
  , metric = "RMSE"
  , num_iterations = 5000
  , num_threads = 2
  , learning_rate = 0.01
  , num_leaves = 21
  , max_depth = 5
  , seed = 1235
  , bagging_fraction = 0.741
  , bagging_freq = 1
  , feature_fraction = 0.658
)

# Train lgb model
lgb_fin <- lgb.train(params = params, data = train_matrix)

Applying the final model to the unseen test set yields the following results:

# Convert test set into lgb dataset
new_melb_test <- lgb.convert_with_rules(melb_test %>% select(-Price))
test_label <- melb_test$Price
# Predict test data
preds <- predict(lgb_fin, data = as.matrix(new_melb_test$data))

# Accuracy check
mse <- mean((test_label - preds)^2)
mae <- caret::MAE(test_label, preds)
rmse <- caret::RMSE(test_label, preds)
rsq <- function (x, y) cor(x, y) ^ 2
r_squared <- rsq(test_label, preds)

cat("MSE: ", mse, "\nMAE: ", mae, "\nRMSE: ", rmse, "\nR-squared: ", r_squared)
MSE:  76403505601 
MAE:  167015.6 
RMSE:  276411.8 
R-squared:  0.8018303

The plot between actual and predicted values is as follow:

df_visual <- data.frame(test_label, preds)
df_visual$observations <- 1:nrow(df_visual)
ggplot(data = df_visual) + 
  geom_line(aes(x = observations, y = test_label, color = 'test')) + 
  geom_line(aes(x = observations, y = preds, color = 'preds')) +
  ggtitle("Melbourne housing test data prediction") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ylab('Price')

By plotting the feature importance chart of the model, we can see that Distance has the biggest effect on predicting Price. Other important features in the top 5 are Regionname, Rooms, Landsize and Bathroom. It seems like price is affected the most by the location of the house (closeness to the central business district) and the size of the house (rooms, landsize, bathroom).

tree_imp <- lgb.importance(lgb_fin, percentage = TRUE)
lgb.plot.importance(tree_imp, top_n = 10L, measure = "Gain")

Conclusion

The kernel has shown steps of applying advanced machine learning techniques to a regression problem, particularly predicting Melbourne house price. After getting the dataset, exploration and feature engineering should be done thoroughly to help with model building. The Light Gradient Boosting model (Light GBM) performs the best at prediction and shows some notable factors that can affect pricing a house. This framework can be applied to other data problems, although more feature availability or broader hyperparameter tuning can help increase performance.

---
title: "Melbourne housing price prediction"
author: "Phong Viet Nguyen"
date: "2022-06-12"
output:
  html_notebook:
    toc: yes
---

```{r Load packages, echo=FALSE}
pacman::p_load(
  "tidyverse"      # data manipulation
  , "caret"        # model training
  , "recipes"      # pre-processing
  , "rsample"      # data splitting
  , "DataExplorer" # EDA
  , "corrplot"     # correlation plot
  , "glmnet"       # elastic net modelling
  , "MLmetrics"    # model metrics
  , "vip"          # variable importance
  , "pdp"          # partial dependence plot
  , "forcats"
  , "lightgbm"
  , "lsr"         # for chi squared function
  , "ParBayesianOptimization"
)
```


# Introduction

This kernel's goal is to establish an advanced regression model to predict housing price in Melbourne, Australia. The data is a snapshot in September 2017 of [Melbourne housing clearance data]("https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market") created by Tony Pino. A well-tuned model can help discover a reasonably priced property.

# Data Exploration

## Data Structure

```{r Load data, warning=FALSE, message=FALSE}
melb_house <- read_csv("melb_data.csv", na = "")
melb_house$Date <- as.Date(melb_house$Date, format = "%d/%m/%Y") # format date column
glimpse(melb_house)
print(paste("The start date is", min(melb_house$Date)))
print(paste("The end date is", max(melb_house$Date)))

```
The dataset contains 13580 properties with 21 attributes. It covers the period from 28/01/2016 to 23/09/2017.

## Target Variable

As we want to predict the housing price, *Price* will be the target variable.

```{r Target variable summary}
summary(melb_house$Price)
```

```{r Target variable plot}
ggplot(melb_house, aes(Price)) +
  geom_histogram(fill = "deepskyblue3", binwidth = 100000)
```

The sale prices are right skewed, with some outliers in the highly priced range. 

# Feature Selection

## Handle Missing Data

First we plot all the variables that have missing values.

```{r Missing values}
plot_missing(melb_house, missing_only = TRUE)
```

There are 4 variables with missing values. The *BuildingArea* has almost half of the values are NAs, therefore it will be removed in the analysis. For the variable *YearBuilt*, there are 39.58% NAs. Because this variable may imply the house's condition, we will replace the NAs with *Not Available* and try binning the others into groups. Lets take a look at the quantile of those values:

```{r YearBuilt quantile}
quantile(melb_house$YearBuilt, na.rm = TRUE)
```
There seems to be really old properties dating back to the year 1196. Looking at the quantile we propose 4 approximately similar groups of house based on age:  *Older than 80 years*, *50-80 years*, *20-50 years*, *Less than 20 years*.

```{r Lumping data into Age}
melb_house_prep <- melb_house %>% 
  mutate(Age = if_else(is.na(YearBuilt), "Not Available"
           , if_else(YearBuilt < 1937, "Older than 80 years" 
                       , if_else(YearBuilt < 1967, "50-80 years"
                               , if_else(YearBuilt < 1997, "20-50 years"
                                       , "Less than 20 years"))))
         )
ggplot(melb_house_prep, aes(Age)) +
  geom_bar(fill = "deepskyblue3")
```
For the *Car* variable, there are only 0.46% NAs where the number of car slots are not informed. We will consider them as no parking slots.

```{r Car}
melb_house_prep$Car[is.na(melb_house_prep$Car)] <- 0
ggplot(melb_house_prep, aes(Car)) +
  geom_histogram(fill = "deepskyblue3", binwidth = 0.5) +
  scale_x_continuous(breaks = seq(0, 10, 1))
```
For *CouncilArea*, there are other variables that refer to location, such as *Suburb* and *Postcode*. Let us consider their correlation using chi-squared test and Cramer's V value:

```{r Correlation test Suburb, CouncilArea, Postcode, warning=FALSE}
melb_house$Suburb <- factor(melb_house$Suburb)
melb_house$Postcode <- factor(melb_house$Postcode)
df <- melb_house %>% select(Suburb, Postcode, CouncilArea)
df$CouncilArea <- df$CouncilArea %>% replace_na("Not Available")
df$CouncilArea <- factor(df$CouncilArea)

# function to get chi square p value and Cramers V
f <- function(x,y) {
  tbl <- df %>% select(x,y) %>% table()
  chisq_pval <- round(chisq.test(tbl)$p.value, 4)
  cramV <- round(cramersV(tbl), 4) 
  data.frame(x, y, chisq_pval, cramV) }

# create unique combinations of column names
# sorting will help getting a better plot (upper triangular)
df_comb <- data.frame(t(combn(sort(names(df)), 2)), stringsAsFactors = F)

# apply function to each variable combination
df_res <- map2_df(df_comb$X1, df_comb$X2, f)
```

```{r Plot correlation results}
df_res %>%
  ggplot(aes(x, y, fill = chisq_pval))+
  geom_tile()+
  geom_text(aes(x,y,label=cramV))+
  scale_fill_gradient(low="red", high="yellow")+
  theme_classic()
```


There are strong correlations between each of the 3 categorical variables based on the high Cramer's V and the p-value of chi-squared test close to 0. Therefore, we will only use *Suburb* in the prediction model and drop the other 2 variables as there will be little information loss. For the same reason, *Address*, *Lattitude* and *Longtitude* can also be dropped as they only contain unique information about location.

## Transformation of Date variable

The date that the house is sold could be of some values, even though it is not really informative with just the exact date. We will separate the variable *Date* into *MonthYear* indicating the month and the year it is in.

```{r Change date to month and year}
melb_house_prep <- melb_house_prep %>% 
  mutate(MonthYear = format(as.Date(Date), "%Y-%b"))
```

```{r Convert all categorical variables into factor}
melb_house_prep$Suburb <- factor(melb_house_prep$Suburb)
melb_house_prep$Type <- factor(melb_house_prep$Type)
melb_house_prep$Method <- factor(melb_house_prep$Method)
melb_house_prep$SellerG <- factor(melb_house_prep$SellerG)
melb_house_prep$Regionname <- factor(melb_house_prep$Regionname)
melb_house_prep$Age <- factor(melb_house_prep$Age)
# Order month name
monthlevel <- paste0(rep(2016:2017, each = 12), "-", month.abb)
melb_house_prep$MonthYear <- factor(melb_house_prep$MonthYear
                                    , levels = monthlevel)
```

## Correlation of continuous variables

Next we will check the correlation plot. There are 7 continuous predictor variables remained. Let us check if they are correlated with *Price* sold.

```{r Correlation of numeric variables}
# Reduce the variables in the dataset
melb_house_prep <- melb_house_prep %>% 
  select(Price, Suburb, Rooms, Type, Method, SellerG, Distance, Bedroom2
         , Bathroom, Car, Landsize, Regionname, Propertycount, Age, MonthYear)
# Index vector numeric variables
numericVars <- which(sapply(melb_house_prep, is.numeric))
numericVarNames <- names(numericVars) #saving names vector for use later on
# Correlations of all numeric variables
all_numVar <- melb_house_prep[, numericVars]
cor_numVar <- cor(all_numVar, use = "pairwise.complete.obs")
corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt")
```
The number of rooms, bedrooms and bathrooms seems to be correlated with the price of the house. Rooms and Bedrooms are also highly correlated, so we can drop Bedrooms in the modelling process. We will also omit *PropertyCount* because it has an one on one representation of each *Suburb*.

```{r Drop bedroom}
melb_house_prep <- melb_house_prep %>%
  select(-Bedroom2, -Propertycount)
```


# Data Pre-processing

After transforming and reducing the number of predictors:

```{r Number of predictors}
newnumericVars <- melb_house_prep %>% 
  select(where(is.numeric))
nonnumericVars <- melb_house_prep %>% 
  select(!where(is.numeric))
cat('There are', length(newnumericVars) - 1, 'continuous predictor variables'
    , 'and', length(nonnumericVars), 'categorical predictor variables.')
```
## Continuous variables

```{r Histogram of numeric variables}
plot_histogram(newnumericVars, ncol = 3L)
```
The continous variables need to be normalized to alleviate skewness. The histogram shows some abnormal pattern in *Landsize*. A better look at its boxplot shows an extreme outlier. We shall remove that instance for better model prediction.

```{r Landsize}
ggplot(melb_house_prep, aes(Landsize)) +
  geom_boxplot()+
  coord_flip()
```
```{r Remove outlier}
melb_house_prep <- melb_house_prep[-which(melb_house_prep$Landsize == max(melb_house_prep$Landsize)), ]
```

## Categorical variables

Now looking at the categorical variables, *Suburb* has 314 categories and *SellerG* has 268 categories, which will need binning.

```{r Bar plot of categorical variables}
# Plot the categorical variables
plot_bar(nonnumericVars, ncol = 3L)
```
The contingency table for *MonthYear* shows that there are very few observations in 2016-Jan and 2016-Feb, and there were even no house sale in 2016-Mar and 2017-Jan (there were no problem in the contingency table for other categorical variables).

```{r MonthYear table}
table(melb_house_prep$MonthYear)
```
To prevent overfitting, we will bin 2016-Jan and 2016-Feb together.

```{r Bin MonthYear}
melb_house_prep <- melb_house_prep %>% 
  mutate(MonthYearNew = ifelse(MonthYear %in% c("2016-Jan", "2016-Feb"),
                               "2016-JanFeb", as.character(MonthYear))) %>% 
  select(-MonthYear)
```

## Sampling and data partitioning

We split the data to begin the modelling process by performing stratified sampling to keep the similar target distribution. The training set will keep 70% of the observations and the test set keeps 30%.

```{r Split the data}
# Stratified sampling with the rsample package
set.seed(1235)
split_strat  <- initial_split(melb_house_prep, prop = 0.7
                              , strata = "Price")
melb_train  <- training(split_strat)
melb_test   <- testing(split_strat)
```

## Feature engineering

Examining the categorical variables, there are no near-zero variance variables.
```{r Nearzero variance check}
# Check for near-zero variances
nearZeroVar(melb_train, saveMetrics = TRUE) %>% 
  tibble::rownames_to_column() %>% 
  filter(nzv)
```

In the next step we execute normalization to minimize the skewness of numerical variables. We use log transform for the target variable *Price* and Yeo-Johnson transformation for the predictors since the values are not strictly positive. We also standardize the numerical predictors so that they have the same magnitude in the models. Then, as said above, we will reduce the number of categories in *Suburb* and *SellerG*. All levels that are observed in less than 0.5% of the training sample will be collapsed into an *Other* category. We have to carefully choose a small percentage because of possible information loss. Lastly, we use dummy encoding to transform the categorical variables into numeric representations. The feature engineering steps in summary are:
1. Normalize numeric features.
2. Standardize numeric predictor features.
3. Binning *Suburb* and *SellerG*.
4. Dummy encode categorical features.

```{r Preprocessing recipe}
# Create recipe for pre-processing
blueprint_all <- recipe(Price ~ ., data = melb_train) %>%
  step_log(all_outcomes()) %>% 
  step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
  step_center(all_numeric(), -all_outcomes()) %>% 
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_other(Suburb, threshold = 0.005, other = "Other") %>% 
  step_other(SellerG, threshold = 0.005, other = "Other") %>% 
  step_dummy(all_nominal())
prepare <- prep(blueprint_all, training = melb_train)
baked_train <- bake(prepare, new_data = melb_train) # df after preprocess
#plot_bar(baked_train) # check the bar plot
```

# Modelling

We are going to test 3 types of regression models on the data: a simple linear regression model, a regularized model and a light gradient boosting model. 10-fold cross-validation sampling is used for evaluation.

```{r Cross validation}
# Define a cross validation method
tr_control <- trainControl(method = "cv"
                           , number = 10
                           , savePredictions = TRUE)
```

## Linear model and Regularized model

Using the *caret* package, the linear model and the regularized model can be pre-processed and trained in the same fashion.

```{r Linear regression}
set.seed(1235)  # for reproducibility
cv_lm <- train(
  blueprint_all
  , data = melb_train
  , method = "lm"
  , trControl = tr_control
)
```

The regularized model, or elastic net, has 2 hyperparameters alpha and lambda. We set *tunelength = 10* to do a grid search of the 10 default values of each parameter.

```{r Elastic net}
set.seed(1235)
cv_en <- train(
  blueprint_all
  , data = melb_train
  , method = "glmnet"
  , trControl = tr_control
  , tuneLength = 10
)
```

Based on the plot of the performance measures, there are no significant differences between the linear model and the regularized model. In this case, the linear model might be a better choice because of its interpretability.

```{r Performance measures}
resamps <- resamples(list(
  LM = cv_lm
  , GLMnet = cv_en
))
summary(resamps)

trellis.par.set(caretTheme())
bwplot(resamps)
```
The RMSE is after log transformation of the target variable. Reversing it to normal scale can help with practical reasoning.

```{r}
rmset <- function(model){
  y = log(melb_train$Price)
  y.pred = predict(model, melb_train)
  y <- exp(y)
  y.pred <- exp(y.pred)
  return(sqrt(mean((y - y.pred)^2)))
}

paste("The RMSE of the linear model is:", rmset(cv_lm))
```

## Light gradient boosting machine

Gradient Boosting Machine is a modern machine learning technique that makes use of an ensemble of weak learners, typically decision trees. Light GBM (by Microsoft) is a high-performance framework that offers significantly fast training speed and accuracy thanks to leaf-wise growth and other features. To fit the model, first we need to convert the training data into a matrix that is supported by *lightgbm* package.

```{r LightGBM data}

# Convert into lgb dataset
new_melb <- lgb.convert_with_rules(melb_train %>% select(-Price))
train_label <- melb_train[, "Price"]
train_matrix <- lgb.Dataset(data = as.matrix(new_melb$data)
                            , label = as.matrix(train_label))

```

There are a variety of hyperparameters in lightgbm. We will perform tuning for 4 important hyperparameters:
1. num_leaves: the maximum number of leaves that each learners has.
2. max_depth: the maximum depth of each trained tree.
3. bagging_fraction: the percentage of rows used per iteration.
4. feature_fraction: the percentage of random subset of features per iteration.

Hyperparameter tuning will be carried out by Bayesian Optimization, which uses the information from prior model evaluations to guide future parameter searches. In this method, the prior response distribution is iteratively updated based on our best guess of where the best parameters are. The package used is *ParBayesianOptimization*.

```{r Lgb tuning definition}
# Define scoring function for BayesOpt
scoringFunction <- function(num_leaves, max_depth
                            , bagging_fraction, feature_fraction) {
  
  Pars <- list(
    objective = "regression"
    , metric = "rmse"
    , num_iterations = 1000
    , early_stopping_rounds = 50
    , learning_rate = 0.01
    , num_leaves = num_leaves
    , max_depth = max_depth
    , bagging_fraction = bagging_fraction
    , bagging_freq = 1
    , feature_fraction = feature_fraction
  )
  
  lgbcv <- lgb.cv(
    params = Pars
    , data = train_matrix
    , nfold = 5L
    , stratified = FALSE
  )
  
  return(list(Score = -1*lgbcv$best_score
              , nrounds = lgbcv$best_iter
  )
  )
}

# Define the search boundaries
bounds <- list(
  num_leaves = c(5L, 30L)
  , max_depth = c(2L, 5L)
  , bagging_fraction = c(0.4, 1)
  , feature_fraction = c(0.4, 1)
)

```

We define a scoring function to optimize. Then, we set the search boundaries for each hyperparameter. BayesOpt then try find parameters that maximize the scoring function based on some initial scoring.

```{r BayesOpt tuning, results='hide'}
start.time <- Sys.time()
set.seed(1235)

optObj <- bayesOpt(
  FUN = scoringFunction
  , bounds = bounds
  , initPoints = 5
  , iters.n = 20
  , iters.k = 1
)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
```

The results of 20 iterations are as below.
```{r BayesOpt summary}
optObj$scoreSummary
```

The best tuned hyperparameters are num_leaves = 21, max_depth = 5, bagging_fraction = 0.741 and feature_fraction = 0.658.

```{r BayesOpt get best score}
getBestPars(optObj)
```
The RMSE of the Light GBM model is better than the Linear model. For prediction purpose we proceed with the Light GBM model with the best hyperparameters, although we can choose the linear model if the goal is interpretability.

```{r Lgb rmse}
-1*max(optObj$scoreSummary$Score)
```

## Final model performance

We use the chosen hyperparameters to train a Light GBM model with higher iterations.

```{r Lgb chosen model, results=FALSE, warning=FALSE, message=FALSE}
# Hyperparameter
params <- list(
  objective = "regression"
  , metric = "RMSE"
  , num_iterations = 5000
  , num_threads = 2
  , learning_rate = 0.01
  , num_leaves = 21
  , max_depth = 5
  , seed = 1235
  , bagging_fraction = 0.741
  , bagging_freq = 1
  , feature_fraction = 0.658
)

# Train lgb model
lgb_fin <- lgb.train(params = params, data = train_matrix)
```
Applying the final model to the unseen test set yields the following results:

```{r Prediction}
# Convert test set into lgb dataset
new_melb_test <- lgb.convert_with_rules(melb_test %>% select(-Price))
test_label <- melb_test$Price
# Predict test data
preds <- predict(lgb_fin, data = as.matrix(new_melb_test$data))

# Accuracy check
mse <- mean((test_label - preds)^2)
mae <- caret::MAE(test_label, preds)
rmse <- caret::RMSE(test_label, preds)
rsq <- function (x, y) cor(x, y) ^ 2
r_squared <- rsq(test_label, preds)

cat("MSE: ", mse, "\nMAE: ", mae, "\nRMSE: ", rmse, "\nR-squared: ", r_squared)
```
The plot between actual and predicted values is as follow:

```{r Plot prediction vs test data}
df_visual <- data.frame(test_label, preds)
df_visual$observations <- 1:nrow(df_visual)
ggplot(data = df_visual) + 
  geom_line(aes(x = observations, y = test_label, color = 'test')) + 
  geom_line(aes(x = observations, y = preds, color = 'preds')) +
  ggtitle("Melbourne housing test data prediction") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ylab('Price')
```

By plotting the feature importance chart of the model, we can see that *Distance* has the biggest effect on predicting *Price*. Other important features in the top 5 are *Regionname*, *Rooms*, *Landsize* and *Bathroom*. It seems like price is affected the most by the location of the house (closeness to the central business district) and the size of the house (rooms, landsize, bathroom).

```{r Feature importance}
tree_imp <- lgb.importance(lgb_fin, percentage = TRUE)
lgb.plot.importance(tree_imp, top_n = 10L, measure = "Gain")
```
# Conclusion

The kernel has shown steps of applying advanced machine learning techniques to a regression problem, particularly predicting Melbourne house price. After getting the dataset, exploration and feature engineering should be done thoroughly to help with model building. The Light Gradient Boosting model (Light GBM) performs the best at prediction and shows some notable factors that can affect pricing a house. This framework can be applied to other data problems, although more feature availability or broader hyperparameter tuning can help increase performance.
