Prepared by:

Name Matric Number
LOW SEH HONG S2178662
NICHOLAS TAN YU ZHE S2180436
YAP JIA XIAN S2150857
HARITHARAN S2157199
JINYIN JIA S2176584
MINGZHUO LI S2150565


Flat Sales Price Prediction in Singapore

Project Backgroud

In this project, we aim to develop a machine-learning model that can accurately predict the sales prices of flats in Singapore. By using historical data on flat sales prices and relevant information such as location, square footage, and the number of bedrooms, we aim to establish a relationship between these factors and sales prices. This model can be used by real estate agents, developers, and investors to make informed decisions about the buying and selling of properties in Singapore. By providing accurate predictions, this model can help to reduce the uncertainty and risk associated with the real estate market.

Project Objective

  1. Collect and pre-process historical data on the sales price of condominiums in Singapore, and related information.

  2. Develop a forecasting model, to create that can establish a relationship between the independent and dependent variables.

  3. Evaluate the performance of the model using appropriate metrics and compare the performance of different models.

Dataset

This dataset is the total resale transacted prices for flat in Singapore and is obtained from Singapore’s Public Data. The data is based on the date of registration for the resale transactions. It is managed by Housing and Development Board (HDB) and is regularly updated monthly.

The period of the chosen dataset is from January 2017 to January 2023, however we will select up to December 2022.

cat(paste(head(df$month, 1), "to", tail(df$month, 1)))
## 2017-01 to 2023-01

The dimension of the dataset is as follows:

dim(df)
## [1] 144500     11

The content and the structure of the dataset can be found under the table below:

Column Type Description
month character The month of transaction
town character The town of flat
flat_type character The flat type
block character The flat block number
street_name character The street name of the block
storey_range character The floor number of flat
floor_area_sqm numeric The floor area in sqm
flat_model character The flat model
lease_commence_date numeric The lease commence date in year
remaining_lease character The remaining lease of flat
resale_price numeric The resale price of flat

The summary of the dataset can be found below:

summary(df)
##     month               town            flat_type            block          
##  Length:144500      Length:144500      Length:144500      Length:144500     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  street_name        storey_range       floor_area_sqm    flat_model       
##  Length:144500      Length:144500      Min.   : 31.00   Length:144500     
##  Class :character   Class :character   1st Qu.: 82.00   Class :character  
##  Mode  :character   Mode  :character   Median : 94.00   Mode  :character  
##                                        Mean   : 97.64                     
##                                        3rd Qu.:113.00                     
##                                        Max.   :249.00                     
##  lease_commence_date remaining_lease     resale_price    
##  Min.   :1966        Length:144500      Min.   : 140000  
##  1st Qu.:1985        Class :character   1st Qu.: 355000  
##  Median :1996        Mode  :character   Median : 445000  
##  Mean   :1995                           Mean   : 477049  
##  3rd Qu.:2006                           3rd Qu.: 565000  
##  Max.   :2019                           Max.   :1418000

Data Cleaning

library(lubridate)
library(dplyr)
library(tidyr)
  1. Check Null value
    Need to solve the problem of missing data if there is any Null value. After the observation, no Null value in the dataset.
any(is.na(df))
## [1] FALSE
  1. Change month column to date format and split to month and year column
df$date <- ym(df$month)
df$year <- year(df$date)
df$month <- month(df$date)
  1. Change the town, flat_type, storey_range, flat_model as factor
df$town <- as.factor(df$town)
df$flat_type <- as.factor(df$flat_type)
df$storey_range <- as.factor(df$storey_range)
df$flat_model <- as.factor(df$flat_model)
  1. Drop columns, street_name, block and date
df <- df[,!names(df) %in% c("street_name", "date","block")]
  1. Subset dataframe We only need the year from 2017 - 2022 Dataset row from 144500 to 143421
df <- df[ which(df$year<=2022), ]
  1. Feature engineer the remaining lease into year format
remaining_lease_func <- function(df) {
  temp <- df %>% separate(remaining_lease, c('col1', 'col2', 'col3', 'col4'), fill = "right")
  temp[is.na(temp)] <- 0
  years_remaining <- as.numeric(temp$col1)
  months_remaining <- as.numeric(temp$col3)
  converted_months <- years_remaining*12
  total_years <- (converted_months + months_remaining) / 12
  return(total_years)
}
df$remaining_lease <- remaining_lease_func(df)
  1. Split the price into category
df <- df %>% mutate(price_category = cut(df$resale_price, breaks=3,labels = c("low","medium","high")))

Exploratory Data Analysis (EDA)

Before, we proceed to EDA, the purpose of the EDA is to answer the following questions:

  1. How correlates between the numerical variables?
  2. What is the relationship between the independent variables and dependent variable?
  3. Which model is the best model to predict the sales prices of flats?
  4. Which model is the best model to predict the sales category of flats?
library(ggcorrplot)
library(patchwork)
  1. Correlation between numeric features. From this, we can see that lease_commence_date and remaining_lease is highly correlated, so we will drop lease_commemce_date
subset_df <- df %>% select_if(is.numeric) %>% select(-c("resale_price"))
corr <- cor(subset_df)
ggcorrplot(corr, type="lower", lab=TRUE, ggtheme = ggplot2::theme_gray, colors = c("#636EFA", "white", "#EF553B"))

df <- df %>% select(-c("lease_commence_date"))
  1. Distribution of price_category & resale_price From the result, we can see that it is very imbalanced, and it is skew to the right. We will proceed to drop the outlier using IQR method.
df %>% count(price_category)
##   price_category      n
## 1            low 108214
## 2         medium  34269
## 3           high    938
ggplot(df, aes(x=resale_price)) + 
        geom_histogram(aes(y=after_stat(density)), fill="#636EFA", bins=30) + 
        geom_density(alpha=.2, fill="#636EFA")

After removing outlier of resale_price, the dataset observations from 143421 to 139795. The distribution of resale_price is shown below:

quantile_list <- quantile(df$resale_price)
q1 <- quantile_list[[2]]
q3 <- quantile_list[[4]]
iqr <- q3 - q1
lower_bound <- q1 - (1.5 * iqr)
upper_bound <- q3 + (1.5 * iqr)
df <- df %>% filter(resale_price >= lower_bound & resale_price <= upper_bound)
ggplot(df, aes(x=resale_price)) + 
  geom_histogram(aes(y=after_stat(density)), fill="#636EFA", bins=30) + 
  geom_density(alpha=.2, fill="#636EFA")

Now need to redefine the resale_price category.

df <- df %>% mutate(price_category = cut(df$resale_price, breaks=3, labels = c("low","medium","high")))
df %>% count(price_category)
##   price_category     n
## 1            low 47548
## 2         medium 72186
## 3           high 20061
  1. Relationship of different numerical variables vs resale_price.

In summary, the resale price of a property may be positively affected by larger floor area and negatively affected by shorter remaining lease.

# See the relationship of floor area vs price
p1 <- df %>% ggplot(aes(y=floor_area_sqm, x=resale_price, group=1)) + geom_boxplot(fill="#636EFA") + coord_flip()
p2 <- df %>% ggplot(aes(x=floor_area_sqm)) + geom_histogram(aes(y=after_stat(density)), fill="#636EFA", bins=30) + geom_density(alpha=.2, fill="#636EFA")
p3 <- df %>% ggplot(aes(y=floor_area_sqm, x=resale_price)) + geom_violin(fill="#636EFA") + coord_flip()

# See the relationship of remaining lease vs price
p4 <- df %>% ggplot(aes(y=remaining_lease, x=resale_price, group=1)) + geom_boxplot(fill="#EF553B") + coord_flip()
p5 <- df %>% ggplot(aes(x=remaining_lease)) + geom_histogram(aes(y=after_stat(density)), fill="#EF553B", bins=30) + geom_density(alpha=.2, fill="#636EFA")
p6 <- df %>% ggplot(aes(y=remaining_lease, x=resale_price)) + geom_violin(fill="#EF553B") + coord_flip()
(p1 + p2 + p3)/(p4 + p5 + p6)

  1. Relationship of different categorical variables vs resale_price
# See the relationship of town vs price
p1 <- df %>% ggplot(aes(x=reorder(town, resale_price), y=resale_price)) + geom_boxplot(fill="#636EFA") + 
          labs(y="resale_price", x="town", title="Relationship between town and resale price") + 
          theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p1

By comparing the box plots for different towns, if the box plots for towns in highly desirable locations have higher medians and larger interquartile ranges than the box plots for towns in less desirable locations, this suggests that properties in more desirable towns tend to have higher resale prices.

# See the relationship of flat type vs price
p2 <- df %>% ggplot(aes(x=reorder(flat_type, resale_price), y=resale_price)) + geom_boxplot(fill="#EF553B") + 
  labs(y="resale_price", x="flat_type", title="Relationship between flat type and resale price") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p2

By comparing the box plots for different flat types, if the box plots for flats with more bedrooms, bigger area or multi generation apartment have higher medians and larger interquartile ranges than the box plots for flats with less bedrooms, smaller area or no balcony, this suggests that those types of flats tend to have higher resale prices.

# See the relationship of storey range vs price
p3 <- df %>% ggplot(aes(x=reorder(storey_range, resale_price), y=resale_price)) + geom_boxplot(fill="#00CC96") + 
  labs(y="resale_price", x="storey_range", title="Relationship between storey range and resale price") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p3

By evaluating the field plots for extraordinary storey levels, if the field plots for residences on better storeys have better medians and large interquartile tiers than the field plots for residences on decrease storeys, this shows that residences on better storeys generally tend to have better resale costs.

# See the relationship of flat model vs price
p4 <- df %>% ggplot(aes(x=reorder(flat_model, resale_price), y=resale_price)) + geom_boxplot(fill="#AB63FA") + 
  labs(y="resale_price", x="flat_model", title="Relationship between flat model and resale price") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p4

Flat models that are newer, more modern and energy efficient tend to have higher resale prices than older or less modern flat models.

Modelling

library(caret)
library(randomForest)
library(nnet)
library(e1071)
library(gbm)
library(yardstick)
set.seed(1022)

Regression

regression_dataset <- df[,!names(df) %in% c("price_category")]
rs_price <- createDataPartition(regression_dataset$resale_price, p = .80, list = FALSE)
training <- regression_dataset[ rs_price,]
testing  <- regression_dataset[-rs_price,]

Random Forest

rf <- randomForest(resale_price ~ ., data = training)
pred <- predict(rf, testing)
mae <- mean(abs(testing$resale_price-pred))
cat(paste0("Mean Absolute Error: ", round(mae)))
## Mean Absolute Error: 25101

Support Vector Machine

svm <- svm(resale_price~., data=training)
pred <- predict(svm, testing)
mae <- mean(abs(testing$resale_price-pred))
cat(paste0("Mean Absolute Error: ", round(mae)))
## Mean Absolute Error: 30781

Gradient Boosting Machine

gbm <- gbm(resale_price ~ ., data = training)
pred <- predict(gbm, testing)
mae <- mean(abs(testing$resale_price-pred))
cat(paste0("Mean Absolute Error: ", round(mae)))
## Mean Absolute Error: 53404

Linear Regression

lr <- lm(resale_price ~ ., data = training)
pred <- predict(lr, testing)
mae <- mean(abs(testing$resale_price-pred))
cat(paste0("Mean Absolute Error: ", round(mae)))
## Mean Absolute Error: 46113

Classification

classification_dataset <- df[,!names(df) %in% c("resale_price")]
rs_cat <- createDataPartition(classification_dataset$price_category, p = .80, list = FALSE)
training <- classification_dataset[ rs_cat,]
testing  <- classification_dataset[-rs_cat,]

Random Forest

rf <- randomForest(price_category ~ ., data = training)
pred <- predict(rf, testing)
xtab <- table(pred, testing$price_category)
cm <- confusionMatrix(xtab)
cm2 <- conf_mat(cm$table, y_test, pred)
cat(paste0("Accuracy: ", round(cm$overall[[1]], 4)*100, "%"))
## Accuracy: 95.22%
autoplot(cm2, type="heatmap") + scale_fill_gradient(low="#D6EAF8", high="#2E86C1")

Support Vector Machine

svm <- svm(price_category ~ ., data = training)
pred <- predict(svm, testing)
xtab <- table(pred, testing$price_category)
cm <- confusionMatrix(xtab)
cm2 <- conf_mat(cm$table, y_test, pred)
cat(paste0("Accuracy: ", round(cm$overall[[1]], 4)*100, "%"))
## Accuracy: 87.93%
autoplot(cm2, type="heatmap") + scale_fill_gradient(low="#D6EAF8", high="#2E86C1")

Naive Bayes

nb <- naiveBayes(price_category ~ ., data = training,k = 1)
pred <- predict(nb, testing)
xtab <- table(pred, testing$price_category)
cm <- confusionMatrix(xtab)
cm2 <- conf_mat(cm$table, y_test, pred)
cat(paste0("Accuracy: ", round(cm$overall[[1]], 4)*100, "%"))
## Accuracy: 72.35%
autoplot(cm2, type="heatmap") + scale_fill_gradient(low="#D6EAF8", high="#2E86C1")

Logistic Regression

lr <- multinom(price_category ~ ., data = training)
pred <- predict(lr, testing)
xtab <- table(pred, testing$price_category)
cm <- confusionMatrix(xtab)
cm2 <- conf_mat(cm$table, y_test, pred)
cat(paste0("Accuracy: ", round(cm$overall[[1]], 4)*100, "%"))
## Accuracy: 80.78%
autoplot(cm2, type="heatmap") + scale_fill_gradient(low="#D6EAF8", high="#2E86C1")

Conclusion

In this project, we try to use data science tools to process, analyze and predict flat sale price data in Singapore. First, we introduced the dataset as a whole, including the size of the dataset, the type of data in each column, etc. Then we performed pre-processing operations such as dataset cleaning and segmentation to obtain a dataset suitable for the model. To better understand the characteristics of the dataset itself, we then performed exploratory data analysis including correlation analysis, etc. Finally, we modeled the problem as regression and classification, respectively.

The reason why we modeled the problem as regression and classification is that our targeted users might be from different countries, predicted result in numeric value is hard to understand due to different currencies. Based on the statement above, the sales category will be more user-friendly for them. Besides, being able to predict the sales category of flats can help real estate companies and individuals to make more informed decisions, manage their inventory and risks, and offer personalized customer service.

For regression, the random forest achieved the best regression results. As for the classification problem, Random Forest also achieved the best classification accuracy. In summary, this project has modeled the sales price of Singapore flat with the help of data science, and the methods and models used can compete with the data characteristics to predict the sales price effectively.