| Name | Matric Number |
|---|---|
| LOW SEH HONG | S2178662 |
| NICHOLAS TAN YU ZHE | S2180436 |
| YAP JIA XIAN | S2150857 |
| HARITHARAN | S2157199 |
| JINYIN JIA | S2176584 |
| MINGZHUO LI | S2150565 |
In this project, we aim to develop a machine-learning model that can accurately predict the sales prices of flats in Singapore. By using historical data on flat sales prices and relevant information such as location, square footage, and the number of bedrooms, we aim to establish a relationship between these factors and sales prices. This model can be used by real estate agents, developers, and investors to make informed decisions about the buying and selling of properties in Singapore. By providing accurate predictions, this model can help to reduce the uncertainty and risk associated with the real estate market.
Collect and pre-process historical data on the sales price of condominiums in Singapore, and related information.
Develop a forecasting model, to create that can establish a relationship between the independent and dependent variables.
Evaluate the performance of the model using appropriate metrics and compare the performance of different models.
This dataset is the total resale transacted prices for flat in Singapore and is obtained from Singapore’s Public Data. The data is based on the date of registration for the resale transactions. It is managed by Housing and Development Board (HDB) and is regularly updated monthly.
The period of the chosen dataset is from January 2017 to January 2023, however we will select up to December 2022.
cat(paste(head(df$month, 1), "to", tail(df$month, 1)))
## 2017-01 to 2023-01
The dimension of the dataset is as follows:
dim(df)
## [1] 144500 11
The content and the structure of the dataset can be found under the table below:
| Column | Type | Description |
|---|---|---|
| month | character | The month of transaction |
| town | character | The town of flat |
| flat_type | character | The flat type |
| block | character | The flat block number |
| street_name | character | The street name of the block |
| storey_range | character | The floor number of flat |
| floor_area_sqm | numeric | The floor area in sqm |
| flat_model | character | The flat model |
| lease_commence_date | numeric | The lease commence date in year |
| remaining_lease | character | The remaining lease of flat |
| resale_price | numeric | The resale price of flat |
The summary of the dataset can be found below:
summary(df)
## month town flat_type block
## Length:144500 Length:144500 Length:144500 Length:144500
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## street_name storey_range floor_area_sqm flat_model
## Length:144500 Length:144500 Min. : 31.00 Length:144500
## Class :character Class :character 1st Qu.: 82.00 Class :character
## Mode :character Mode :character Median : 94.00 Mode :character
## Mean : 97.64
## 3rd Qu.:113.00
## Max. :249.00
## lease_commence_date remaining_lease resale_price
## Min. :1966 Length:144500 Min. : 140000
## 1st Qu.:1985 Class :character 1st Qu.: 355000
## Median :1996 Mode :character Median : 445000
## Mean :1995 Mean : 477049
## 3rd Qu.:2006 3rd Qu.: 565000
## Max. :2019 Max. :1418000
library(lubridate)
library(dplyr)
library(tidyr)
any(is.na(df))
## [1] FALSE
df$date <- ym(df$month)
df$year <- year(df$date)
df$month <- month(df$date)
df$town <- as.factor(df$town)
df$flat_type <- as.factor(df$flat_type)
df$storey_range <- as.factor(df$storey_range)
df$flat_model <- as.factor(df$flat_model)
df <- df[,!names(df) %in% c("street_name", "date","block")]
df <- df[ which(df$year<=2022), ]
remaining_lease_func <- function(df) {
temp <- df %>% separate(remaining_lease, c('col1', 'col2', 'col3', 'col4'), fill = "right")
temp[is.na(temp)] <- 0
years_remaining <- as.numeric(temp$col1)
months_remaining <- as.numeric(temp$col3)
converted_months <- years_remaining*12
total_years <- (converted_months + months_remaining) / 12
return(total_years)
}
df$remaining_lease <- remaining_lease_func(df)
df <- df %>% mutate(price_category = cut(df$resale_price, breaks=3,labels = c("low","medium","high")))
Before, we proceed to EDA, the purpose of the EDA is to answer the following questions:
library(ggcorrplot)
library(patchwork)
lease_commence_date and remaining_lease is
highly correlated, so we will drop lease_commemce_datesubset_df <- df %>% select_if(is.numeric) %>% select(-c("resale_price"))
corr <- cor(subset_df)
ggcorrplot(corr, type="lower", lab=TRUE, ggtheme = ggplot2::theme_gray, colors = c("#636EFA", "white", "#EF553B"))
df <- df %>% select(-c("lease_commence_date"))
df %>% count(price_category)
## price_category n
## 1 low 108214
## 2 medium 34269
## 3 high 938
ggplot(df, aes(x=resale_price)) +
geom_histogram(aes(y=after_stat(density)), fill="#636EFA", bins=30) +
geom_density(alpha=.2, fill="#636EFA")
After removing outlier of resale_price, the dataset observations from 143421 to 139795. The distribution of resale_price is shown below:
quantile_list <- quantile(df$resale_price)
q1 <- quantile_list[[2]]
q3 <- quantile_list[[4]]
iqr <- q3 - q1
lower_bound <- q1 - (1.5 * iqr)
upper_bound <- q3 + (1.5 * iqr)
df <- df %>% filter(resale_price >= lower_bound & resale_price <= upper_bound)
ggplot(df, aes(x=resale_price)) +
geom_histogram(aes(y=after_stat(density)), fill="#636EFA", bins=30) +
geom_density(alpha=.2, fill="#636EFA")
Now need to redefine the resale_price category.
df <- df %>% mutate(price_category = cut(df$resale_price, breaks=3, labels = c("low","medium","high")))
df %>% count(price_category)
## price_category n
## 1 low 47548
## 2 medium 72186
## 3 high 20061
In summary, the resale price of a property may be positively affected by larger floor area and negatively affected by shorter remaining lease.
# See the relationship of floor area vs price
p1 <- df %>% ggplot(aes(y=floor_area_sqm, x=resale_price, group=1)) + geom_boxplot(fill="#636EFA") + coord_flip()
p2 <- df %>% ggplot(aes(x=floor_area_sqm)) + geom_histogram(aes(y=after_stat(density)), fill="#636EFA", bins=30) + geom_density(alpha=.2, fill="#636EFA")
p3 <- df %>% ggplot(aes(y=floor_area_sqm, x=resale_price)) + geom_violin(fill="#636EFA") + coord_flip()
# See the relationship of remaining lease vs price
p4 <- df %>% ggplot(aes(y=remaining_lease, x=resale_price, group=1)) + geom_boxplot(fill="#EF553B") + coord_flip()
p5 <- df %>% ggplot(aes(x=remaining_lease)) + geom_histogram(aes(y=after_stat(density)), fill="#EF553B", bins=30) + geom_density(alpha=.2, fill="#636EFA")
p6 <- df %>% ggplot(aes(y=remaining_lease, x=resale_price)) + geom_violin(fill="#EF553B") + coord_flip()
(p1 + p2 + p3)/(p4 + p5 + p6)
# See the relationship of town vs price
p1 <- df %>% ggplot(aes(x=reorder(town, resale_price), y=resale_price)) + geom_boxplot(fill="#636EFA") +
labs(y="resale_price", x="town", title="Relationship between town and resale price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p1
By comparing the box plots for different towns, if the box plots for towns in highly desirable locations have higher medians and larger interquartile ranges than the box plots for towns in less desirable locations, this suggests that properties in more desirable towns tend to have higher resale prices.
# See the relationship of flat type vs price
p2 <- df %>% ggplot(aes(x=reorder(flat_type, resale_price), y=resale_price)) + geom_boxplot(fill="#EF553B") +
labs(y="resale_price", x="flat_type", title="Relationship between flat type and resale price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p2
By comparing the box plots for different flat types, if the box plots for flats with more bedrooms, bigger area or multi generation apartment have higher medians and larger interquartile ranges than the box plots for flats with less bedrooms, smaller area or no balcony, this suggests that those types of flats tend to have higher resale prices.
# See the relationship of storey range vs price
p3 <- df %>% ggplot(aes(x=reorder(storey_range, resale_price), y=resale_price)) + geom_boxplot(fill="#00CC96") +
labs(y="resale_price", x="storey_range", title="Relationship between storey range and resale price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p3
By evaluating the field plots for extraordinary storey levels, if the field plots for residences on better storeys have better medians and large interquartile tiers than the field plots for residences on decrease storeys, this shows that residences on better storeys generally tend to have better resale costs.
# See the relationship of flat model vs price
p4 <- df %>% ggplot(aes(x=reorder(flat_model, resale_price), y=resale_price)) + geom_boxplot(fill="#AB63FA") +
labs(y="resale_price", x="flat_model", title="Relationship between flat model and resale price") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
p4
Flat models that are newer, more modern and energy efficient tend to have higher resale prices than older or less modern flat models.
library(caret)
library(randomForest)
library(nnet)
library(e1071)
library(gbm)
library(yardstick)
set.seed(1022)
regression_dataset <- df[,!names(df) %in% c("price_category")]
rs_price <- createDataPartition(regression_dataset$resale_price, p = .80, list = FALSE)
training <- regression_dataset[ rs_price,]
testing <- regression_dataset[-rs_price,]
rf <- randomForest(resale_price ~ ., data = training)
pred <- predict(rf, testing)
mae <- mean(abs(testing$resale_price-pred))
cat(paste0("Mean Absolute Error: ", round(mae)))
## Mean Absolute Error: 25101
svm <- svm(resale_price~., data=training)
pred <- predict(svm, testing)
mae <- mean(abs(testing$resale_price-pred))
cat(paste0("Mean Absolute Error: ", round(mae)))
## Mean Absolute Error: 30781
gbm <- gbm(resale_price ~ ., data = training)
pred <- predict(gbm, testing)
mae <- mean(abs(testing$resale_price-pred))
cat(paste0("Mean Absolute Error: ", round(mae)))
## Mean Absolute Error: 53404
lr <- lm(resale_price ~ ., data = training)
pred <- predict(lr, testing)
mae <- mean(abs(testing$resale_price-pred))
cat(paste0("Mean Absolute Error: ", round(mae)))
## Mean Absolute Error: 46113
classification_dataset <- df[,!names(df) %in% c("resale_price")]
rs_cat <- createDataPartition(classification_dataset$price_category, p = .80, list = FALSE)
training <- classification_dataset[ rs_cat,]
testing <- classification_dataset[-rs_cat,]
rf <- randomForest(price_category ~ ., data = training)
pred <- predict(rf, testing)
xtab <- table(pred, testing$price_category)
cm <- confusionMatrix(xtab)
cm2 <- conf_mat(cm$table, y_test, pred)
cat(paste0("Accuracy: ", round(cm$overall[[1]], 4)*100, "%"))
## Accuracy: 95.22%
autoplot(cm2, type="heatmap") + scale_fill_gradient(low="#D6EAF8", high="#2E86C1")
svm <- svm(price_category ~ ., data = training)
pred <- predict(svm, testing)
xtab <- table(pred, testing$price_category)
cm <- confusionMatrix(xtab)
cm2 <- conf_mat(cm$table, y_test, pred)
cat(paste0("Accuracy: ", round(cm$overall[[1]], 4)*100, "%"))
## Accuracy: 87.93%
autoplot(cm2, type="heatmap") + scale_fill_gradient(low="#D6EAF8", high="#2E86C1")
nb <- naiveBayes(price_category ~ ., data = training,k = 1)
pred <- predict(nb, testing)
xtab <- table(pred, testing$price_category)
cm <- confusionMatrix(xtab)
cm2 <- conf_mat(cm$table, y_test, pred)
cat(paste0("Accuracy: ", round(cm$overall[[1]], 4)*100, "%"))
## Accuracy: 72.35%
autoplot(cm2, type="heatmap") + scale_fill_gradient(low="#D6EAF8", high="#2E86C1")
lr <- multinom(price_category ~ ., data = training)
pred <- predict(lr, testing)
xtab <- table(pred, testing$price_category)
cm <- confusionMatrix(xtab)
cm2 <- conf_mat(cm$table, y_test, pred)
cat(paste0("Accuracy: ", round(cm$overall[[1]], 4)*100, "%"))
## Accuracy: 80.78%
autoplot(cm2, type="heatmap") + scale_fill_gradient(low="#D6EAF8", high="#2E86C1")
In this project, we try to use data science tools to process, analyze and predict flat sale price data in Singapore. First, we introduced the dataset as a whole, including the size of the dataset, the type of data in each column, etc. Then we performed pre-processing operations such as dataset cleaning and segmentation to obtain a dataset suitable for the model. To better understand the characteristics of the dataset itself, we then performed exploratory data analysis including correlation analysis, etc. Finally, we modeled the problem as regression and classification, respectively.
The reason why we modeled the problem as regression and classification is that our targeted users might be from different countries, predicted result in numeric value is hard to understand due to different currencies. Based on the statement above, the sales category will be more user-friendly for them. Besides, being able to predict the sales category of flats can help real estate companies and individuals to make more informed decisions, manage their inventory and risks, and offer personalized customer service.
For regression, the random forest achieved the best regression results. As for the classification problem, Random Forest also achieved the best classification accuracy. In summary, this project has modeled the sales price of Singapore flat with the help of data science, and the methods and models used can compete with the data characteristics to predict the sales price effectively.