This R Markdown document is the complete report of the group project by Group 6 for Occurrence 1 of the WQD7004 course in Faculty of Computer Science and Information Technology, University of Malaya for Semester 2 of 2022/2023 Academic Session.

1 Introduction

Title:

This project aims to predict rental prices and classify the renovation levels of properties in Kuala Lumpur and Selangor, two of Malaysia’s most vibrant and populated regions. Malaysia’s rental market, particularly in these areas, has seen significant fluctuations due to various factors such as size, environment condition, and property characteristics. Accurate predictions of rental prices can provide critical insights to property investors, landlords, tenants, and policymakers, supporting informed decision-making and promoting a more balanced and fair rental market.

For the project, firstly, We will leverage advanced machine learning algorithms and data analytics techniques on the sample dataset collected from kaggle. Secondly, We will train three regression models to predict rental_price based on a comprehensive dataset that includes property characteristics, such as size, location, completion_year, property_type and so on. Next, as we want to explore classification problem as well, so based on the observation of Q1,Q3 on rental_price, we will generate a new classification target which is the level of renovation. Moreover is to build three classification models to predict renovation_level. Lastly, we will compare the performance of three regression models using r2 score and Mae score, and the performance of classification models using accuracy, precision, and f1-score.

Goal: Predict Rental Price and classify the renovation_level based on various variables

Objectives:

Research Questions

Dataset description: The data was scraped from mudah.my. We acquired the raw data from kaggle: https://www.kaggle.com/datasets/ariewijaya/rent-pricing-kuala-lumpur-malaysi

2. Main System

2.1 Load libraries and functions

source("function.R")
source("visualization.R")
library(tidyverse)
library(tidyr)
library(dplyr)
library(stringr)
library(lubridate)
library(readr)
library(magrittr)
library(ggplot2)
library(gridExtra)
library(caret)
library(missForest)
library(mice)
library(gbm)
library(randomForest)
library(reshape2)
library(rpart)
library(rpart.plot)
library(knitr)
library(scales)
library(e1071)
library(patchwork)
library(GGally)

2.2 General overview of the raw dataset

2.2.1 Load raw dataset

df <- read_csv("raw_rental mala.csv")
glimpse(df)
## Rows: 19,991
## Columns: 14
## $ ads_id                <dbl> 100323185, 100203973, 100323128, 100191767, 9702~
## $ prop_name             <chr> "The Hipster @ Taman Desa", "Segar Courts", "Pan~
## $ completion_year       <dbl> 2022, NA, NA, 2020, NA, NA, NA, 2018, 2014, NA, ~
## $ monthly_rent          <chr> "RM 4 200 per month", "RM 2 300 per month", "RM ~
## $ location              <chr> "Kuala Lumpur - Taman Desa", "Kuala Lumpur - Che~
## $ property_type         <chr> "Condominium", "Condominium", "Apartment", "Apar~
## $ rooms                 <dbl> 5, 3, 3, 2, 1, 3, 3, 1, 2, 3, 2, 4, 1, 3, 5, 3, ~
## $ parking               <dbl> 2, 1, NA, 1, 1, 1, 2, 1, 1, 1, NA, 2, 1, 1, NA, ~
## $ bathroom              <dbl> 6, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, ~
## $ size                  <chr> "1842 sq.ft.", "1170 sq.ft.", "650 sq.ft.", "743~
## $ furnished             <chr> "Fully Furnished", "Partially Furnished", "Fully~
## $ facilities            <chr> "Minimart, Gymnasium, Security, Playground, Swim~
## $ additional_facilities <chr> "Air-Cond, Cooking Allowed, Washing Machine", "A~
## $ region                <chr> "Kuala Lumpur", "Kuala Lumpur", "Kuala Lumpur", ~

It can be found that the raw dataset is very dirty. We first find the following issues:

  • ads_id and prop_name can be meanless variables.
  • completion_year, parking include large number of NA value.
  • monthly_rent wrongly assigned to chr type because of RM and per month description.
  • location is chr data, need further process.
  • property_type, is char data, need further transformation.
  • rooms, bathroom seems fine for first view.
  • size wrongly assigned to chr type because of sq.ft.
  • furnished is char data, need further transformation.
  • facilities and additional_facilities seems meanless, need further process.
  • region is char data, need further transformation.

2.3 Data Pre-processing

2.3.1 Remove duplicated data

# Remove duplicated values
dup1 <- sum(duplicated(df))
cat('The number of duplicated data from the dataset before processing is:',dup1)
## The number of duplicated data from the dataset before processing is: 8
df <- df[!duplicated(df), ]
dup2 <- sum(duplicated(df))
cat(' The number of duplicated data from the dataset after processing is:',dup2)
##  The number of duplicated data from the dataset after processing is: 0
  • Firstly, we check and delete the duplicated data from the raw dataset.

2.3.2 Removing ‘RM’ from monthly rent and rename the column

df <- df %>%
  mutate(monthly_rent = gsub("RM|per month|,", "", monthly_rent),
         monthly_rent = gsub("\\s+", "", monthly_rent),
         monthly_rent = as.numeric(monthly_rent))
colnames(df)[colnames(df) == "monthly_rent"] <- "monthly_rent_rm"
  • For monthly_rent, we remove the RM, per month, transform it as numeric and rename the column as monthly_rent_rm.

2.3.3 Dropping ‘sq.ft’ from size

df$size <- as.numeric(str_extract(df$size, "(.*?)(?= sq)"))
# rename column name as size_sqft
colnames(df)[colnames(df) == "size"] <- "size_sqft"
  • For size, we remove the sq.ft, transform it as numeric and and rename the column as size_sqft.

2.3.4 Extract near_by_ktm_lrt from location count with yes/no

df$location <- str_extract(df$location, "\\w+$")
  • For location, we remove the province name, such as KL and Selangor.
  • Further transformation will process after visualization.

2.3.5 Facilities and additional_facilities processing

# count the no of facilities 
df <- df %>%
  mutate(fac_count = str_count(facilities, ",") + 1,
         add_fac_count = str_count(additional_facilities, ",") + 1)
# generate a new column named nearby_ktm_lrt from aditional_facilities 
df$nearby_ktm_lrt <- sapply(df$additional_facilities, function(x) extract_near_ktm_lrt(x))
  • Facilities and additional_facilities, we count the number of facilities it included for each and create two new column as fac_count and add_fac_count.
  • Secondly, we think that if the house near LRT can be a important factor that affects monthly rent.
  • So, we create a new column named nearby_ktm_lrt and judgment the value yes/no based on LRT and KTM appeared in additional_facilities.

2.3.6 drop ads_id, prop_name, facilities, additional_facilities, location

df <- df[, !(names(df) %in% c('ads_id', 'prop_name', 'facilities', 'additional_facilities'))]

glimpse(df)
## Rows: 19,983
## Columns: 13
## $ completion_year <dbl> 2022, NA, NA, 2020, NA, NA, NA, 2018, 2014, NA, 2019, ~
## $ monthly_rent_rm <dbl> 4200, 2300, 1000, 1700, 1299, 1500, 2900, 1550, 1400, ~
## $ location        <chr> "Desa", "Cheras", "Desa", "Sentul", "Kiara", "Setapak"~
## $ property_type   <chr> "Condominium", "Condominium", "Apartment", "Apartment"~
## $ rooms           <dbl> 5, 3, 3, 2, 1, 3, 3, 1, 2, 3, 2, 4, 1, 3, 5, 3, 3, 2, ~
## $ parking         <dbl> 2, 1, NA, 1, 1, 1, 2, 1, 1, 1, NA, 2, 1, 1, NA, 1, 1, ~
## $ bathroom        <dbl> 6, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, ~
## $ size_sqft       <dbl> 1842, 1170, 650, 743, 494, 884, 982, 700, 750, 862, 86~
## $ furnished       <chr> "Fully Furnished", "Partially Furnished", "Fully Furni~
## $ region          <chr> "Kuala Lumpur", "Kuala Lumpur", "Kuala Lumpur", "Kuala~
## $ fac_count       <dbl> 10, 9, 4, 8, 11, 6, 10, 7, 5, 6, 7, 9, 13, 11, 3, 13, ~
## $ add_fac_count   <dbl> 3, 3, NA, 3, 1, 2, 4, 4, 4, 3, 2, 4, 4, 3, 3, 5, 5, 3,~
## $ nearby_ktm_lrt  <chr> "no", "yes", NA, "yes", "no", "yes", "yes", "yes", "ye~
  • Next, we drop the 4 useless columns ‘ads_id’, ‘prop_name’, ‘facilities’, ‘additional_facilities’.

2.4 Visualization

2.4.1 Bar Plot

ggplot(df) + geom_bar(mapping = aes(x = furnished, fill = "red")) + geom_text(mapping = aes(x = furnished, y = ..count.., label = ..count..), stat = "count", position = position_stack(vjust = 0.5)) + theme(legend.position = "none")

ggplot(df) + geom_bar(mapping = aes(x = region, fill = "red")) + geom_text(mapping = aes(x = region, y = ..count.., label = ..count..), stat = "count", position = position_stack(vjust = 0.5)) + theme(legend.position = "none")

ggplot(df) + geom_bar(mapping = aes(x = completion_year, fill = "red")) + geom_text(mapping = aes(x = completion_year, y = ..count.., label = ..count..), stat = "count", position = position_stack(vjust = 0.5)) + theme(legend.position = "none")

ggplot(df) + geom_bar(mapping = aes(x = bathroom, fill = "red")) + geom_text(mapping = aes(x = bathroom, y = ..count.., label = ..count..), stat = "count", position = position_stack(vjust = 0.5)) + theme(legend.position = "none")

  • Fully furnished shows the most count and there are 5 NA data in it
  • Kuala Lumpur and Selangor have almost the same value count.
  • There are more rental shows in year 2021 compare to the rest.
  • Most of the rented house have 2 bathroom and 3 of the rented house have 8 bathroom.

2.4.2 Stack bar description

ggplot(df, aes(fill=bathroom, y=monthly_rent_rm, x=rooms)) +
  geom_bar(position="dodge", stat="identity")

  • From the above bar chart we can conclude that 2 room and below have 1 bathroom. 4 rooms have 4-6 bathrooms. There are more rented house with 4 rooms with 4-6 bathrooms.

2.4.3 Parallel Coordinate Plot

#Normalization done to shows a better Parallel Coordinate results
process <- preProcess(as.data.frame(df), method=c("range"))
norm_scale <- predict(process, as.data.frame(df))
ggparcoord(norm_scale,
           columns = 4:6,
           groupColumn = "monthly_rent_rm",
           showPoints = TRUE)

  • Above plot is to shows relationship between the 3 variables which is rooms, parking and bathroom.
  • The values in a parallel coordinate plot are always normalized.
  • Each of the line are collection of points of each of the variables and all are connected.
  • This graph below shows that all this 3 variables relationship with monthly rental is at the price below 0.25.
  • The cons of this graph is graph may become over-cluttered as data hasn’t been clean yet which may be one of the reason all are showing below 0.25.

2.4.4 Histogram for interval cropped

# histogram
grid.arrange(his_completion_year, his_monthly_rm, his_parking, his_size, 
             his_rooms,his_bathroom,his_fac_count,his_add_fac_count,nrow = 4,ncol = 2)

  • I first using histogram to overview all the interval varaibles.
  • After the parameters adjustment, the histogram is shown as the above diagram.
  • The best parameters setting wiil be shown in next data cleaning section.

2.4.5 Box plot for some cropped comparation

# boxplot
grid.arrange(box_uncropped_size,box_cropped_size,
             box_uncropped_bathrooms,box_cropped_bathrooms,
             box_uncropped_parking,box_cropped_parking,
             box_uncropped_rent,box_cropped_rent,
             box_uncropped_rooms,box_cropped_rooms,nrow = 5, ncol = 2)

  • Secondly, I using box-plot to show the data distribution for before and after cropped.

2.4.6 Pie Chart for nominal variables

pie_furnished <- create_pie_chart(df, "furnished")

pie_property_type <- create_pie_chart(df, "property_type")

pie_region <- create_pie_chart(df, "region")

pie_nearby_ktm_lrt <- create_pie_chart(df, "nearby_ktm_lrt")

  • Next I used pie chart to overview nominal variable distribution.
  • As you can view, all the nominal variable except property_type distribution seems good.
  • For property_type, more than 95% percent of data are in Apartment, Condominium, and Service Residence.

2.4.7 Line chart(interval) VS monthly_rent

grid.arrange(line_size_VS_rent, line_year_VS_rent, line_Parking_VS_rent,  
             line_add_fac_count_VS_rent,line_fac_count_VS_rent,line_rooms_VS_rent,line_bathroom_VS_rent,nrow = 4,ncol = 2)

  • In the end, I use plot chart to view the relation among interval variable to monthly rent.
  • And box plot to view the nominal variable compare to monthly rent.
  • It can be found that, for plot chart, Rental prices tend to rise as most of these interval variable increase. Completion Year appear to have the least variation.

2.4.7 Box plot(normal) VS monthly_rent

grid.arrange(box_lrt_VS_rent,
             box_furnished_VS_rent,
             box_property_VS_rent,
             box_region_VS_rent, nrow = 4, ncol = 1)

  • Through the box plot, it can be seen that most of the nominal variables have some interference data when the rent price is greater than 4000 and smaller than 500.

2.5 Data cleaning after visualization

2.5.1 Noise/outliers processing

df1 <- df
df1 <- df1 %>% filter(monthly_rent_rm > 400 & monthly_rent_rm < 3500)
df1 <- df1 %>% filter(size_sqft > 100 & size_sqft < 2000)
df1 <- df1 %>% filter(rooms > 0 & rooms < 4)
df1 <- df1 %>% filter(parking > 0 & parking < 4)
df1 <- df1 %>% filter(bathroom > 0 & bathroom < 4)
df1 <- df1[df1$property_type %in% c("Apartment", "Condominium", "Service Residence"), ]
  • After visualization, as you can see, we filter the most suitable range for the numeric variables.
  • And for property_type, it can be found that 95% of data are in “Apartment”, “Condominium”, “Service Residence”. So, we only keep these 3 types of data.

2.5.2 Categorical to numerical transformation

df1$furnished <- round(as.numeric(factor(df1$furnished, levels = c("Not Furnished", "Partially Furnished", "Fully Furnished"), labels = c(1, 2, 3))),0)
df1$region <- round(as.numeric(factor(df1$region, levels = c("Kuala Lumpur", "Selangor"), labels = c(1, 2))),0)
df1$nearby_ktm_lrt <- round(as.numeric(factor(df1$nearby_ktm_lrt, levels = c("yes", "no"), labels = c(2, 1))),0)
df1$property_type <- round(as.numeric(factor(df1$property_type, levels = c("Condominium", "Apartment", "Service Residence"), labels = c(1, 2, 3 ))),0)
  • For categorical data transformation, we transform it to 1,2,3 based the unique value for each type except location.

2.5.3 count unique value for categorical data

is_categorical <- df1[,sapply(df1, function(x) is.factor(x) || is.character(x))]
print(paste('The number of unique values for categorical data:'))
## [1] "The number of unique values for categorical data:"
sapply(is_categorical, function(x) length(unique(x)))
## location 
##       90
  • Next, we count the number of Unique values for all the categorical variables.
  • The same time we want to check the if all categorical data transform successfully.

2.5.4 Location filter

# count the number of rows for each unique value
location_counts <- df1 %>%
  group_by(location) %>%
  summarise(n = n())
# get the unique value count greater than 100
selected_locations <- location_counts %>%
  filter(n > 100) %>%
  pull(location)
# keep the unique value count greater than 100
df1 <- df1 %>%
  filter(location %in% selected_locations)
  • We tried to drop location first, however, r2-score for regression models and accuracy for classification models are lower around 10%.
  • So, we keep the location processing. Firstly, we count the number of each unique values of location, and remove the unique values counted less than 100 rows.

2.5.5 Location processing

# transform unique value of in category of X to new column
X_cat <- df1[,sapply(df1, function(x) is.factor(x) || is.character(x))]
X_cat_ohe <- dummyVars(~ . , data = X_cat)
X_cat_ohe <- predict(X_cat_ohe, newdata = X_cat)
  • Next, We convert the unique values that appear in all locations into new dummy variables through the dummyVars() function, and the values of these columns are displayed in binary form.
  • And we cbind the numeric data imputed and the location together.

2.5.6 Null value processing

#NA processing
colSums(is.na(df1))
## completion_year monthly_rent_rm        location   property_type           rooms 
##            4548               0               0               0               0 
##         parking        bathroom       size_sqft       furnished          region 
##               0               0               0               0               0 
##       fac_count   add_fac_count  nearby_ktm_lrt 
##             843            2741            2741
# Apply numerical imputation
num_name <- colnames(df1)[sapply(df1, function(x) is.numeric(x) || is.integer(x))]
X_num <- df1[, num_name]
X_num_imputed <- numericalImputation(X_num, strategy = "mode")
colSums(is.na(X_num_imputed))
## completion_year monthly_rent_rm   property_type           rooms         parking 
##               0               0               0               0               0 
##        bathroom       size_sqft       furnished          region       fac_count 
##               0               0               0               0               0 
##   add_fac_count  nearby_ktm_lrt 
##               0               0
df1 <- cbind(X_num_imputed, X_cat_ohe)
  • Next we check the NA value for all the columns.
  • If can be found that for location there is no NA value, all NA are in numerical variables.
  • So, we fill the NA value by using mode. Here we didn’t use mean because if we using mean, the value filled might not be integer for some data, such as parking number.

2.6 Regression model pre-processing

2.6.1 X/y split, train/test split

# Call the function and assign the results to X and y
result <- extractInputOutput(data = df1, output_column_name = "monthly_rent_rm")
X1 <- result$input_data
y1 <- result$output_data
X_clean1 <- standardizerData(X1)
set.seed(123)
train_indices1 <- createDataPartition(y1, p = 0.8, list = FALSE)

# Split the data into training and testing sets for regression model
X_train1 <- X_clean1[train_indices1, ]
y_train1 <- y1[train_indices1]
X_test1 <- X_clean1[-train_indices1, ]
y_test1 <- y1[-train_indices1]
cat("Proportion of test set:", round(nrow(X_test1) / nrow(X_clean1), 2))
## Proportion of test set: 0.2
  • Before building models for regression, I split the dataset set the monthly_rent_rm as y value.
  • Then Split to train set and test set.
  • The proportion of the test set is 20%.

2.7 Train regression model

2.7.1 Linear regression

## linear regression model
y_baseline <- rep(mean(y_train1), length(y_train1))
start_time_lr <- Sys.time()
# Train the linear regression model
model_lr <- lm(y_train1 ~ ., data = X_train1)
end_time_lr <- Sys.time()
time_difference_lr <- end_time_lr - start_time_lr

# Predict using the test data
y_pred_lr_train <- round(predict(model_lr, X_train1),0)
# Calculate metrics
lr_score_train <- calculate_metrics(y_train1, y_pred_lr_train)

plot(y_train1, y_pred_lr_train, type = "p",
     xlab = "Actual Rent",
     ylab = "Predicted Rent",
     main = "Actual Rent vs Predicted Rent for train set")
lines(smooth.spline(y_train1, y_pred_lr_train), col = "red")

# Predict using the test data
y_pred_lr_test <- round(predict(model_lr, X_test1),0)
# Calculate metrics
lr_score_test <- calculate_metrics(y_test1, y_pred_lr_test)

plot(y_test1, y_pred_lr_test, type = "p",
     xlab = "Actual Rent",
     ylab = "Predicted Rent",
     main = "Actual Rent vs Predicted Rent for test set")
lines(smooth.spline(y_test1, y_pred_lr_test), col = "red")

2.7.2 Gradient Boosting Regression

start_time_gbr <- Sys.time()
# Build Gradient Boosting Regression model
model_gbr <- gbm(y_train1 ~ ., data = X_train1, n.trees = 100, 
                 distribution = "gaussian", interaction.depth = 4, shrinkage = 0.1, 
                 n.minobsinnode = 10, bag.fraction = 0.5, train.fraction = 0.8, 
                 cv.folds = 5, verbose = TRUE)
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1   249821.4660     223845.3719     0.1000 18009.7644
##      2   234682.3656     211888.8292     0.1000 15311.9433
##      3   221694.2735     201495.1385     0.1000 13211.0232
##      4   210494.9206     192986.3171     0.1000 11478.8619
##      5   200848.5780     184834.0948     0.1000 9083.6325
##      6   192724.1202     179578.1656     0.1000 7552.5611
##      7   185941.5911     173882.7275     0.1000 6469.6144
##      8   179596.7761     168041.7649     0.1000 6148.0347
##      9   173787.9203     163247.1498     0.1000 5476.3547
##     10   168773.0376     159266.6308     0.1000 4728.8987
##     20   134899.3940     129697.8388     0.1000 2660.6819
##     40   107820.4139     103505.6837     0.1000  586.2031
##     60    95651.3158      90540.1896     0.1000  337.5581
##     80    88944.8184      83961.4944     0.1000   54.5086
##    100    84705.8242      80010.2710     0.1000   -2.6924
end_time_gbr <- Sys.time()
time_difference_gbr <- end_time_gbr - start_time_gbr

y_pred_gbr_train <- predict(model_gbr, newdata = X_train1, n.trees = 100, type = "response")
# Calculate mean absolute error and R- suqared
gbr_score_train <- calculate_metrics(y_train1, y_pred_gbr_train)
# Print R-squared and MAE scores

plot(y_train1, y_pred_gbr_train, type = "p",
     xlab = "Actual Rent",
     ylab = "Predicted Rent",
     main = "Actual Rent vs Predicted Rent for train set")
lines(smooth.spline(y_train1, y_pred_gbr_train), col = "red")

# Predict on the test set
y_pred_gbr_test <- predict(model_gbr, newdata = X_test1, n.trees = 100, type = "response")
# Calculate mean absolute error and R- suqared
gbr_score_test <- calculate_metrics(y_test1, y_pred_gbr_test)
# Print R-squared and MAE scores

plot(y_test1, y_pred_gbr_test, type = "p",
     xlab = "Actual Rent",
     ylab = "Predicted Rent",
     main = "Actual Rent vs Predicted Rent for test set")
lines(smooth.spline(y_test1, y_pred_gbr_test), col = "red")

2.7.3 Random Forest Regression

# Build random forest regression model
colnames(X_train1) <- gsub(" ", "_", colnames(X_train1))

start_time_rf <- Sys.time()
model_rf <- randomForest(y_train1 ~ ., data = X_train1)
end_time_rf <- Sys.time()
time_difference_rf <- end_time_rf - start_time_rf

# Predict train
y_pred_rf_train <- predict(model_rf, newdata = X_train1)
# Calculate mean absolute error and R2-squared
rf_score_train <- calculate_metrics(y_train1, y_pred_rf_train)

plot(y_train1, y_pred_rf_train, type = "p",
     xlab = "Actual Rent",
     ylab = "Predicted Rent",
     main = "Actual Rent vs Predicted Rent for train set")
lines(smooth.spline(y_train1, y_pred_gbr_train), col = "red")

# Predict test
y_pred_rf_test <- predict(model_rf, newdata = X_test1)
# Calculate mean absolute error and R2-squared
rf_score_test <- calculate_metrics(y_test1, y_pred_rf_test)

plot(y_test1, y_pred_rf_test, type = "p",
     xlab = "Actual Rent",
     ylab = "Predicted Rent",
     main = "Actual Rent vs Predicted Rent for test set")
lines(smooth.spline(y_test1, y_pred_gbr_test), col = "red")

  • According to the fit chart, random forest has the best fit.

2.7.4 Create the summary DataFrame

summary_df <- data.frame(
  Model = c("LR", "GBR", "RFR"),  # Update with model names
  MAE_Score_train = c(lr_score_train$MAE_score, 
                      gbr_score_train$MAE_score, 
                      rf_score_train$MAE_score),  # Update with MAE scores
  R2_Score_train = c(lr_score_train$R2_score, 
                     gbr_score_train$R2_score, 
                     rf_score_train$R2_score), # Update with R2 scores
  MAE_Score_test = c(lr_score_test$MAE_score, 
                      gbr_score_test$MAE_score, 
                      rf_score_test$MAE_score),  # Update with MAE scores
  R2_Score_test = c(lr_score_test$R2_score, 
                     gbr_score_test$R2_score, 
                     rf_score_test$R2_score), # Update with R2 scores
  Modeling_Time_taken = round(c(time_difference_lr, time_difference_gbr, time_difference_rf),2)
)
# Sort the DataFrame by R2-Score in descending order
regression_results <- summary_df[order(summary_df$R2_Score_test, decreasing = TRUE), ]
regression_results
##   Model MAE_Score_train R2_Score_train MAE_Score_test R2_Score_test
## 3   RFR           92.28         0.9334          160.4        0.7938
## 2   GBR           212.8         0.6803          214.1        0.6764
## 1    LR           240.7         0.5981          238.3        0.6053
##   Modeling_Time_taken
## 3          88.68 secs
## 2          11.68 secs
## 1           0.04 secs
  • I used MAE_Score, R^2_Score to evaluate the three models.
  • MAE score is a robust measure of how accurately the model predicts the target variable. Lower MAE values are better as they indicate less error.
  • r^2 score provides a measure of how well future outcomes are likely to be predicted by the model. R^2 of 1 indicates that the regression predictions perfectly fit the data.
  • The performance of random forest has the lowest MAE score and highest r^2 score for both train set and test set. However, it takes longest time to train the model. Also, compare to the score with train set and test set, random forest seems has the largest differences. Which mean the score for random forest is not stable, not so reliable. Even so, the random forest regression model far outperforms than LR and GBR. Thus, random forest is the winner among the three regression models.

2.8 Classification processing

2.8.1 Create new target renovation_level

q33 <- quantile(df1$monthly_rent_rm/ df1$rooms, probs = 0.33)
q67 <- quantile(df1$monthly_rent_rm/ df1$rooms, probs = 0.67)

df1$renovation_level <- ifelse(df1$monthly_rent_rm / df1$rooms < 450, "standard",
                              ifelse(df1$monthly_rent_rm / df1$rooms >= 450 & 
                                       df1$monthly_rent_rm / df1$rooms <= 666,
                                     "exquisite","luxury"))
df1$renovation_level <- round(as.numeric(factor(df1$renovation_level, levels = c("luxury", "exquisite", "standard"), labels = c(3, 2, 1))),0)
  • Since this project requires 2 research questions, which are regression and classification.
  • Our dataset doesn’t have another meaningful classification target.
  • So, we decided to create a new target variable based on the range of monthly_rent_rm.
  • The new target variable is renovation_level, and it has three types of class which are luxury, exquisite, and standard.

2.8.2 X/y split, train/test split

df2 <- df1
result2 <- extractInputOutput(data = df2, output_column_name = "renovation_level")
X2 <- result2$input_data
y2 <- result2$output_data
X_clean2 <- standardizerData(X2)

# Train-test split
# Create the train and test indices
train_indices2 <- createDataPartition(y2, p = 0.8, list = FALSE)

# Split the data into training and testing sets
X_train2 <- X_clean2[train_indices2, ]
y_train2 <- y2[train_indices2]
X_test2 <- X_clean2[-train_indices2, ]
y_test2 <- y2[-train_indices2]
round(nrow(X_test2) / nrow(X_clean2), 2)
## [1] 0.2

2.8.3 Classification variables exploration

# classification variables exploration
# draw the correlation heat map
table(y_test2)
## y_test2
##   1   2   3 
## 694 763 695
# From the table(y_test2), we can view the 3 types of class distribution to ensure that there is not much difference in the amount of 3 types of data.

corr <- c("monthly_rent_rm","completion_year", "rooms", "parking", "bathroom", "size_sqft", 
          "fac_count", "add_fac_count","region", "nearby_ktm_lrt","property_type","renovation_level")
heat <- df2[,corr]

options(repr.plot.width=8, repr.plot.height=6)

qplot(x=Var1, y=Var2, data=melt(cor(heat, use="p")), fill=value, geom="tile") +
  scale_fill_gradient2(low = "green", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Correlation") +
  theme_minimal()+ 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 8, hjust = 1))+
  coord_fixed()+
  ggtitle("Figure 7 Correlation Heatmap") +
  theme(plot.title = element_text(hjust = 0.4))

2.9 Build Classification Model

2.9.1 Decision Tree Classification

# Create the decision tree model
X_train2 <- X_train2[, !(names(X_train2) %in% c("monthly_rent_rm"))]
X_test2 <- X_test2[, !(names(X_test2) %in% c("monthly_rent_rm"))]
ctrl=rpart.control(cp=0.001)
model_dt <- rpart(y_train2 ~ ., data = X_train2, method = 'class', control=ctrl)
# Predict using the tuned decision tree model
y_pred_dt <- predict(model_dt, X_test2, type = 'class')
# Get the predicted labels based on the highest column index
pre_mat_dt <- table(y_test2, y_pred_dt)
# print accuracy based on the confusion matrix prediction
acc_test_dt <- sum(diag(pre_mat_dt)) / sum(pre_mat_dt)
print(paste('Accuracy for Decision Tree test:', round(acc_test_dt, 4)))
## [1] "Accuracy for Decision Tree test: 0.7728"
  • For classification model, we draw three types as well.
  • Since the classification target is generate based on the monthly_rent, so before build the model, first we need to remove the monthly_rent_rm variable. otherwise the accuracy rate will be falsely high.
  • The first one we used is Decision Tree.
  • As you can see, the accuracy for decision tree is around 77%.

2.9.2 Naive Bayes Classification

# Naive Bayes using in-built package e1071
NBclassfier <- naiveBayes(y_train2 ~ ., data = X_train2)
y_pred_NB <- predict(NBclassfier, newdata = X_test2)  
pre_mat_NB <- table(y_test2, y_pred_NB)
acc_test_NB <- sum(diag(pre_mat_NB)) / sum(pre_mat_NB)
print(paste('Accuracy for Naive Bayes test:', round(acc_test_NB, 4)))
## [1] "Accuracy for Naive Bayes test: 0.619"

2.9.3 Random Forest Classification

# Build random forest for classification 
rf_classifer <- randomForest(y_train2 ~ ., data = X_train2, importance =TRUE,ntree=500,nodesize=7, na.action=na.roughfix)
# Predict
y_pred_rf2 <- round(predict(rf_classifer, newdata = X_test2))
pre_mat_rf2 <- table(y_test2, y_pred_rf2)
acc_test_rf <- sum(diag(pre_mat_rf2)) / sum(pre_mat_rf2)
print(paste('Accuracy for random forest test:', round(acc_test_rf, 4)))
## [1] "Accuracy for random forest test: 0.8397"

2.9.4 classification results

classification_results <- list(pre_mat_dt, pre_mat_NB, pre_mat_rf2)
# Calculate metrics
classification_results <- classification_results_metrics(classification_results)
classification_results
##    Accuracy Precision 1 Precision 2 Precision 3 Recall 1 Recall 2 Recall 3
## DT   0.7728      0.8382      0.6803      0.8233   0.8285   0.7418   0.7511
## NB   0.6190      0.7674      0.5163      0.6651   0.5562   0.6855   0.6086
## RF   0.8397      0.9232      0.7485      0.8746   0.8660   0.8309   0.8230
##    F1 Score 1 F1 Score 2 F1 Score 3
## DT     0.8333     0.7097     0.7856
## NB     0.6449     0.5890     0.6356
## RF     0.8937     0.7876     0.8480
  • These are the comparation results for classification models.

  • Here I used Accuracy, precision, recall, and the F1-score to evaluate their performance, since they can provide different perspectives on the model’s performance.

  • Accuracy is the proportion of correct predictions which is true positives and true negatives among the total number of cases examined. It’s a suitable measure when the classes are balanced, but can be misleading if the class distribution is imbalanced. For test set, the classes of y is balanced, so it is a reliable measures.

  • Precision is the proportion of true positives out of all positive predictions. It is a measure of a classifier’s exactness. A low precision can also indicate a high number of false positives. So, I want to know the Positive Predictive rate of the model.

  • Recall is the proportion of true positives out of all actual positives which is true positives + false negatives. Recall is a measure of a classifier’s completeness. A low recall indicates many false negatives. I want to know the True Positive Rate of the model.

  • The F1-score is the harmonic mean of precision and recall, and tries to balance the two. It is especially useful if the dataset have an uneven class distribution, as it seeks a balance between precision and recall. Here, I want to use F1-score as a reference.

  • As the results, we have 3 conclusion

    1. The precision for class 1 of all the three model are greater than accuracy, class 2 of all the three model are lower than accuracy and precision for class 1 and 2 very much, which mean the models are less confident and less correct when predicting Class 2. It could imply that Class 2 is more difficult for the models to correctly identify, possibly due to reasons such as overlap of features with other classes, lack of distinctive features, or poor quality of labels for Class 2.
    1. The precision, recall and F1-score for class 1 is clearly higher than 2 and 3 for all the three models. As the three classes are balanced. So, Class 1 might have more distinct or clear-cut features that make it easier to identify compared to classes 2 and 3.
    1. Random forest has the best performance.

3 Conclusion

Thus, the end-users can now know the rental price of a type of property in certain area. They can also know the type of property (standard, luxury, exquisite) according to the specification of the property.