Project Title: House Rental Market in Kuala Lumpur, a simple analysis

1 Introduction

1.1 Introduction to dataset

The aim of our group project is to analyze the latest trends in Kuala Lumpur’s house rental market. There aren’t any dataset with latest rental data are available on the web, so we have chosen to scrape from one of the most popular property portal in Malaysia: www.iproperty.com.my. Our team has decided to perform raw collection on 16th May 2021, for the following 24 areas in Kuala Lumpur, we managed to collect 4759 data rows and 18 variables. You can access the dataset from our github repo.

Target Areas

We have selected 24 areas in Kuala Lumpur:

Ampang Bukit Jalil Mont Kiara Sri Petaling
Bangsar Cheras Segambut Sungai Besi
Bangsar South Damansara Heights Sentul Taman Desa
Batu Caves Kepong Seputeh TitiWangsa
Brickfields KLCC Setapak Taman Tun Dr Ismail
Bukit Bintang Kuchai Lama Sri Hartamas Wangsa Maju

Data collection scope

We limit our scope of this project to study only condo and serviced apartments in Kuala Lumpur area. We omit the apartment, terrace houses, etc. Our principal data collection is based on the following filters:
  • Target Area: Kuala Lumpur
  • Residential Type: Condo/Serviced Apartments
  • Category: Rental
  • Sort type: recent

Description of variables

The details of the 18 features are as follows:

[1] name: full name with area
[2] propertyName: the name of property
[3] rental: monthly rental price
[4] pricePerSizeUnitBuiltUp: rental price per built up size
[5] address: address of the property
[6] lat: latitude
[7] lng: longitude
[8] city: city name
[9] district: district name
[10] bedrooms: total number of bedrooms
[11] bathrooms: total number of bathrooms
[12] builtUp: built up size in square feet
[13] carParks: total number of car parks
[14] furnishing: furnishing type - fully/partially furnished or unfurnished
[15] postedDate: the date of the ad posted on the site
[16] buildingId: unique ID of the building
[17] unitType: unit type, for example, corner lot
[18] propertyType: property type, condo or serviced apartment

Libraries Used

R package used for our project are as follows:

#web-scraping
library(selectr)
library(xml2)
library(rvest)
library(jsonlite)

#data pre-processing
library(plyr)
library(dplyr)
library(tidyverse)
library(Amelia)

#Charts
library(ggplot2)
library(corrplot)

#display/formatting
library(knitr)
library(gridExtra)
library(kableExtra)

#Linear Regression/Classification
library(PerformanceAnalytics)
library(caret)
library(rmarkdown)
library(randomForest)

1.2 Research Questions

Here are two research questions that we are interested in:
  1. Are we able to classify whether the rental price is below or above average?
  2. How to predict the rental price using linear regression ?

2. Data Curation

2.1 Brief Explanation

We have initially planned to use rvest package to scrape from the website, but the site seems to ban the user-agent of web scrapers, it returns “ERROR 429”. So, we downloaded the HTML pages manually. After checking the source code, we found the JSON in their HTML source. We used xml2 package with rvest and selectr to extract the dataset we need. First, we extract all JSON files using jsonlite, then export to csv.

2.2 Web Scraping script

setwd('/Users/n2n/RWorkingDirectory/WQD7004_group_project_git/')

pagesNum <- c(1:10)
inputFileNamePrefix <- "iprop_wangsa_maju"
outputCsvFileName <- paste0(inputFileNamePrefix, "_df.csv")

#create a function to extract json, and write to file
extractJsonFromPage <- function(inputFileName, outputFileName){
  iprop_html <- read_html(inputFileName)
  script_nodes <- tail(iprop_html %>%
                         html_nodes("script"))
  pop <- html_text(script_nodes[[5]])
  startIndex <- gregexpr('asyncData', pop)
  endIndex <- gregexpr('false}}', pop)
  jsonString <- substr(pop, startIndex[[1]] - 2, endIndex[[1]] + 6)

  writeLines(jsonString, outputFileName)
}

for(i in pagesNum){
  inputFileName <- paste0(inputFileNamePrefix, i, ".html")
  outputFileName <- paste0(inputFileNamePrefix, i, ".json")
  extractJsonFromPage(inputFileName, outputFileName)
}

#------------------------------
#read JSON file and convert to csv
#------------------------------
#create an empty dataframe with 14 column and add column names
df <- data.frame(matrix(ncol = 14, nrow = 0), stringsAsFactors = FALSE)

#read json file
convertJsonToList <- function(inputFileName){

  propJson <- read_json(inputFileName)
  listings <- (propJson['listings'])[[1]]
  listingItems <- (listings['items'])[[1]]

  mylist <- list() #create an empty list

  for(i in 1:length(listingItems)){
    listingItem1 <- listingItems[[i]]
    fullName <- listingItem1$title
    rental <- listingItem1$prices[[1]]$min
    address <- listingItem1$address$formattedAddress
    lat <- listingItem1$address$lat
    lng <- listingItem1$address$lng
    city <- listingItem1$multilanguagePlace$`en-GB`$level1
    district <- listingItem1$multilanguagePlace$`en-GB`$level2
    propertyName <- listingItem1$multilanguagePlace$`en-GB`$level3
    bedrooms <- listingItem1$attributes$bedroom
    bathrooms <- listingItem1$attributes$bathroom
    builtUp <- listingItem1$attributes$builtUp
    carParks <- listingItem1$attributes$carPark
    furnishing <- listingItem1$attributes$furnishing
    pricePerSizeUnitBuiltUp <- listingItem1$attributes$pricePerSizeUnitBuiltUp
    buildingId <- listingItem1$attributes$buildingId
    unitType <- listingItem1$attributes$unitType
    postedDate <- listingItem1$postedAt
    propertyType <- listingItem1$propertyType

    mylist[[i]] <- list(fullName, propertyName, rental, pricePerSizeUnitBuiltUp, address,
                        lat, lng, city, district, bedrooms, bathrooms, builtUp, carParks,
                            furnishing, postedDate, buildingId, unitType, propertyType)
  }

  return(mylist)
}

myDfList <- list()
for(i in 1:10){
  inputFileName <- paste0(inputFileNamePrefix, i , ".json")
  myDfList[[i]] <- convertJsonToList(inputFileName)
}

mainDf <- data.frame(matrix(ncol = 14, nrow = 0), stringsAsFactors = FALSE)
for(i in 1:10){
  temp_df <- do.call("rbind", myDfList[[i]])
  mainDf <- rbind(mainDf, temp_df)
}

titles <- c("name", "propertyName", "rental","pricePerSizeUnitBuiltUp", "address",
            "lat", "lng", "city", "district", "bedrooms", "bathrooms", "builtUp", "carParks",
            "furnishing", "postedDate", "buildingId", "unitType",  "propertyType")

# First coerce the data.frame to all-character
mainDf <- data.frame(lapply(mainDf, as.character), stringsAsFactors=FALSE)
write.table(mainDf, file = outputCsvFileName, sep = ",", col.names = titles, row.names = FALSE)

2.3 Merge to master dataframe

We extracted JSON from HTML, and converted the json files to csv. All the 24 csv files under the directory were merged using ldply function as follow:

setwd("/Users/n2n/RWorkingDirectory/WQD7004/merged_files")
dataset <- ldply(list.files(), read.csv, header=TRUE)
titles <- c("name", "propertyName", "rental","pricePerSizeUnitBuiltUp", "address",
            "lat", "lng", "city", "district", "bedrooms", "bathrooms", "builtUp", "carParks",
            "furnishing", "postedDate", "buildingId", "unitType",  "propertyType")
write.table(dataset, file = "iprop_master_dataframe.csv", sep = ",", col.names = titles, row.names = FALSE)

2.4 Description of raw dataset

We will import the csv into master dataframe, and we call it “master_data”. Subsequent analysis will be based on this dataframe.

master_data <- read.csv("/Users/n2n/RWorkingDirectory/WQD7004_group_project_git/iprop_master_dataframe.csv")

Dimension & Preview

The dimension of the original dataset:

#dimension of the data frame
dim(master_data)
## [1] 4759   18

We use the package paged_table to display the first 20 rows from the dataset:

#First 10 rows of the dataset
paged_table(head(master_data,20))

Structure

# Determine the structure of the dataset
str(master_data)
## 'data.frame':    4759 obs. of  18 variables:
##  $ name                   : chr  "M City, Ampang" "M City, Ampang" "M City, Ampang" "M City, Ampang" ...
##  $ propertyName           : chr  "M City" "M City" "M City" "M City" ...
##  $ rental                 : int  1300 1550 1800 2200 1350 1600 2200 1550 1650 2050 ...
##  $ pricePerSizeUnitBuiltUp: chr  "2.44" "1.98" "2.3" "2.25" ...
##  $ address                : chr  "Jalan Ampang, Ampang, 50450, Kuala Lumpur" "Jalan Ampang, Ampang, 50450, Kuala Lumpur" "Jalan Ampang, Ampang, 50450, Kuala Lumpur" "Jalan Ampang, Ampang, 50450, Kuala Lumpur" ...
##  $ lat                    : chr  "3.15934" "3.15934" "3.15934" "3.15934" ...
##  $ lng                    : chr  "101.746861" "101.746861" "101.746861" "101.746861" ...
##  $ city                   : chr  "Kuala Lumpur" "Kuala Lumpur" "Kuala Lumpur" "Kuala Lumpur" ...
##  $ district               : chr  "Ampang" "Ampang" "Ampang" "Ampang" ...
##  $ bedrooms               : chr  "Studio" "1" "1" "2" ...
##  $ bathrooms              : chr  "1" "1" "1" "2" ...
##  $ builtUp                : chr  "533" "781" "781" "977" ...
##  $ carParks               : chr  "1" "1" "1" "2" ...
##  $ furnishing             : chr  "Fully furnished" "Fully furnished" "Fully furnished" "Fully furnished" ...
##  $ postedDate             : chr  "2021-05-13T05:37:19.004Z" "2021-05-13T05:35:23.069Z" "2021-05-13T05:33:01.874Z" "2021-05-13T05:31:28.698Z" ...
##  $ buildingId             : chr  "2247" "2247" "2247" "2247" ...
##  $ unitType               : chr  "NULL" "NULL" "NULL" "Intermediate" ...
##  $ propertyType           : chr  "Serviced Residence" "Serviced Residence" "Serviced Residence" "Serviced Residence" ...

Summary

# Summary of data
summary(master_data)
##      name           propertyName           rental       
##  Length:4759        Length:4759        Min.   :    400  
##  Class :character   Class :character   1st Qu.:   1600  
##  Mode  :character   Mode  :character   Median :   2000  
##                                        Mean   :   4089  
##                                        3rd Qu.:   2900  
##                                        Max.   :1900000  
##  pricePerSizeUnitBuiltUp   address              lat           
##  Length:4759             Length:4759        Length:4759       
##  Class :character        Class :character   Class :character  
##  Mode  :character        Mode  :character   Mode  :character  
##                                                               
##                                                               
##                                                               
##      lng                city             district           bedrooms        
##  Length:4759        Length:4759        Length:4759        Length:4759       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   bathrooms           builtUp            carParks          furnishing       
##  Length:4759        Length:4759        Length:4759        Length:4759       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   postedDate         buildingId          unitType         propertyType      
##  Length:4759        Length:4759        Length:4759        Length:4759       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
## 

3. Data Pre-processing

3.1 Remove rows

First, we are going remove the rows that doesn’t have a propertyName, there are a total of 20 rows. We also noticed that there is one record of propertype Apartment, since our study only focused on Condo and Serviced Apartments, we are going to remove this record too.

sprintf("Number of listings without a name: %s", count(master_data[master_data$propertyName == 'NULL',])) 
## [1] "Number of listings without a name: 20"
master_data <- master_data %>% filter(propertyName != 'NULL')
master_data <- master_data %>% filter(!(propertyType == 'Apartment'))
count(master_data) #4738 records
##      n
## 1 4738

3.2 Update the data type

Based on the structure, all columns are characters except rental column. So, we need to convert the data types first.

master_data$lat <- as.numeric(master_data$lat, options(digits = 6))
master_data$lng <- as.numeric(master_data$lng,options(digits = 6))
master_data$bathrooms <-as.numeric(master_data$bathrooms)
master_data$builtUp <-as.numeric(gsub(",","",master_data$builtUp))
master_data$carParks <-as.numeric(master_data$carParks)
master_data$postedDate <-as.Date(master_data$postedDate)
master_data$pricePerSizeUnitBuiltUp<-as.numeric(master_data$pricePerSizeUnitBuiltUp)
master_data$buildingId <-as.numeric(master_data$buildingId)

3.3 Check for missing values

Total number of rows with missing values

sum(is.na(master_data))
## [1] 598

Let’s check where are the missing value located:

colSums(is.na(master_data))
##                    name            propertyName                  rental 
##                       0                       0                       0 
## pricePerSizeUnitBuiltUp                 address                     lat 
##                     160                       0                       1 
##                     lng                    city                district 
##                       1                       0                       0 
##                bedrooms               bathrooms                 builtUp 
##                       0                       1                      15 
##                carParks              furnishing              postedDate 
##                     420                       0                       0 
##              buildingId                unitType            propertyType 
##                       0                       0                       0

Next, we draw a missingness map to visualise:

missmap(master_data)

3.4 Impute missing values

We round up the number of carpark and bathroom rounded up due to they should be as a whole number and we are going to ignore lat and lng due to it would not affect our analysis.

master_data_imputed <- master_data %>%  
  group_by(district) %>%
  mutate(builtUp = ifelse(is.na(builtUp),mean(builtUp,na.rm=TRUE),builtUp),
         carParks = ifelse(is.na(carParks),round(mean(carParks,na.rm=TRUE),digits = 0),carParks),
         bathrooms = ifelse(is.na(bathrooms),round(mean(bathrooms,na.rm=TRUE),digits = 0),bathrooms),
         lat = ifelse(is.na(lat),round(mean(lat,na.rm=TRUE),digits = 0),lat),
         lng = ifelse(is.na(lng),round(mean(lng,na.rm=TRUE),digits = 0),lng),
         pricePerSizeUnitBuiltUp = ifelse(is.na(pricePerSizeUnitBuiltUp),round(mean(pricePerSizeUnitBuiltUp,na.rm=TRUE),digits = 0),pricePerSizeUnitBuiltUp))

Let’s verify the missing values again

colSums(is.na(master_data_imputed))
##                    name            propertyName                  rental 
##                       0                       0                       0 
## pricePerSizeUnitBuiltUp                 address                     lat 
##                       0                       0                       0 
##                     lng                    city                district 
##                       0                       0                       0 
##                bedrooms               bathrooms                 builtUp 
##                       0                       0                       0 
##                carParks              furnishing              postedDate 
##                       0                       0                       0 
##              buildingId                unitType            propertyType 
##                       0                       0                       0

3.5 Detect outliers

We are going to detect outliers in the based on the percentiles. With the percentiles method, all observations that lie outside the interval formed by the 2.5 and 97.5 percentiles will be considered as potential outliers. Other percentiles such as the 1 and 99, or the 5 and 95 percentiles can also be considered to construct the interval.

Based on the summary of data, we can observed that the main feature (i.e. rental) has a huge range. Let’s check the summary and boxplot:

summary(master_data_imputed$rental)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     400    1600    2000    4050    2900 1900000

Boxplot of rental price

# Store the graph
box_plot <- ggplot(master_data_imputed, aes(y = rental))
# Add the geometric object box plot
box_plot +
   geom_boxplot() +coord_flip()+ggtitle("Overall Price Boxplot")

Subsequently, we are going to identify the potential outliers in the column rental price with Percentile methods.

lower_bound <- quantile(master_data_imputed$rental, 0.025)
upper_bound <- quantile(master_data_imputed$rental, 0.975)

outlier_ind <- master_data_imputed %>%
   filter(rental < quantile(master_data_imputed$rental, 0.025) | rental > quantile(master_data_imputed$rental, 0.975))
# A total 235 record found as outliers with the range of rental between 1150 and 12000

## exclude outlier from the data
master_data_clean <-  master_data_imputed %>%
   filter(rental <= quantile(master_data$rental, 0.975) & rental >= quantile(master_data$rental, 0.025))

We will examine the summary of rental price again after removing the outliers with percentile method.

## Summary of rental after removing outliers
summary(master_data_clean$rental)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1100    1600    2000    2588    2800   11000

We noticed outliers in carParks column too. We will replacing those values with 1.

master_data_clean[master_data_clean$carParks > 20, "carParks"] <- 1

3.6 Standardise values

Furthermore, we noticed that the column “furnishing” is not standardized when we check for the unique values.

unique(master_data_clean$furnishing)
## [1] "Fully furnished"  "NULL"             "Partly furnished" "Fully Furnished" 
## [5] "Unfurnished"      "Partly Furnished"

Hence, we need to do standardization for furnishing column.

master_data_clean <- master_data_clean %>%
   mutate(furnishing = ifelse(furnishing == "Fully furnished" | furnishing == "Fully Furnished", "Fully Furnished",
                               ifelse(furnishing == "Partly furnished" | furnishing == "Partly Furnished", "Partly Furnished",
                                      ifelse(furnishing == "NULL", "Unknown",
                                      furnishing))))

Let’s check the values again after standardization:

unique(master_data_clean$furnishing)
## [1] "Fully Furnished"  "Unknown"          "Partly Furnished" "Unfurnished"

3.7 Convert columns to factor

We noticed that bedrooms, furnishing types, property type, unit type and districts columns needed to be converted to factors. We will do this conversion using dplyr package:

master_data_clean <- master_data_clean %>%
  mutate(furnishing=factor(furnishing)) %>%
  mutate(bedrooms=factor(bedrooms)) %>%
  mutate(propertyType=factor(propertyType)) %>%
  mutate(unitType=factor(unitType)) %>%
  mutate(district=factor(district)) 

3.8 Drop unused features

There are total 18 features in the dataset that we have curated, we will drop some of the features that we will not be using for analysis:

master_data_clean <- master_data_clean %>% 
  select(propertyName, rental, pricePerSizeUnitBuiltUp, lat, lng, district, bedrooms, bathrooms, builtUp, carParks, furnishing, buildingId, unitType, propertyType)

3.9 Final dataset for analysis

By now, we have handled all the data pre-processing steps. And the dataframe dimension after removing outliers is 4549 observations with 14 features.

Summary of data

summary(master_data_clean)
##  propertyName           rental      pricePerSizeUnitBuiltUp      lat      
##  Length:4549        Min.   : 1100   Min.   : 0.78           Min.   :3.00  
##  Class :character   1st Qu.: 1600   1st Qu.: 1.71           1st Qu.:3.11  
##  Mode  :character   Median : 2000   Median : 2.14           Median :3.15  
##                     Mean   : 2588   Mean   : 2.40           Mean   :3.14  
##                     3rd Qu.: 2800   3rd Qu.: 2.80           3rd Qu.:3.18  
##                     Max.   :11000   Max.   :38.00           Max.   :3.25  
##                                                                           
##       lng                                  district       bedrooms   
##  Min.   :102   Wangsa Maju                     : 209   3      :1778  
##  1st Qu.:102   Kampung Kerinchi (Bangsar South): 200   2      : 945  
##  Median :102   Segambut                        : 200   3+1    : 559  
##  Mean   :102   Sungai Besi                     : 199   1      : 342  
##  3rd Qu.:102   Taman Desa                      : 199   Studio : 271  
##  Max.   :102   Brickfields                     : 198   4      : 216  
##                (Other)                         :3344   (Other): 438  
##    bathrooms       builtUp          carParks               furnishing  
##  Min.   :1.00   Min.   :    50   Min.   :1.00   Fully Furnished :2609  
##  1st Qu.:2.00   1st Qu.:   823   1st Qu.:1.00   Partly Furnished:1193  
##  Median :2.00   Median :  1021   Median :1.00   Unknown         : 684  
##  Mean   :2.13   Mean   :  1250   Mean   :1.46   Unfurnished     :  63  
##  3rd Qu.:2.00   3rd Qu.:  1300   3rd Qu.:2.00                          
##  Max.   :7.00   Max.   :184368   Max.   :4.00                          
##                                                                        
##    buildingId           unitType                propertyType 
##  Min.   :   6   NULL        :2610   Condominium       :2225  
##  1st Qu.:1737   Intermediate:1068   Serviced Residence:2324  
##  Median :2857   Corner lot  : 733                            
##  Mean   :3072   Duplex      :  40                            
##  3rd Qu.:4195   End lot     :  38                            
##  Max.   :7013   Studio      :  28                            
##                 (Other)     :  32

Clean Dataset

paged_table(master_data_clean)

4. Exploratory Data Analysis

4.1 Rental by district

# Store the graph
box_plot <- ggplot(master_data_clean, aes(x = district,y = rental))
# Add the geometric object box plot
box_plot +
   geom_boxplot() +coord_flip()+ggtitle("Boxplot of Rental by District")

4.2 Top 10 most expensive areas

Usually, mean value will be taken to plot the average. But in this case, the data is not evenly distributed, and a lot of outliers detected from the boxplot in section 4.1. Hence, it is more reasonable to take the median value for comparison instead of mean.

group_by_district <- aggregate(list(master_data_clean$rental), list(master_data_clean$district), median)
colnames(group_by_district) <- c("District","Avg_rental_price_by_District")
group_by_district <- group_by_district[order(group_by_district$Avg_rental_price_by_District, decreasing = TRUE),]

#reset the row index
rownames(group_by_district) <- 1:24

top_10_district <- head(group_by_district, 10)

#plot the graph: top 10
options(repr.plot.width=15, repr.plot.height=11)
plot1 <- ggplot(data = top_10_district, mapping = aes(x = reorder(District, Avg_rental_price_by_District), y = Avg_rental_price_by_District)) +
  geom_bar(stat = "identity", alpha = .8, size = 1.5, fill = "navyblue") +
  labs(x = "District", y = "Average monthly rental price in (Ringgit Malaysia)") +
  coord_flip() +
  geom_label(mapping = aes(label = round(Avg_rental_price_by_District, 1)), size = 4, fill = "#F5FFFA", fontface = "bold") + ggtitle("Top 10 most expensive areas in Kuala Lumpur")
plot1

The most expensive areas in Kuala Lumpur can said to be Mont Kiara and average monthly rental price is around RM 3900.

4.3 Top 10 cheapest areas

Next, we will plot out the 10 most affordable areas by rental price per month:

#plot the last 10
last_10_districts <- tail(group_by_district, 10)

options(repr.plot.width=15, repr.plot.height=11)
plot2 <- ggplot(data = last_10_districts, mapping = aes(x = reorder(District, -Avg_rental_price_by_District), y = Avg_rental_price_by_District)) +
  geom_bar(stat = "identity", alpha = .8, size = 1.5, fill = "#FF6666") +
  labs(x = "District", y = "Average monthly rental price in (Ringgit Malaysia)") +
  coord_flip() +
  geom_label(mapping = aes(label = round(Avg_rental_price_by_District, 1)), size = 4, fill = "#F5FFFA", fontface = "bold") + ggtitle("Top 10 most affordable areas in Kuala Lumpur")
plot2

From the above visualization, we can summarize that Setapak and Kepong are the two most affordable areas in Kuala Lumpur.

4.4 Analysis by property size

The built-up size of the houses varies from one building to another. In order to analyse the units, we have categorised them into 6 different general sizes, The distribution of the house sizes can be examined in the donut plot below.

library(ggrepel)
count_by_condo_size = function(min, max){
  count <- master_data_clean %>% filter(builtUp > min & builtUp <=max) %>% nrow()
  count
}

df_builtup_summary <- data.frame(
  category=c("<=600 sq.ft.", "600-900 sq.ft", "900-1200 sq.ft", "1200-1500 sq.ft", "1500-1800 sq.ft", ">1800 sq.ft"),
  count=c(count_by_condo_size(0, 600), 
          count_by_condo_size(600, 900), 
          count_by_condo_size(900, 1200), 
          count_by_condo_size(1200, 1500), 
          count_by_condo_size(1500, 1800), 
          count_by_condo_size(1800, max(master_data_clean$builtUp)))
)

df_builtup_summary$percentage = df_builtup_summary$count / sum(df_builtup_summary$count)* 100
df_builtup_summary$ymax = cumsum(df_builtup_summary$percentage)
df_builtup_summary$ymin = c(0, head(df_builtup_summary$ymax, n = -1))

donut_plot = ggplot(df_builtup_summary, aes(fill = category, ymax = ymax, ymin = ymin, xmax = 100, xmin = 80)) +
  geom_rect(colour = "black") +
  coord_polar(theta = "y") + 
  xlim(c(0, 100)) +
  geom_label_repel(aes(label = paste(round(percentage,2),"%"), x = 100, y = (ymin + ymax)/2),inherit.aes = F, show.legend = F, size = 5)+
  theme(legend.title = element_text(colour = "black", size = 10, face = "bold"), 
        legend.text = element_text(colour = "black", size = 10), 
        panel.grid = element_blank(),
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank()) +
  annotate("text", x = 0, y = 0, size = 10, label = "Built-up Size")
donut_plot

From the above chart, we can conclude that the most common types of unit are sized between 600-900sq.ft, 900-1200 sq.ft and 1200-1500sq.ft, which is about 77.36% of total units.

4.5 Average price per built-up size

The rental price ranking in section 4.2 and 4.3 does not take the built-up size(square feet) into account. This section will plot the average price per unit square feet, with minimum and maximum on a ribbon chart.

# min, max, median
summary_df <- master_data_clean %>% 
  filter(pricePerSizeUnitBuiltUp < 10) %>%
  group_by(district) %>%
  summarize(Min = min(pricePerSizeUnitBuiltUp),
    Avg = median(pricePerSizeUnitBuiltUp), 
            Max = max(pricePerSizeUnitBuiltUp)) 
summary_df <- summary_df %>% arrange(Max)

# make district an ordered factor
summary_df$district <- factor(summary_df$district, levels = summary_df$district)

p1 <- ggplot(summary_df, mapping = aes(district, Avg, group = 1)) +
  geom_ribbon(aes(ymin=Min,ymax=Max),color="yellow",fill = "yellow",alpha=0.5) + 
  geom_line(color="#69b3a2") +
  labs(x = "District", y = "Price per built up sq. ft (RM)", 
       title = "Average price per built up size (with max and min price)") +
  theme(axis.text.x=element_text(angle = 90, hjust = 0))
  
print(p1)

From the above ribbon chart, we can examine the relationship between built-up size and rental price per square feet. The blue line is average rental per built-up size, and the yellow area depicts max and min prices. Some of the districts seems to have uniform distribution between the prices, but some districts has extreme price points. This could be due to the location, facilities, and many other factors.

4.6 Correlation

To calculate the correlation among the variables, we will drop the first column propertyName, then convert factors to their numbers by applying unclass.

master_data_clean_cor <- master_data_clean %>% select(rental, pricePerSizeUnitBuiltUp, lat, lng, bedrooms, bathrooms, builtUp, carParks, furnishing, buildingId, unitType, propertyType, district)

master_data_clean_cor <- sapply(master_data_clean_cor, unclass) 

correlations <- cor(master_data_clean_cor)

Correlation Plot

corrplot(correlations, type="upper", order="hclust")

From this matrix, we can conclude that the the main features that might influence rental price are: pricePerSizeUnitBuiltUp, carParks and bathrooms.

Correlation Matrix

Here’s the full summary of correlation matrix among the features:

print(correlations)
##                             rental pricePerSizeUnitBuiltUp         lat
## rental                   1.0000000              0.37336359  0.01984176
## pricePerSizeUnitBuiltUp  0.3733636              1.00000000 -0.02292580
## lat                      0.0198418             -0.02292580  1.00000000
## lng                     -0.1080804             -0.02566229 -0.06899407
## bedrooms                 0.1498092             -0.04019881  0.02210164
## bathrooms                0.5930987             -0.12115902 -0.03064740
## builtUp                  0.0709167             -0.01927450  0.04443338
## carParks                 0.3236340             -0.05456600  0.01624431
## furnishing              -0.1164556             -0.19106246  0.09729998
## buildingId              -0.1810985              0.00710752  0.09962887
## unitType                -0.1544324             -0.00104958  0.02960520
## propertyType            -0.1227679              0.16640108  0.18100744
## district                -0.1485355             -0.11711470  0.00730706
##                                lng     bedrooms  bathrooms    builtUp
## rental                  -0.1080804  1.49809e-01  0.5930987  0.0709167
## pricePerSizeUnitBuiltUp -0.0256623 -4.01988e-02 -0.1211590 -0.0192745
## lat                     -0.0689941  2.21016e-02 -0.0306474  0.0444334
## lng                      1.0000000 -6.47466e-02 -0.0636709 -0.0242899
## bedrooms                -0.0647466  1.00000e+00  0.2936839  0.0279011
## bathrooms               -0.0636709  2.93684e-01  1.0000000  0.0853851
## builtUp                 -0.0242899  2.79011e-02  0.0853851  1.0000000
## carParks                -0.1219492  2.78792e-01  0.4357333  0.0719533
## furnishing               0.0257199  8.60426e-03  0.0479502  0.0485018
## buildingId               0.0790843 -1.26791e-01 -0.1312289  0.0109746
## unitType                 0.0296154 -7.10433e-05 -0.1341334 -0.0031640
## propertyType            -0.0430179 -1.20938e-01 -0.3226945 -0.0148999
## district                -0.0295029  5.89924e-02  0.0421573 -0.0375064
##                           carParks  furnishing  buildingId     unitType
## rental                   0.3236340 -0.11645559 -0.18109846 -1.54432e-01
## pricePerSizeUnitBuiltUp -0.0545660 -0.19106246  0.00710752 -1.04958e-03
## lat                      0.0162443  0.09729998  0.09962887  2.96052e-02
## lng                     -0.1219492  0.02571992  0.07908426  2.96154e-02
## bedrooms                 0.2787920  0.00860426 -0.12679051 -7.10433e-05
## bathrooms                0.4357333  0.04795020 -0.13122887 -1.34133e-01
## builtUp                  0.0719533  0.04850179  0.01097463 -3.16400e-03
## carParks                 1.0000000  0.02674755  0.12640943 -5.51677e-02
## furnishing               0.0267475  1.00000000  0.10128448  1.37606e-01
## buildingId               0.1264094  0.10128448  1.00000000  6.55726e-02
## unitType                -0.0551677  0.13760564  0.06557264  1.00000e+00
## propertyType            -0.1606065 -0.07963709  0.30500354  6.60764e-02
## district                -0.0791855  0.07043111 -0.03447826  7.62658e-03
##                         propertyType    district
## rental                    -0.1227679 -0.14853551
## pricePerSizeUnitBuiltUp    0.1664011 -0.11711470
## lat                        0.1810074  0.00730706
## lng                       -0.0430179 -0.02950288
## bedrooms                  -0.1209377  0.05899242
## bathrooms                 -0.3226945  0.04215730
## builtUp                   -0.0148999 -0.03750641
## carParks                  -0.1606065 -0.07918547
## furnishing                -0.0796371  0.07043111
## buildingId                 0.3050035 -0.03447826
## unitType                   0.0660764  0.00762658
## propertyType               1.0000000 -0.04668294
## district                  -0.0466829  1.00000000

5. Classification

The objective of this section is to classify whether a house is expensive or cheap. We are going to take the median value instead of mean (because our dataset is not evenly distributed).

First, we will create a new target variale called “rental_class”. 0 indicates below average and 1 indicates above average.

median_rental <- median(master_data_clean$rental)
master_data_clean$rental_class <- ifelse(master_data_clean$rental > median_rental, 1, 0)
print(median_rental)
## [1] 2000

From our dataset, the we take the average monthly rental price of RM 2000, perform Random Forest classificasion:

master_data_analysis <- master_data_clean[3:15]
master_data_analysis$rental_class <- as.character(master_data_analysis$rental_class)
master_data_analysis$rental_class <- as.factor(master_data_analysis$rental_class)
rf_model <- randomForest(rental_class ~ ., data=master_data_analysis, proximity=TRUE)
varImpPlot(rf_model, main="Random Forest Model Classification")

Based on the MeanDecreaseGini chart above, we can conclude that the important features for classification are: builtUp size, pricePerSizeUnitBuiltUp, district, bedrooms and bathrooms.

Next Let’s print out the confusion matrix of our RF model.

rf_model
## 
## Call:
##  randomForest(formula = rental_class ~ ., data = master_data_analysis,      proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 2.44%
## Confusion matrix:
##      0    1 class.error
## 0 2374   51   0.0210309
## 1   60 2064   0.0282486

The total number of decision trees created in this model is 500, the error rate is 2.44% which is considered quite low.

6. Predictive Analysis

6.1 Grouping datasets

Objective: To build a linear model to predict house price.

Target Variable: rental
First predictor: bathrooms
Second predictor: pricePerSizeUnitBuiltUp

Let’s calculate correlation coefficients between target variables and the two predictors for 4 different datasets that we have sampled:

Dataset 1

Dataset 1 is a collection of properties from 2 areas: Batu Caves, Bukit Jalil and Bukit Bintang.

new_dataset1 <- master_data_clean %>% 
  filter(district == 'Batu Caves'|district == 'Bukit Jalil'|district == 'Bukit Bintang')
dataset1_corb <- cor.test(new_dataset1$rental,new_dataset1$bathrooms,method = "pearson")
dataset1_corp <- cor.test(new_dataset1$rental,new_dataset1$pricePerSizeUnitBuiltUp,method = "pearson")
sprintf("Correlation coefficient of rental vs bathrooms: %f", dataset1_corb$estimate)
## [1] "Correlation coefficient of rental vs bathrooms: 0.301215"
sprintf("Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: %f", dataset1_corp$estimate)
## [1] "Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: 0.756856"
new_dataset1 <- new_dataset1 %>% select(rental, pricePerSizeUnitBuiltUp, bathrooms)
chart.Correlation(new_dataset1[,c(2,3,4)],histogram=TRUE,pch="19")

Dataset 2

Dataset 2 consists of properties from: Batu Caves, Cheras, Kuchai Lama, Sri Hartamas and Sri Petaling.

new_dataset2 <- master_data_clean %>% 
  filter(district == 'Batu Caves'|district == 'Cheras'|district == 'Kuchai Lama'|district == 'Sri Hartamas'|district == 'Sri Petaling')
dataset2_corb <- cor.test(new_dataset2$rental,new_dataset2$bathrooms,method = "pearson")
dataset2_corp <- cor.test(new_dataset2$rental,new_dataset2$pricePerSizeUnitBuiltUp,method = "pearson")
sprintf("Correlation coefficient of rental vs bathrooms: %f", dataset2_corb$estimate)
## [1] "Correlation coefficient of rental vs bathrooms: 0.641577"
sprintf("Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: %f", dataset2_corp$estimate)
## [1] "Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: 0.278929"
new_dataset2 <- new_dataset2 %>% select(rental, pricePerSizeUnitBuiltUp, bathrooms)
chart.Correlation(new_dataset2[,c(2,3,4)],histogram=TRUE,pch="19")

Dataset 3

Dataset 3 is a compilation of properties from Sri Hartamas, Bangsar and Mont Kiara

new_dataset3 <- master_data_clean %>% 
  filter(district == 'Sri Hartamas'| district == 'Bangsar'|district == 'Mont Kiara')
dataset3_corb <- cor.test(new_dataset3$rental,new_dataset3$bathrooms,method = "pearson")
dataset3_corp <- cor.test(new_dataset3$rental,new_dataset3$pricePerSizeUnitBuiltUp,method = "pearson")
sprintf("Correlation coefficient of rental vs bathrooms: %f", dataset3_corb$estimate)
## [1] "Correlation coefficient of rental vs bathrooms: 0.886179"
sprintf("Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: %f", dataset3_corp$estimate)
## [1] "Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: -0.021008"
new_dataset3 <- new_dataset3 %>% select(rental, pricePerSizeUnitBuiltUp, bathrooms)
chart.Correlation(new_dataset3[,c(2,3,4)],histogram=TRUE,pch="19")

Dataset 4

Dataset 4 is a set of units from Batu Caves, Cheras, Kuchai Lama, Sri Hartamas and Sri Petaling.

new_dataset4 <- master_data_clean %>% 
  filter(district == 'Batu Caves'|district == 'Cheras'|district == 'Kuchai Lama'|district == 'Sri Hartamas'|district == 'Sri Petaling')
dataset4_corb <- cor.test(new_dataset4$rental,new_dataset4$bathrooms,method = "pearson")
dataset4_corp <- cor.test(new_dataset4$rental,new_dataset4$pricePerSizeUnitBuiltUp,method = "pearson")
sprintf("Correlation coefficient of rental vs bathrooms: %f", dataset4_corb$estimate)
## [1] "Correlation coefficient of rental vs bathrooms: 0.641577"
sprintf("Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: %f", dataset4_corp$estimate)
## [1] "Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: 0.278929"
new_dataset4 <- new_dataset4 %>% select(rental, pricePerSizeUnitBuiltUp, bathrooms)
chart.Correlation(new_dataset4[,c(2,3,4)],histogram=TRUE,pch="19")

6.2 Split train-test samples

From section 6.1, based on the higher correlation values (with each dependent variables), we choose dataset 1 and dataset 3 to continue with linear regression models.

new_dataset <- rbind(new_dataset1,new_dataset3)
dim(new_dataset)
## [1] 1109    4

Next, we are going to split our dataset into training set and test set with 70-30 ratio.

set.seed(123)
split = 0.70 #define an 70%/30% training/testing split of the dataset.
training_set_index <- createDataPartition(new_dataset$district,p=split,list=FALSE)
trainingset <- new_dataset[training_set_index,]
testset <- new_dataset[-training_set_index,]
dim(trainingset)
## [1] 779   4
dim(testset)
## [1] 330   4

We now have 779 rows in training dataset and 330 rows in test dataset.

6.3 Linear Regression Models

We are going to train two linear regression models based on the train-test samples.
* Model 1: has one predictor which is the most correlated column with the #target(i.e bathrooms).
* Model 2: has two predictors, bathrooms and pricePerSizeUnitBuiltUp

Model 1

The first model will use only one predictor.

linear_model_1_predictor<-lm(rental~bathrooms,data=trainingset)
summary(linear_model_1_predictor)
## 
## Call:
## lm(formula = rental ~ bathrooms, data = trainingset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3161  -1106   -106    650   6894 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    193.9      123.7    1.57     0.12    
## bathrooms     1455.8       51.9   28.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1570 on 777 degrees of freedom
## Multiple R-squared:  0.503,  Adjusted R-squared:  0.502 
## F-statistic:  787 on 1 and 777 DF,  p-value: <2e-16
par(mfrow=c(2,2))
plot(linear_model_1_predictor)

Model 2

The second model uses two predictors against target variable “rental”.

linear_model_2_predictor<-lm(rental~pricePerSizeUnitBuiltUp+bathrooms,data=trainingset)
summary(linear_model_2_predictor)
## 
## Call:
## lm(formula = rental ~ pricePerSizeUnitBuiltUp + bathrooms, data = trainingset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3031   -442    -19    421   4404 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -3237.8      127.0   -25.5   <2e-16 ***
## pricePerSizeUnitBuiltUp   1044.7       30.5    34.3   <2e-16 ***
## bathrooms                 1683.7       33.4    50.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 994 on 776 degrees of freedom
## Multiple R-squared:  0.802,  Adjusted R-squared:  0.802 
## F-statistic: 1.57e+03 on 2 and 776 DF,  p-value: <2e-16
par(mfrow=c(2,2))
plot(linear_model_2_predictor)

6.4 Prediction

In this section, we are going to perform predictions with the models we trained in section 6.3.

#Predictor1
prediction_model1 <- predict(linear_model_1_predictor, testset)

#Predictor2
prediction_model2 <- predict(linear_model_2_predictor, testset)

#plots
par(mfrow=c(2,2))

plot(testset$rental,type="l",lty=1.8,col="red")
lines(prediction_model1,type="l",col="blue")
plot(prediction_model1,type="l",lty=1.8,col="blue")

plot(testset$rental,type="l",lty=1.8,col="red")
lines(prediction_model2,type="l",col="blue")
plot(prediction_model2,type="l",lty=1.8,col="blue")

mtext("Predictions with model 1 (above) and model2 (below)", side = 3, line = -12, outer = TRUE)

6.5 Model Evaluation

Here are two scatter plots of the performance of model 2 over training set vs test set.

plot_predictions <- function(linear_model_predictor, dataset){
  pred = predict(linear_model_predictor,dataset)
  act = dataset$rental
  tst = data.frame(district = dataset$district,prediction = pred,actual = act)
  tst = mutate(tst,delta = prediction - actual)
  tstvar = round(var(tst$delta))
  plot <- ggplot(tst,aes(x=actual,y=prediction,color=district)) + geom_point()  + geom_abline(intercept = 0, slope = 1, linetype="dashed")
  plot
}

p1 <- plot_predictions(linear_model_2_predictor, trainingset)
p2 <- plot_predictions(linear_model_2_predictor, testset)
grid.arrange(p1, p2, ncol=2)

Let’s calculate accuracy of the above two models:

calculate_accuracy_trainset <- function(linear_model, dataset){
  # Fitted values of train data
  pred_1_train = linear_model$fitted.values
  
  # Store actual price and predicted price in a data frame
  actual_fitted = data.frame(actual=dataset$rental,predicted=pred_1_train)
  
  # Mean of absolute difference between predicted values and actual values
  abs_diff = mean(abs(actual_fitted$actual-actual_fitted$predicted)/actual_fitted$actual)
  
  # % accuracy of the prediction with trainigset1
  accuracy_train = 1 - abs_diff
  
  accuracy_train
}

calculate_accuracy_testset <- function(linear_model, dataset){
  pred_1_test=predict(linear_model,dataset)
  
  # Store actual price and predicted price in a data frame
  actual_fitted_test=data.frame(actual=dataset$rental, predicted=pred_1_test)
  
  # Mean of absolute difference between predicted values and actual values
  abs_diff_test = mean(abs(actual_fitted_test$actual-actual_fitted_test$predicted)/actual_fitted_test$actual)
  
  # % accuracy of the prediction with testset1
  accuracy_test = 1-abs_diff_test
  accuracy_test
}

percent <- function(x, digits = 2, format = "f", ...) {
  paste0(formatC(100 * x, format = format, digits = digits, ...), "%")
}

m1_train <- calculate_accuracy_trainset(linear_model_1_predictor, trainingset)
m1_test <- calculate_accuracy_testset(linear_model_1_predictor, testset)
m2_train <- calculate_accuracy_trainset(linear_model_2_predictor, trainingset)
m2_test <- calculate_accuracy_testset(linear_model_2_predictor, testset)

Accuracy scores of the two model on trining and test data set as follows:

Accuracy scores of the two models
model train set test set
model 1 predictor 58.65% 77.03%
model 2 predictor 58.56% 77.59%

Next we will compute the kendall’s tau accuracy for the predictions of the two prediction models above, to check how the rankings of the predicted values correlate with the actual target values in the test set.

kt_for_model1_predictor <- cor(prediction_model1,testset$rental,method = "kendall")
kt_for_model2_predictor <- cor(prediction_model2,testset$rental,method = "kendall")
Let’s print out the accuracy scores:
Kendall’s Tau Correlation Coefficient
model Kendall’s Tau Accuracy
model1 predictor 47.95%
model2 predictor 71.26%

Based on Kendalls tau correlations, we can conclude that the model with two predictors perform better.

6.6 Perform Significance Test

For significance testing, we will run the models 50 times. Each time, we perform predictions on the test set and compute the Kendall’s tau correlation. The Kendall’s tau correlations for the models are kept separately in data two data frames.

dataframe_prediction_one_predictor<-data.frame(prediction_one_predictor=double())
dataframe_prediction_two_predictor<-data.frame(prediction_two_predictor=double())
for (i in 1:50){
  
  linear_model_1_predictor<-lm(rental~bathrooms,data=trainingset)
  linear_model_2_predictor<-lm(rental~pricePerSizeUnitBuiltUp+bathrooms,data=trainingset)
  prediction_for_model_with_1_predictor<-predict(linear_model_1_predictor,testset)
  prediction_for_model_with_2_predictors<-predict(linear_model_2_predictor,testset)
  #Compute the kendall’s tau correlations for the predictions of the models.
  
  newRow_1_predictor<-data.frame(prediction_one_predictor=cor(prediction_for_model_with_1_predictor,testset$rental,method = "kendall"))
  dataframe_prediction_one_predictor<-rbind(dataframe_prediction_one_predictor,newRow_1_predictor)
  newRow_2_predictor<-data.frame(prediction_two_predictor=cor(prediction_for_model_with_2_predictors,testset$rental,method = "kendall"))
  dataframe_prediction_two_predictor<-rbind(dataframe_prediction_two_predictor,newRow_2_predictor)
}

Next, we are going to conduct nonparametric test due to the population sample size is small.

qqnorm(dataframe_prediction_two_predictor$prediction_two_predictor, pch = 1, frame = FALSE)
qqline(dataframe_prediction_two_predictor$prediction_two_predictor, col = "steelblue", lwd = 2)

A nonparametric test is obtained because qqplot shows the graph is not normally distributed. Next, we perform Wilcoxon test for significance using both data frames.

wilcox.test(dataframe_prediction_one_predictor$prediction_one_predictor,dataframe_prediction_two_predictor$prediction_two_predictor,exact = FALSE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  dataframe_prediction_one_predictor$prediction_one_predictor and dataframe_prediction_two_predictor$prediction_two_predictor
## W = 0, p-value <2e-16
## alternative hypothesis: true location shift is not equal to 0

7. Conclusion

To summarise the linear regression in section 6,
  1. In section 6.2, the Normal Q-Q plot for second linear regression model(ii) looks much better than the first linear regression model(i).
  2. Second model is an improvement. Median residual error is -19, which is better than -106 from the first model.
  3. In section 6.3, result of second regression model(two predictors) with test model show similar trend and it is better than first regression model(one predictor) with test model.
  4. Kendalls tau accuracy score for model2 (71.26%) is higher than model1(47.95%).
  5. Wilcoxon Signed-Rank test shows the p-value 2e-16 is way lesser than 0.05, we can conclude that there is a significant difference between the performance of the two machine learning models.
  1. Also, model with 2 predictors is significantly better than 1 predictor, which is significant and not by chance.