The aim of our group project is to analyze the latest trends in Kuala Lumpur’s house rental market. No dataset with up-to-date rental data is available on the web, so we chose to scrape one of the most popular property portals in Malaysia: www.iproperty.com.my. Our team performed the raw data collection on 16th May 2021 for 24 areas in Kuala Lumpur, gathering 4759 rows of data across 18 variables. You can access the dataset from our GitHub repo.
We have selected 24 areas in Kuala Lumpur:
|  |  |  |  |
|---|---|---|---|
| Ampang | Bukit Jalil | Mont Kiara | Sri Petaling |
| Bangsar | Cheras | Segambut | Sungai Besi |
| Bangsar South | Damansara Heights | Sentul | Taman Desa |
| Batu Caves | Kepong | Seputeh | Titiwangsa |
| Brickfields | KLCC | Setapak | Taman Tun Dr Ismail |
| Bukit Bintang | Kuchai Lama | Sri Hartamas | Wangsa Maju |
The details of the 18 features are as follows:
[1] name: full name with area
[2] propertyName: the name of the property
[3] rental: monthly rental price
[4] pricePerSizeUnitBuiltUp: rental price per square foot of built-up size
[5] address: address of the property
[6] lat: latitude
[7] lng: longitude
[8] city: city name
[9] district: district name
[10] bedrooms: total number of bedrooms
[11] bathrooms: total number of bathrooms
[12] builtUp: built-up size in square feet
[13] carParks: total number of car parks
[14] furnishing: furnishing type - fully/partially furnished or unfurnished
[15] postedDate: the date the ad was posted on the site
[16] buildingId: unique ID of the building
[17] unitType: unit type, for example, corner lot
[18] propertyType: property type, condo or serviced apartment
The R packages used for our project are as follows:
#web-scraping
library(selectr)
library(xml2)
library(rvest)
library(jsonlite)
#data pre-processing
library(plyr)
library(dplyr)
library(tidyverse)
library(Amelia)
#Charts
library(ggplot2)
library(corrplot)
#display/formatting
library(knitr)
library(gridExtra)
library(kableExtra)
library(rmarkdown)
#Linear Regression/Classification
library(PerformanceAnalytics)
library(caret)
library(randomForest)
We initially planned to use the rvest package to scrape the website, but the site blocks the user agents of web scrapers and returns “ERROR 429” (Too Many Requests). So, we downloaded the HTML pages manually. After inspecting the source code, we found that each page embeds its listing data as JSON inside a script tag. We used the xml2 package together with rvest and selectr to extract the script contents, parsed the JSON with jsonlite, and then exported the result to CSV.
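For reference, here is a minimal sketch of the kind of direct request the portal rejects; the listing URL below is hypothetical, since the real pages were saved manually from a browser:
#hypothetical listing URL - the portal blocks the default scraper user agent,
#so read_html() fails with an error mentioning HTTP 429
url <- "https://www.iproperty.com.my/rent/wangsa-maju/"
result <- tryCatch(read_html(url), error = function(e) conditionMessage(e))
print(result)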
setwd('/Users/n2n/RWorkingDirectory/WQD7004_group_project_git/')
pagesNum <- c(1:10)
inputFileNamePrefix <- "iprop_wangsa_maju"
outputCsvFileName <- paste0(inputFileNamePrefix, "_df.csv")
#create a function to extract the JSON and write it to a file
extractJsonFromPage <- function(inputFileName, outputFileName){
  iprop_html <- read_html(inputFileName)
  #the listing data sits in one of the last <script> tags of the page
  script_nodes <- tail(iprop_html %>% html_nodes("script"))
  pop <- html_text(script_nodes[[5]])
  #slice out the JSON object between the 'asyncData' key and its closing braces
  startIndex <- gregexpr('asyncData', pop)
  endIndex <- gregexpr('false}}', pop)
  jsonString <- substr(pop, startIndex[[1]] - 2, endIndex[[1]] + 6)
  writeLines(jsonString, outputFileName)
}
for(i in pagesNum){
  inputFileName <- paste0(inputFileNamePrefix, i, ".html")
  outputFileName <- paste0(inputFileNamePrefix, i, ".json")
  extractJsonFromPage(inputFileName, outputFileName)
}
#------------------------------
#read JSON file and convert to csv
#------------------------------
#create an empty dataframe with 18 columns
df <- data.frame(matrix(ncol = 18, nrow = 0), stringsAsFactors = FALSE)
#read json file
convertJsonToList <- function(inputFileName){
  propJson <- read_json(inputFileName)
  listings <- (propJson['listings'])[[1]]
  listingItems <- (listings['items'])[[1]]
  mylist <- list() #create an empty list
  for(i in 1:length(listingItems)){
    listingItem1 <- listingItems[[i]]
    fullName <- listingItem1$title
    rental <- listingItem1$prices[[1]]$min
    address <- listingItem1$address$formattedAddress
    lat <- listingItem1$address$lat
    lng <- listingItem1$address$lng
    city <- listingItem1$multilanguagePlace$`en-GB`$level1
    district <- listingItem1$multilanguagePlace$`en-GB`$level2
    propertyName <- listingItem1$multilanguagePlace$`en-GB`$level3
    bedrooms <- listingItem1$attributes$bedroom
    bathrooms <- listingItem1$attributes$bathroom
    builtUp <- listingItem1$attributes$builtUp
    carParks <- listingItem1$attributes$carPark
    furnishing <- listingItem1$attributes$furnishing
    pricePerSizeUnitBuiltUp <- listingItem1$attributes$pricePerSizeUnitBuiltUp
    buildingId <- listingItem1$attributes$buildingId
    unitType <- listingItem1$attributes$unitType
    postedDate <- listingItem1$postedAt
    propertyType <- listingItem1$propertyType
    mylist[[i]] <- list(fullName, propertyName, rental, pricePerSizeUnitBuiltUp, address,
                        lat, lng, city, district, bedrooms, bathrooms, builtUp, carParks,
                        furnishing, postedDate, buildingId, unitType, propertyType)
  }
  return(mylist)
}
myDfList <- list()
for(i in pagesNum){
  inputFileName <- paste0(inputFileNamePrefix, i, ".json")
  myDfList[[i]] <- convertJsonToList(inputFileName)
}
#combine the per-page lists into one data frame with 18 columns
mainDf <- data.frame(matrix(ncol = 18, nrow = 0), stringsAsFactors = FALSE)
for(i in pagesNum){
  temp_df <- do.call("rbind", myDfList[[i]])
  mainDf <- rbind(mainDf, temp_df)
}
titles <- c("name", "propertyName", "rental","pricePerSizeUnitBuiltUp", "address",
"lat", "lng", "city", "district", "bedrooms", "bathrooms", "builtUp", "carParks",
"furnishing", "postedDate", "buildingId", "unitType", "propertyType")
# First coerce the data.frame to all-character
mainDf <- data.frame(lapply(mainDf, as.character), stringsAsFactors=FALSE)
write.table(mainDf, file = outputCsvFileName, sep = ",", col.names = titles, row.names = FALSE)
We have now extracted the JSON from the HTML and converted the JSON files to CSV. All 24 CSV files under the directory were merged using the ldply function as follows:
setwd("/Users/n2n/RWorkingDirectory/WQD7004/merged_files")
dataset <- ldply(list.files(), read.csv, header=TRUE)
titles <- c("name", "propertyName", "rental","pricePerSizeUnitBuiltUp", "address",
"lat", "lng", "city", "district", "bedrooms", "bathrooms", "builtUp", "carParks",
"furnishing", "postedDate", "buildingId", "unitType", "propertyType")
write.table(dataset, file = "iprop_master_dataframe.csv", sep = ",", col.names = titles, row.names = FALSE)
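As a quick sanity check after merging, the combined data frame should contain all rows from the 24 files (4759, as noted earlier):
#verify the merged row count
nrow(dataset)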
We import the CSV into a master data frame, which we call “master_data”. Subsequent analysis will be based on this data frame.
master_data <- read.csv("/Users/n2n/RWorkingDirectory/WQD7004_group_project_git/iprop_master_dataframe.csv")
The dimension of the original dataset:
#dimension of the data frame
dim(master_data)
## [1] 4759 18
We use the paged_table function from the rmarkdown package to display the first 20 rows of the dataset:
#First 20 rows of the dataset
paged_table(head(master_data,20))
# Determine the structure of the dataset
str(master_data)
## 'data.frame': 4759 obs. of 18 variables:
## $ name : chr "M City, Ampang" "M City, Ampang" "M City, Ampang" "M City, Ampang" ...
## $ propertyName : chr "M City" "M City" "M City" "M City" ...
## $ rental : int 1300 1550 1800 2200 1350 1600 2200 1550 1650 2050 ...
## $ pricePerSizeUnitBuiltUp: chr "2.44" "1.98" "2.3" "2.25" ...
## $ address : chr "Jalan Ampang, Ampang, 50450, Kuala Lumpur" "Jalan Ampang, Ampang, 50450, Kuala Lumpur" "Jalan Ampang, Ampang, 50450, Kuala Lumpur" "Jalan Ampang, Ampang, 50450, Kuala Lumpur" ...
## $ lat : chr "3.15934" "3.15934" "3.15934" "3.15934" ...
## $ lng : chr "101.746861" "101.746861" "101.746861" "101.746861" ...
## $ city : chr "Kuala Lumpur" "Kuala Lumpur" "Kuala Lumpur" "Kuala Lumpur" ...
## $ district : chr "Ampang" "Ampang" "Ampang" "Ampang" ...
## $ bedrooms : chr "Studio" "1" "1" "2" ...
## $ bathrooms : chr "1" "1" "1" "2" ...
## $ builtUp : chr "533" "781" "781" "977" ...
## $ carParks : chr "1" "1" "1" "2" ...
## $ furnishing : chr "Fully furnished" "Fully furnished" "Fully furnished" "Fully furnished" ...
## $ postedDate : chr "2021-05-13T05:37:19.004Z" "2021-05-13T05:35:23.069Z" "2021-05-13T05:33:01.874Z" "2021-05-13T05:31:28.698Z" ...
## $ buildingId : chr "2247" "2247" "2247" "2247" ...
## $ unitType : chr "NULL" "NULL" "NULL" "Intermediate" ...
## $ propertyType : chr "Serviced Residence" "Serviced Residence" "Serviced Residence" "Serviced Residence" ...
# Summary of data
summary(master_data)
## name propertyName rental
## Length:4759 Length:4759 Min. : 400
## Class :character Class :character 1st Qu.: 1600
## Mode :character Mode :character Median : 2000
## Mean : 4089
## 3rd Qu.: 2900
## Max. :1900000
## pricePerSizeUnitBuiltUp address lat
## Length:4759 Length:4759 Length:4759
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## lng city district bedrooms
## Length:4759 Length:4759 Length:4759 Length:4759
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## bathrooms builtUp carParks furnishing
## Length:4759 Length:4759 Length:4759 Length:4759
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## postedDate buildingId unitType propertyType
## Length:4759 Length:4759 Length:4759 Length:4759
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
First, we are going to remove the rows that do not have a propertyName; there are 20 such rows. We also noticed one record with propertyType “Apartment”; since our study focuses only on condominiums and serviced apartments, we remove this record too.
sprintf("Number of listings without a name: %s", count(master_data[master_data$propertyName == 'NULL',]))
## [1] "Number of listings without a name: 20"
master_data <- master_data %>% filter(propertyName != 'NULL')
master_data <- master_data %>% filter(!(propertyType == 'Apartment'))
count(master_data) #4738 records
## n
## 1 4738
Based on the structure, all columns are characters except the rental column, so we need to convert the data types first.
master_data$lat <- as.numeric(master_data$lat)
master_data$lng <- as.numeric(master_data$lng)
master_data$bathrooms <-as.numeric(master_data$bathrooms)
master_data$builtUp <-as.numeric(gsub(",","",master_data$builtUp))
master_data$carParks <-as.numeric(master_data$carParks)
master_data$postedDate <-as.Date(master_data$postedDate)
master_data$pricePerSizeUnitBuiltUp<-as.numeric(master_data$pricePerSizeUnitBuiltUp)
master_data$buildingId <-as.numeric(master_data$buildingId)
Total number of missing values in the dataset:
sum(is.na(master_data))
## [1] 598
Let’s check where the missing values are located:
colSums(is.na(master_data))
## name propertyName rental
## 0 0 0
## pricePerSizeUnitBuiltUp address lat
## 160 0 1
## lng city district
## 1 0 0
## bedrooms bathrooms builtUp
## 0 1 15
## carParks furnishing postedDate
## 420 0 0
## buildingId unitType propertyType
## 0 0 0
Next, we draw a missingness map to visualise:
missmap(master_data)
We impute the missing values with the mean of each district. The imputed carParks and bathrooms values are rounded, since they should be whole numbers; we keep the lat and lng imputation simple, as these columns will not affect our analysis.
master_data_imputed <- master_data %>%
  group_by(district) %>%
  mutate(builtUp = ifelse(is.na(builtUp), mean(builtUp, na.rm = TRUE), builtUp),
         carParks = ifelse(is.na(carParks), round(mean(carParks, na.rm = TRUE), digits = 0), carParks),
         bathrooms = ifelse(is.na(bathrooms), round(mean(bathrooms, na.rm = TRUE), digits = 0), bathrooms),
         lat = ifelse(is.na(lat), round(mean(lat, na.rm = TRUE), digits = 0), lat),
         lng = ifelse(is.na(lng), round(mean(lng, na.rm = TRUE), digits = 0), lng),
         pricePerSizeUnitBuiltUp = ifelse(is.na(pricePerSizeUnitBuiltUp), round(mean(pricePerSizeUnitBuiltUp, na.rm = TRUE), digits = 0), pricePerSizeUnitBuiltUp))
Let’s verify the missing values again:
colSums(is.na(master_data_imputed))
## name propertyName rental
## 0 0 0
## pricePerSizeUnitBuiltUp address lat
## 0 0 0
## lng city district
## 0 0 0
## bedrooms bathrooms builtUp
## 0 0 0
## carParks furnishing postedDate
## 0 0 0
## buildingId unitType propertyType
## 0 0 0
We are going to detect outliers based on percentiles. With the percentile method, all observations that lie outside the interval formed by the 2.5th and 97.5th percentiles are considered potential outliers. Other percentiles, such as the 1st and 99th or the 5th and 95th, can also be used to construct the interval.
Based on the summary of the data, we can observe that the main feature (i.e. rental) has a huge range. Let’s check the summary and boxplot:
summary(master_data_imputed$rental)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 400 1600 2000 4050 2900 1900000
Boxplot of rental price
# Store the graph
box_plot <- ggplot(master_data_imputed, aes(y = rental))
# Add the geometric object box plot
box_plot +
geom_boxplot() +coord_flip()+ggtitle("Overall Price Boxplot")
Subsequently, we are going to identify the potential outliers in the rental price column with the percentile method.
lower_bound <- quantile(master_data_imputed$rental, 0.025)
upper_bound <- quantile(master_data_imputed$rental, 0.975)
outlier_ind <- master_data_imputed %>%
  filter(rental < lower_bound | rental > upper_bound)
# A total of 235 records were flagged as outliers, falling outside the rental bounds of 1150 and 12000
## exclude the outliers from the data
master_data_clean <- master_data_imputed %>%
  filter(rental >= lower_bound & rental <= upper_bound)
We will examine the summary of the rental price again after removing the outliers with the percentile method.
## Summary of rental after removing outliers
summary(master_data_clean$rental)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1100 1600 2000 2588 2800 11000
We noticed outliers in the carParks column too: values above 20, which are most likely data-entry errors. We will replace those values with 1.
master_data_clean[master_data_clean$carParks > 20, "carParks"] <- 1
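A quick check confirms the replacement (consistent with the carParks maximum of 4 in the summary shown later):
#no carParks value should exceed 20 after the replacement
max(master_data_clean$carParks)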
Furthermore, when we check the unique values, we notice that the “furnishing” column is not standardized.
unique(master_data_clean$furnishing)
## [1] "Fully furnished" "NULL" "Partly furnished" "Fully Furnished"
## [5] "Unfurnished" "Partly Furnished"
Hence, we need to standardize the furnishing column.
master_data_clean <- master_data_clean %>%
mutate(furnishing = ifelse(furnishing == "Fully furnished" | furnishing == "Fully Furnished", "Fully Furnished",
ifelse(furnishing == "Partly furnished" | furnishing == "Partly Furnished", "Partly Furnished",
ifelse(furnishing == "NULL", "Unknown",
furnishing))))
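An equivalent, arguably more readable formulation uses dplyr::case_when instead of nested ifelse calls; this sketch produces the same result:
#same standardization written with case_when
master_data_clean <- master_data_clean %>%
  mutate(furnishing = case_when(
    furnishing %in% c("Fully furnished", "Fully Furnished") ~ "Fully Furnished",
    furnishing %in% c("Partly furnished", "Partly Furnished") ~ "Partly Furnished",
    furnishing == "NULL" ~ "Unknown",
    TRUE ~ furnishing
  ))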
Let’s check the values again after standardization:
unique(master_data_clean$furnishing)
## [1] "Fully Furnished" "Unknown" "Partly Furnished" "Unfurnished"
We noticed that the bedrooms, furnishing, propertyType, unitType and district columns need to be converted to factors. We will do this conversion using the dplyr package:
master_data_clean <- master_data_clean %>%
mutate(furnishing=factor(furnishing)) %>%
mutate(bedrooms=factor(bedrooms)) %>%
mutate(propertyType=factor(propertyType)) %>%
mutate(unitType=factor(unitType)) %>%
mutate(district=factor(district))
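With dplyr 1.0.0 or later, the repeated mutate calls can be collapsed into a single across call; a sketch of the equivalent conversion:
#equivalent factor conversion in one mutate
master_data_clean <- master_data_clean %>%
  mutate(across(c(furnishing, bedrooms, propertyType, unitType, district), factor))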
There are 18 features in the dataset that we have curated; we will drop the features that we will not be using for the analysis:
master_data_clean <- master_data_clean %>%
select(propertyName, rental, pricePerSizeUnitBuiltUp, lat, lng, district, bedrooms, bathrooms, builtUp, carParks, furnishing, buildingId, unitType, propertyType)
By now, we have handled all the data pre-processing steps. After removing outliers, the data frame has 4549 observations and 14 features.
summary(master_data_clean)
## propertyName rental pricePerSizeUnitBuiltUp lat
## Length:4549 Min. : 1100 Min. : 0.78 Min. :3.00
## Class :character 1st Qu.: 1600 1st Qu.: 1.71 1st Qu.:3.11
## Mode :character Median : 2000 Median : 2.14 Median :3.15
## Mean : 2588 Mean : 2.40 Mean :3.14
## 3rd Qu.: 2800 3rd Qu.: 2.80 3rd Qu.:3.18
## Max. :11000 Max. :38.00 Max. :3.25
##
## lng district bedrooms
## Min. :102 Wangsa Maju : 209 3 :1778
## 1st Qu.:102 Kampung Kerinchi (Bangsar South): 200 2 : 945
## Median :102 Segambut : 200 3+1 : 559
## Mean :102 Sungai Besi : 199 1 : 342
## 3rd Qu.:102 Taman Desa : 199 Studio : 271
## Max. :102 Brickfields : 198 4 : 216
## (Other) :3344 (Other): 438
## bathrooms builtUp carParks furnishing
## Min. :1.00 Min. : 50 Min. :1.00 Fully Furnished :2609
## 1st Qu.:2.00 1st Qu.: 823 1st Qu.:1.00 Partly Furnished:1193
## Median :2.00 Median : 1021 Median :1.00 Unknown : 684
## Mean :2.13 Mean : 1250 Mean :1.46 Unfurnished : 63
## 3rd Qu.:2.00 3rd Qu.: 1300 3rd Qu.:2.00
## Max. :7.00 Max. :184368 Max. :4.00
##
## buildingId unitType propertyType
## Min. : 6 NULL :2610 Condominium :2225
## 1st Qu.:1737 Intermediate:1068 Serviced Residence:2324
## Median :2857 Corner lot : 733
## Mean :3072 Duplex : 40
## 3rd Qu.:4195 End lot : 38
## Max. :7013 Studio : 28
## (Other) : 32
paged_table(master_data_clean)
# Store the graph
box_plot <- ggplot(master_data_clean, aes(x = district,y = rental))
# Add the geometric object box plot
box_plot +
geom_boxplot() +coord_flip()+ggtitle("Boxplot of Rental by District")
Usually, the mean would be taken to plot the average. In this case, however, the data is not evenly distributed and many outliers were detected in the boxplot in section 4.1, so it is more reasonable to compare using the median instead of the mean.
group_by_district <- aggregate(list(master_data_clean$rental), list(master_data_clean$district), median)
#note: "Avg" here refers to the median rental, as discussed above
colnames(group_by_district) <- c("District","Avg_rental_price_by_District")
group_by_district <- group_by_district[order(group_by_district$Avg_rental_price_by_District, decreasing = TRUE),]
#reset the row index
rownames(group_by_district) <- 1:24
top_10_district <- head(group_by_district, 10)
#plot the graph: top 10
options(repr.plot.width=15, repr.plot.height=11)
plot1 <- ggplot(data = top_10_district, mapping = aes(x = reorder(District, Avg_rental_price_by_District), y = Avg_rental_price_by_District)) +
geom_bar(stat = "identity", alpha = .8, size = 1.5, fill = "navyblue") +
labs(x = "District", y = "Average monthly rental price in (Ringgit Malaysia)") +
coord_flip() +
geom_label(mapping = aes(label = round(Avg_rental_price_by_District, 1)), size = 4, fill = "#F5FFFA", fontface = "bold") + ggtitle("Top 10 most expensive areas in Kuala Lumpur")
plot1
The most expensive area in Kuala Lumpur can be said to be Mont Kiara, where the median monthly rental price is around RM 3900.
Next, we will plot out the 10 most affordable areas by rental price per month:
#plot the last 10
last_10_districts <- tail(group_by_district, 10)
options(repr.plot.width=15, repr.plot.height=11)
plot2 <- ggplot(data = last_10_districts, mapping = aes(x = reorder(District, -Avg_rental_price_by_District), y = Avg_rental_price_by_District)) +
geom_bar(stat = "identity", alpha = .8, size = 1.5, fill = "#FF6666") +
labs(x = "District", y = "Average monthly rental price in (Ringgit Malaysia)") +
coord_flip() +
geom_label(mapping = aes(label = round(Avg_rental_price_by_District, 1)), size = 4, fill = "#F5FFFA", fontface = "bold") + ggtitle("Top 10 most affordable areas in Kuala Lumpur")
plot2
From the above visualization, we can summarize that Setapak and Kepong are the two most affordable areas in Kuala Lumpur.
The built-up size of the houses varies from one building to another. In order to analyse the units, we have categorised them into 6 general size bands. The distribution of house sizes can be examined in the donut plot below.
library(ggrepel)
count_by_condo_size = function(min, max){
  count <- master_data_clean %>% filter(builtUp > min & builtUp <= max) %>% nrow()
  count
}
df_builtup_summary <- data.frame(
  category = c("<=600 sq.ft.", "600-900 sq.ft.", "900-1200 sq.ft.", "1200-1500 sq.ft.", "1500-1800 sq.ft.", ">1800 sq.ft."),
  count = c(count_by_condo_size(0, 600),
            count_by_condo_size(600, 900),
            count_by_condo_size(900, 1200),
            count_by_condo_size(1200, 1500),
            count_by_condo_size(1500, 1800),
            count_by_condo_size(1800, max(master_data_clean$builtUp)))
)
df_builtup_summary$percentage = df_builtup_summary$count / sum(df_builtup_summary$count)* 100
df_builtup_summary$ymax = cumsum(df_builtup_summary$percentage)
df_builtup_summary$ymin = c(0, head(df_builtup_summary$ymax, n = -1))
donut_plot = ggplot(df_builtup_summary, aes(fill = category, ymax = ymax, ymin = ymin, xmax = 100, xmin = 80)) +
geom_rect(colour = "black") +
coord_polar(theta = "y") +
xlim(c(0, 100)) +
geom_label_repel(aes(label = paste(round(percentage,2),"%"), x = 100, y = (ymin + ymax)/2),inherit.aes = F, show.legend = F, size = 5)+
theme(legend.title = element_text(colour = "black", size = 10, face = "bold"),
legend.text = element_text(colour = "black", size = 10),
panel.grid = element_blank(),
axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank()) +
annotate("text", x = 0, y = 0, size = 10, label = "Built-up Size")
donut_plot
From the above chart, we can conclude that the most common unit sizes fall between 600-900 sq.ft., 900-1200 sq.ft. and 1200-1500 sq.ft., which together make up about 77.36% of all units.
The rental price rankings in sections 4.2 and 4.3 do not take the built-up size (square feet) into account. This section plots the average price per built-up square foot, with the minimum and maximum shown as a ribbon chart.
# min, max, median per district (excluding extreme price-per-sqft values of RM 10 and above)
summary_df <- master_data_clean %>%
  filter(pricePerSizeUnitBuiltUp < 10) %>%
  group_by(district) %>%
  summarize(Min = min(pricePerSizeUnitBuiltUp),
            Avg = median(pricePerSizeUnitBuiltUp),
            Max = max(pricePerSizeUnitBuiltUp))
summary_df <- summary_df %>% arrange(Max)
# make district an ordered factor
summary_df$district <- factor(summary_df$district, levels = summary_df$district)
p1 <- ggplot(summary_df, mapping = aes(district, Avg, group = 1)) +
geom_ribbon(aes(ymin=Min,ymax=Max),color="yellow",fill = "yellow",alpha=0.5) +
geom_line(color="#69b3a2") +
labs(x = "District", y = "Price per built up sq. ft (RM)",
title = "Average price per built up size (with max and min price)") +
theme(axis.text.x=element_text(angle = 90, hjust = 0))
print(p1)
From the above ribbon chart, we can examine how rental price per square foot varies by district. The green line shows the average (median) rental per built-up square foot, and the yellow band depicts the maximum and minimum prices. Some districts have fairly uniform prices, while others have extreme price points. This could be due to location, facilities, and many other factors.
To calculate the correlations among the variables, we drop the propertyName column, then convert the factors to their underlying integer codes by applying unclass.
master_data_clean_cor <- master_data_clean %>% select(rental, pricePerSizeUnitBuiltUp, lat, lng, bedrooms, bathrooms, builtUp, carParks, furnishing, buildingId, unitType, propertyType, district)
master_data_clean_cor <- sapply(master_data_clean_cor, unclass)
correlations <- cor(master_data_clean_cor)
corrplot(correlations, type="upper", order="hclust")
From this matrix, we can conclude that the main features that might influence rental price are pricePerSizeUnitBuiltUp, carParks and bathrooms.
Here’s the full correlation matrix among the features:
print(correlations)
## rental pricePerSizeUnitBuiltUp lat
## rental 1.0000000 0.37336359 0.01984176
## pricePerSizeUnitBuiltUp 0.3733636 1.00000000 -0.02292580
## lat 0.0198418 -0.02292580 1.00000000
## lng -0.1080804 -0.02566229 -0.06899407
## bedrooms 0.1498092 -0.04019881 0.02210164
## bathrooms 0.5930987 -0.12115902 -0.03064740
## builtUp 0.0709167 -0.01927450 0.04443338
## carParks 0.3236340 -0.05456600 0.01624431
## furnishing -0.1164556 -0.19106246 0.09729998
## buildingId -0.1810985 0.00710752 0.09962887
## unitType -0.1544324 -0.00104958 0.02960520
## propertyType -0.1227679 0.16640108 0.18100744
## district -0.1485355 -0.11711470 0.00730706
## lng bedrooms bathrooms builtUp
## rental -0.1080804 1.49809e-01 0.5930987 0.0709167
## pricePerSizeUnitBuiltUp -0.0256623 -4.01988e-02 -0.1211590 -0.0192745
## lat -0.0689941 2.21016e-02 -0.0306474 0.0444334
## lng 1.0000000 -6.47466e-02 -0.0636709 -0.0242899
## bedrooms -0.0647466 1.00000e+00 0.2936839 0.0279011
## bathrooms -0.0636709 2.93684e-01 1.0000000 0.0853851
## builtUp -0.0242899 2.79011e-02 0.0853851 1.0000000
## carParks -0.1219492 2.78792e-01 0.4357333 0.0719533
## furnishing 0.0257199 8.60426e-03 0.0479502 0.0485018
## buildingId 0.0790843 -1.26791e-01 -0.1312289 0.0109746
## unitType 0.0296154 -7.10433e-05 -0.1341334 -0.0031640
## propertyType -0.0430179 -1.20938e-01 -0.3226945 -0.0148999
## district -0.0295029 5.89924e-02 0.0421573 -0.0375064
## carParks furnishing buildingId unitType
## rental 0.3236340 -0.11645559 -0.18109846 -1.54432e-01
## pricePerSizeUnitBuiltUp -0.0545660 -0.19106246 0.00710752 -1.04958e-03
## lat 0.0162443 0.09729998 0.09962887 2.96052e-02
## lng -0.1219492 0.02571992 0.07908426 2.96154e-02
## bedrooms 0.2787920 0.00860426 -0.12679051 -7.10433e-05
## bathrooms 0.4357333 0.04795020 -0.13122887 -1.34133e-01
## builtUp 0.0719533 0.04850179 0.01097463 -3.16400e-03
## carParks 1.0000000 0.02674755 0.12640943 -5.51677e-02
## furnishing 0.0267475 1.00000000 0.10128448 1.37606e-01
## buildingId 0.1264094 0.10128448 1.00000000 6.55726e-02
## unitType -0.0551677 0.13760564 0.06557264 1.00000e+00
## propertyType -0.1606065 -0.07963709 0.30500354 6.60764e-02
## district -0.0791855 0.07043111 -0.03447826 7.62658e-03
## propertyType district
## rental -0.1227679 -0.14853551
## pricePerSizeUnitBuiltUp 0.1664011 -0.11711470
## lat 0.1810074 0.00730706
## lng -0.0430179 -0.02950288
## bedrooms -0.1209377 0.05899242
## bathrooms -0.3226945 0.04215730
## builtUp -0.0148999 -0.03750641
## carParks -0.1606065 -0.07918547
## furnishing -0.0796371 0.07043111
## buildingId 0.3050035 -0.03447826
## unitType 0.0660764 0.00762658
## propertyType 1.0000000 -0.04668294
## district -0.0466829 1.00000000
The objective of this section is to classify whether a house is expensive or cheap. We take the median value instead of the mean as the threshold, because our dataset is not evenly distributed.
First, we create a new target variable called “rental_class”: 0 indicates a rental at or below the median, and 1 indicates a rental above the median.
median_rental <- median(master_data_clean$rental)
master_data_clean$rental_class <- ifelse(master_data_clean$rental > median_rental, 1, 0)
print(median_rental)
## [1] 2000
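As a quick sanity check, we can inspect the class balance of the new target; the totals should match the actual-class row sums of the Random Forest confusion matrix in the next section:
#count listings at/below vs above the median rental
table(master_data_clean$rental_class)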
From our dataset, we take the median monthly rental price of RM 2000 as the cut-off and perform Random Forest classification:
master_data_analysis <- master_data_clean[3:15] #drop propertyName and rental; rental would leak the target
master_data_analysis$rental_class <- as.factor(master_data_analysis$rental_class)
rf_model <- randomForest(rental_class ~ ., data=master_data_analysis, proximity=TRUE)
varImpPlot(rf_model, main="Random Forest Model Classification")
Based on the MeanDecreaseGini chart above, we can conclude that the most important features for classification are builtUp, pricePerSizeUnitBuiltUp, district, bedrooms and bathrooms.
Next, let’s print out the confusion matrix of our RF model.
rf_model
##
## Call:
## randomForest(formula = rental_class ~ ., data = master_data_analysis, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 2.44%
## Confusion matrix:
## 0 1 class.error
## 0 2374 51 0.0210309
## 1 60 2064 0.0282486
The model builds a total of 500 decision trees, and the out-of-bag (OOB) error rate is 2.44%, which is quite low.
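As a sanity check, the headline error rate can be recovered from the confusion matrix by dividing the misclassified counts by the total number of observations:
#(off-diagonal counts) / (all counts)
oob_error <- (51 + 60) / (2374 + 51 + 60 + 2064)
round(oob_error * 100, 2) #2.44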
Objective: To build a linear model to predict house price.
Target Variable: rental
First predictor: bathrooms
Second predictor: pricePerSizeUnitBuiltUp
Let’s calculate the correlation coefficients between the target variable and the two predictors for 4 different datasets that we have sampled:
Dataset 1 is a collection of properties from three areas: Batu Caves, Bukit Jalil and Bukit Bintang.
new_dataset1 <- master_data_clean %>%
filter(district == 'Batu Caves'|district == 'Bukit Jalil'|district == 'Bukit Bintang')
dataset1_corb <- cor.test(new_dataset1$rental,new_dataset1$bathrooms,method = "pearson")
dataset1_corp <- cor.test(new_dataset1$rental,new_dataset1$pricePerSizeUnitBuiltUp,method = "pearson")
sprintf("Correlation coefficient of rental vs bathrooms: %f", dataset1_corb$estimate)
## [1] "Correlation coefficient of rental vs bathrooms: 0.301215"
sprintf("Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: %f", dataset1_corp$estimate)
## [1] "Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: 0.756856"
#district is kept automatically because the data is still grouped by district, giving 4 columns
new_dataset1 <- new_dataset1 %>% select(rental, pricePerSizeUnitBuiltUp, bathrooms)
chart.Correlation(new_dataset1[,c(2,3,4)], histogram = TRUE, pch = 19)
Dataset 2 consists of properties from: Batu Caves, Cheras, Kuchai Lama, Sri Hartamas and Sri Petaling.
new_dataset2 <- master_data_clean %>%
filter(district == 'Batu Caves'|district == 'Cheras'|district == 'Kuchai Lama'|district == 'Sri Hartamas'|district == 'Sri Petaling')
dataset2_corb <- cor.test(new_dataset2$rental,new_dataset2$bathrooms,method = "pearson")
dataset2_corp <- cor.test(new_dataset2$rental,new_dataset2$pricePerSizeUnitBuiltUp,method = "pearson")
sprintf("Correlation coefficient of rental vs bathrooms: %f", dataset2_corb$estimate)
## [1] "Correlation coefficient of rental vs bathrooms: 0.641577"
sprintf("Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: %f", dataset2_corp$estimate)
## [1] "Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: 0.278929"
new_dataset2 <- new_dataset2 %>% select(rental, pricePerSizeUnitBuiltUp, bathrooms)
chart.Correlation(new_dataset2[,c(2,3,4)], histogram = TRUE, pch = 19)
Dataset 3 is a compilation of properties from Sri Hartamas, Bangsar and Mont Kiara.
new_dataset3 <- master_data_clean %>%
filter(district == 'Sri Hartamas'| district == 'Bangsar'|district == 'Mont Kiara')
dataset3_corb <- cor.test(new_dataset3$rental,new_dataset3$bathrooms,method = "pearson")
dataset3_corp <- cor.test(new_dataset3$rental,new_dataset3$pricePerSizeUnitBuiltUp,method = "pearson")
sprintf("Correlation coefficient of rental vs bathrooms: %f", dataset3_corb$estimate)
## [1] "Correlation coefficient of rental vs bathrooms: 0.886179"
sprintf("Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: %f", dataset3_corp$estimate)
## [1] "Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: -0.021008"
new_dataset3 <- new_dataset3 %>% select(rental, pricePerSizeUnitBuiltUp, bathrooms)
chart.Correlation(new_dataset3[,c(2,3,4)], histogram = TRUE, pch = 19)
Dataset 4 is a set of units from Batu Caves, Cheras, Kuchai Lama, Sri Hartamas and Sri Petaling (the same districts as dataset 2).
new_dataset4 <- master_data_clean %>%
filter(district == 'Batu Caves'|district == 'Cheras'|district == 'Kuchai Lama'|district == 'Sri Hartamas'|district == 'Sri Petaling')
dataset4_corb <- cor.test(new_dataset4$rental,new_dataset4$bathrooms,method = "pearson")
dataset4_corp <- cor.test(new_dataset4$rental,new_dataset4$pricePerSizeUnitBuiltUp,method = "pearson")
sprintf("Correlation coefficient of rental vs bathrooms: %f", dataset4_corb$estimate)
## [1] "Correlation coefficient of rental vs bathrooms: 0.641577"
sprintf("Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: %f", dataset4_corp$estimate)
## [1] "Correlation coefficient of rental vs pricePerSizeUnitBuiltUp: 0.278929"
new_dataset4 <- new_dataset4 %>% select(rental, pricePerSizeUnitBuiltUp, bathrooms)
chart.Correlation(new_dataset4[,c(2,3,4)], histogram = TRUE, pch = 19)
Based on section 6.1, dataset 1 and dataset 3 show the strongest correlations with the two predictors (pricePerSizeUnitBuiltUp and bathrooms, respectively), so we choose them to continue with the linear regression models.
new_dataset <- rbind(new_dataset1,new_dataset3)
dim(new_dataset)
## [1] 1109 4
Next, we are going to split our dataset into a training set and a test set with a 70:30 ratio.
set.seed(123)
split = 0.70 #define a 70%/30% training/testing split of the dataset
training_set_index <- createDataPartition(new_dataset$district,p=split,list=FALSE)
trainingset <- new_dataset[training_set_index,]
testset <- new_dataset[-training_set_index,]
dim(trainingset)
## [1] 779 4
dim(testset)
## [1] 330 4
We now have 779 rows in the training set and 330 rows in the test set.
We are going to train two linear regression models based on the train-test samples.
* Model 1: one predictor, the column most correlated with the target (i.e. bathrooms).
* Model 2: two predictors, bathrooms and pricePerSizeUnitBuiltUp.
The first model will use only one predictor.
linear_model_1_predictor<-lm(rental~bathrooms,data=trainingset)
summary(linear_model_1_predictor)
##
## Call:
## lm(formula = rental ~ bathrooms, data = trainingset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3161 -1106 -106 650 6894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 193.9 123.7 1.57 0.12
## bathrooms 1455.8 51.9 28.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1570 on 777 degrees of freedom
## Multiple R-squared: 0.503, Adjusted R-squared: 0.502
## F-statistic: 787 on 1 and 777 DF, p-value: <2e-16
par(mfrow=c(2,2))
plot(linear_model_1_predictor)
The second model uses two predictors against target variable “rental”.
linear_model_2_predictor<-lm(rental~pricePerSizeUnitBuiltUp+bathrooms,data=trainingset)
summary(linear_model_2_predictor)
##
## Call:
## lm(formula = rental ~ pricePerSizeUnitBuiltUp + bathrooms, data = trainingset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3031 -442 -19 421 4404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3237.8 127.0 -25.5 <2e-16 ***
## pricePerSizeUnitBuiltUp 1044.7 30.5 34.3 <2e-16 ***
## bathrooms 1683.7 33.4 50.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 994 on 776 degrees of freedom
## Multiple R-squared: 0.802, Adjusted R-squared: 0.802
## F-statistic: 1.57e+03 on 2 and 776 DF, p-value: <2e-16
par(mfrow=c(2,2))
plot(linear_model_2_predictor)
In this section, we are going to perform predictions with the models we trained in section 6.3.
#Predictor1
prediction_model1 <- predict(linear_model_1_predictor, testset)
#Predictor2
prediction_model2 <- predict(linear_model_2_predictor, testset)
#plots
par(mfrow=c(2,2))
plot(testset$rental,type="l",lty=1.8,col="red")
lines(prediction_model1,type="l",col="blue")
plot(prediction_model1,type="l",lty=1.8,col="blue")
plot(testset$rental,type="l",lty=1.8,col="red")
lines(prediction_model2,type="l",col="blue")
plot(prediction_model2,type="l",lty=1.8,col="blue")
mtext("Predictions with model 1 (above) and model2 (below)", side = 3, line = -12, outer = TRUE)
Here are two scatter plots of the performance of model 2 on the training set (left) vs the test set (right).
plot_predictions <- function(linear_model_predictor, dataset){
  pred = predict(linear_model_predictor, dataset)
  act = dataset$rental
  tst = data.frame(district = dataset$district, prediction = pred, actual = act)
  tst = mutate(tst, delta = prediction - actual)
  tstvar = round(var(tst$delta)) #variance of the prediction errors (computed but not plotted)
  plot <- ggplot(tst, aes(x = actual, y = prediction, color = district)) +
    geom_point() +
    geom_abline(intercept = 0, slope = 1, linetype = "dashed")
  plot
}
p1 <- plot_predictions(linear_model_2_predictor, trainingset)
p2 <- plot_predictions(linear_model_2_predictor, testset)
grid.arrange(p1, p2, ncol=2)
Let’s calculate the accuracy of the two models:
calculate_accuracy_trainset <- function(linear_model, dataset){
  # Fitted values of the training data
  pred_1_train = linear_model$fitted.values
  # Store actual price and predicted price in a data frame
  actual_fitted = data.frame(actual = dataset$rental, predicted = pred_1_train)
  # Mean absolute percentage difference between predicted and actual values
  abs_diff = mean(abs(actual_fitted$actual - actual_fitted$predicted) / actual_fitted$actual)
  # % accuracy of the prediction on the training set
  accuracy_train = 1 - abs_diff
  accuracy_train
}
calculate_accuracy_testset <- function(linear_model, dataset){
  pred_1_test = predict(linear_model, dataset)
  # Store actual price and predicted price in a data frame
  actual_fitted_test = data.frame(actual = dataset$rental, predicted = pred_1_test)
  # Mean absolute percentage difference between predicted and actual values
  abs_diff_test = mean(abs(actual_fitted_test$actual - actual_fitted_test$predicted) / actual_fitted_test$actual)
  # % accuracy of the prediction on the test set
  accuracy_test = 1 - abs_diff_test
  accuracy_test
}
percent <- function(x, digits = 2, format = "f", ...) {
  paste0(formatC(100 * x, format = format, digits = digits, ...), "%")
}
m1_train <- calculate_accuracy_trainset(linear_model_1_predictor, trainingset)
m1_test <- calculate_accuracy_testset(linear_model_1_predictor, testset)
m2_train <- calculate_accuracy_trainset(linear_model_2_predictor, trainingset)
m2_test <- calculate_accuracy_testset(linear_model_2_predictor, testset)
The accuracy scores of the two models on the training and test sets are as follows:
| model | train set | test set |
|-------|-----------|----------|
| model 1 (one predictor) | 58.65% | 77.03% |
| model 2 (two predictors) | 58.56% | 77.59% |
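The table above was compiled by hand from the variables computed earlier; as a sketch, it could also be rendered directly with the already-loaded knitr and kableExtra packages:
#assemble the accuracy scores into a data frame and render as a table
accuracy_table <- data.frame(
  model = c("model 1 (one predictor)", "model 2 (two predictors)"),
  train_set = c(percent(m1_train), percent(m2_train)),
  test_set = c(percent(m1_test), percent(m2_test))
)
kable(accuracy_table) %>% kable_styling()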
Next, we will compute the Kendall’s tau correlation for the predictions of the two models above, to check how well the rankings of the predicted values correlate with the actual target values in the test set.
kt_for_model1_predictor <- cor(prediction_model1,testset$rental,method = "kendall")
kt_for_model2_predictor <- cor(prediction_model2,testset$rental,method = "kendall")
Let’s print out the scores:
| model | Kendall’s Tau |
|-------|---------------|
| model 1 (one predictor) | 47.95% |
| model 2 (two predictors) | 71.26% |
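These values can also be printed directly from the computed variables using the percent() helper defined earlier:
sprintf("Kendall's tau of model 1 predictions: %s", percent(kt_for_model1_predictor))
sprintf("Kendall's tau of model 2 predictions: %s", percent(kt_for_model2_predictor))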
Based on the Kendall’s tau correlations, we can conclude that the model with two predictors performs better.
For significance testing, we run the models 50 times. Each time, we perform predictions on the test set and compute the Kendall’s tau correlation. The Kendall’s tau correlations for the two models are kept in two separate data frames.
dataframe_prediction_one_predictor <- data.frame(prediction_one_predictor = double())
dataframe_prediction_two_predictor <- data.frame(prediction_two_predictor = double())
for (i in 1:50){
  linear_model_1_predictor <- lm(rental ~ bathrooms, data = trainingset)
  linear_model_2_predictor <- lm(rental ~ pricePerSizeUnitBuiltUp + bathrooms, data = trainingset)
  prediction_for_model_with_1_predictor <- predict(linear_model_1_predictor, testset)
  prediction_for_model_with_2_predictors <- predict(linear_model_2_predictor, testset)
  #Compute the Kendall's tau correlations for the predictions of the models
  newRow_1_predictor <- data.frame(prediction_one_predictor = cor(prediction_for_model_with_1_predictor, testset$rental, method = "kendall"))
  dataframe_prediction_one_predictor <- rbind(dataframe_prediction_one_predictor, newRow_1_predictor)
  newRow_2_predictor <- data.frame(prediction_two_predictor = cor(prediction_for_model_with_2_predictors, testset$rental, method = "kendall"))
  dataframe_prediction_two_predictor <- rbind(dataframe_prediction_two_predictor, newRow_2_predictor)
}
Next, we conduct a nonparametric test, because the sample size is small.
qqnorm(dataframe_prediction_two_predictor$prediction_two_predictor, pch = 1, frame = FALSE)
qqline(dataframe_prediction_two_predictor$prediction_two_predictor, col = "steelblue", lwd = 2)
A nonparametric test is chosen because the Q-Q plot shows that the data are not normally distributed. Next, we perform the Wilcoxon rank sum test for significance using both data frames.
wilcox.test(dataframe_prediction_one_predictor$prediction_one_predictor,dataframe_prediction_two_predictor$prediction_two_predictor,exact = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: dataframe_prediction_one_predictor$prediction_one_predictor and dataframe_prediction_two_predictor$prediction_two_predictor
## W = 0, p-value <2e-16
## alternative hypothesis: true location shift is not equal to 0
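Since the p-value is far below 0.05, we reject the null hypothesis: the difference in Kendall’s tau between the two models is statistically significant, which supports our conclusion that the model with two predictors performs better.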