By:
Wendy Chong Pooi Mun, S2019483
Soong Woei Meng, 17147287
Introduction:
The real estate sector is highly subject to fluctuations in prices due to its correlations with numerous variables. Property prices may increase rapidly, or in some cases may also fall quickly, yet the online sites of property listings are not likely to be updated on a frequent basis. When the listed price of a property is below its actual market price by a certain magnitude, we are presented with a profitable investment opportunity.
Contents:
1. Initial Dataset
2. Problem Statement
3. Pre-processing
4. Exploratory Data Analysis
4.1 Identifying Outliers
4.2 Bivariate Correlation Matrix
4.3 Correlation-Scatter Plot
4.4 Frequency of Listings based on Districts
5. Regression
6. Classification
7. Conclusion
1. Initial Dataset:
Title: Property Listings in Kuala Lumpur
Year: July 2019
Purpose of Dataset: To provide data for comparison between properties in Kuala Lumpur
Dimension: Num.Rows:52,604 Num.Columns:8
Source: https://www.kaggle.com/dragonduck/property-listings-in-kuala-lumpur
library(readr)
library(knitr)
DF <- read.csv(file = 'data.csv', header = TRUE) #import data csv
DF <- as.data.frame(DF)
head(DF)
## ï..Location Price Rooms Bathrooms Car.Parks
## 1 KLCC, Kuala Lumpur RM 1,250,000 2+1 3 2
## 2 Damansara Heights, Kuala Lumpur RM 6,800,000 6 7 NA
## 3 Dutamas, Kuala Lumpur RM 1,030,000 3 4 2
## 4 Bukit Jalil, Kuala Lumpur RM 900,000 4+1 3 2
## 5 Taman Tun Dr Ismail, Kuala Lumpur RM 5,350,000 4+2 5 4
## 6 Taman Tun Dr Ismail, Kuala Lumpur RM 2,600,000 5 4 4
## Property.Type Size Furnishing
## 1 Serviced Residence Built-up : 1,335 sq. ft. Fully Furnished
## 2 Bungalow Land area : 6900 sq. ft. Partly Furnished
## 3 Condominium (Corner) Built-up : 1,875 sq. ft. Partly Furnished
## 4 Condominium (Corner) Built-up : 1,513 sq. ft. Partly Furnished
## 5 Bungalow Land area : 7200 sq. ft. Partly Furnished
## 6 Semi-detached House Land area : 3600 sq. ft. Partly Furnished
dim(DF) #Dimension: 52,604 rows and 8 variables
## [1] 52604 8
str(DF) #To access structure of DF
## 'data.frame': 52604 obs. of 8 variables:
## $ ï..Location : chr "KLCC, Kuala Lumpur" "Damansara Heights, Kuala Lumpur" "Dutamas, Kuala Lumpur" "Bukit Jalil, Kuala Lumpur" ...
## $ Price : chr "RM 1,250,000" "RM 6,800,000" "RM 1,030,000" "RM 900,000" ...
## $ Rooms : chr "2+1" "6" "3" "4+1" ...
## $ Bathrooms : int 3 7 4 3 5 4 4 2 3 3 ...
## $ Car.Parks : int 2 NA 2 2 4 4 3 1 NA NA ...
## $ Property.Type: chr "Serviced Residence" "Bungalow" "Condominium (Corner)" "Condominium (Corner)" ...
## $ Size : chr "Built-up : 1,335 sq. ft." "Land area : 6900 sq. ft." "Built-up : 1,875 sq. ft." "Built-up : 1,513 sq. ft." ...
## $ Furnishing : chr "Fully Furnished" "Partly Furnished" "Partly Furnished" "Partly Furnished" ...
summary(DF) #To access length and class for categorical variables; and display descriptive statistics for numerical variables
## ï..Location Price Rooms Bathrooms
## Length:52604 Length:52604 Length:52604 Min. : 1.000
## Class :character Class :character Class :character 1st Qu.: 2.000
## Mode :character Mode :character Mode :character Median : 3.000
## Mean : 3.083
## 3rd Qu.: 4.000
## Max. :20.000
## NA's :1832
## Car.Parks Property.Type Size Furnishing
## Min. : 1.000 Length:52604 Length:52604 Length:52604
## 1st Qu.: 1.000 Class :character Class :character Class :character
## Median : 2.000 Mode :character Mode :character Mode :character
## Mean : 2.012
## 3rd Qu.: 2.000
## Max. :30.000
## NA's :17085
2. Problem Statement:
Questions to be answered
1) How much is a property’s price per square feet influenced by its features (e.g. number of bedrooms, location and type of property)?
2) How to determine if a property is overvalued or undervalued given its features?
To answer question 1, we used the linear multiple regression model. Whilst for question 2, we adopted the k-nearest neighbors algorithm.
Therefore, for variables (e.g. bathrooms and size), we need to remove all strings and coerce the data into numeric data for the model. Some variables such as Furnishing will be ranked accordingly based on market convention that fully furnished often offers better value than partly furnished. Also, dummy variables will be created for the location and property type as they are nominal data. Notably, the locations will be grouped into their respective districts for practical purpose. Finally, new variables will also be created as the dependent variables.
3. Pre-processing:
#load packages to clean dataset
library(tidyr)
library(stringi)
library(stringr)
library(knitr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Price of properties:
After converting strings of the prices into numerics, we have filtered properties with a price range between RM 350,000 and RM 650,000. The rationale is that these mid-range priced properties appeal better to the majority of investors in terms of risk-return appetite.
#Convert strings of "Price" to numerical values and store in new variable "Price_RM"
DF$Price_RM <- str_remove_all(DF$Price,"[a-zA-Z ]")
DF$Price_RM <- as.numeric(str_remove_all(DF$Price_RM,"\\,"))
DF<-(DF[,(-2)])
head(DF)
## ï..Location Rooms Bathrooms Car.Parks
## 1 KLCC, Kuala Lumpur 2+1 3 2
## 2 Damansara Heights, Kuala Lumpur 6 7 NA
## 3 Dutamas, Kuala Lumpur 3 4 2
## 4 Bukit Jalil, Kuala Lumpur 4+1 3 2
## 5 Taman Tun Dr Ismail, Kuala Lumpur 4+2 5 4
## 6 Taman Tun Dr Ismail, Kuala Lumpur 5 4 4
## Property.Type Size Furnishing Price_RM
## 1 Serviced Residence Built-up : 1,335 sq. ft. Fully Furnished 1250000
## 2 Bungalow Land area : 6900 sq. ft. Partly Furnished 6800000
## 3 Condominium (Corner) Built-up : 1,875 sq. ft. Partly Furnished 1030000
## 4 Condominium (Corner) Built-up : 1,513 sq. ft. Partly Furnished 900000
## 5 Bungalow Land area : 7200 sq. ft. Partly Furnished 5350000
## 6 Semi-detached House Land area : 3600 sq. ft. Partly Furnished 2600000
dim(DF)
## [1] 52604 8
#Select Price_RM ranging between 350,000 and 650,000. Reason: Mid-range properties fit majority investors' appetite.
DF <- filter(DF, !is.na(Price_RM))
DF <- filter(DF, Price_RM > 350000)
DF <- filter(DF, Price_RM < 650000)
dim(DF) #noted observations have reduced to 12,839 from 52,604
## [1] 12839 8
Size of properties:
For the size of properties, in converting to numerics from strings, we have also performed standardisation to the elements displayed by using mainly the function separate().
#Convert strings of "Size" to numerical values and store in new variable "Size_sqft"
DF$Size_sqft <- str_remove_all(DF$Size,"\\,")
DF$Size_sqft <- str_remove_all(DF$Size_sqft,"\\.")
DF$Size_sqft <- str_remove_all(DF$Size_sqft,"\\:")
DF$Size_sqft <- str_remove_all(DF$Size_sqft,"\\-")
DF$Size_sqft <- str_remove_all(DF$Size_sqft,"[a-wA-W ]")
DF$Size_sqft <- str_remove_all(DF$Size_sqft,"[y-zY-Z ]")
#Next, we need to standardise the display formats of "Size_sqft". There are some values presented in text format of length X width, which we will standardise by showing only the numeric output after multiplication.
DF$Size_sqft <- str_replace(DF$Size_sqft,"[x-xX-X ]","X") #to standardise the separator
DF <- DF %>%
separate(Size_sqft,c("first","second"),sep="X")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 12009 rows [1, 2,
## 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, ...].
DF$second <- replace(DF$second, is.na(DF$second), 1)
DF <- DF[!(DF$first=="0"), ] #remove missing values & errors, rationale: our dependent variable pertains to price per square feet
DF <- DF[!(DF$first=="1"), ]
DF <- DF[!(DF$first=="/"), ]
DF <- DF[!(DF$first=="`1760"), ]
DF <- DF[!(DF$first==""), ]
DF <- transform(DF, first = as.numeric(first))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
typeof(DF$first)
## [1] "double"
DF <- DF[- grep("\\+", DF$second),]
DF <- DF[!(DF$second==""), ]
DF <- DF[!(DF$second=="0"), ]
DF <- transform(DF, second = as.numeric(second))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
head(DF)
## ï..Location Rooms Bathrooms Car.Parks
## 1 Sri Petaling, Kuala Lumpur 3 2 1
## 2 Bangsar South, Kuala Lumpur 1 1 1
## 3 Bukit Jalil, Kuala Lumpur 3+1 2 2
## 4 Kepong, Kuala Lumpur 3 2 NA
## 5 Ampang Hilir, Kuala Lumpur 1 1 1
## 6 Setapak, Kuala Lumpur 4+1 2 2
## Property.Type Size
## 1 Apartment (Intermediate) Built-up : 904 sq. ft.
## 2 Serviced Residence Built-up : 520 sq. ft.
## 3 Condominium (Intermediate) Built-up : 1,236 sq. ft.
## 4 2-sty Terrace/Link House (Intermediate) Land area : 16x55 sq. ft.
## 5 Serviced Residence Built-up : 613 sq. ft.
## 6 Condominium (Corner) Built-up : 1,550 sq. ft.
## Furnishing Price_RM first second
## 1 Partly Furnished 385000 904 1
## 2 Fully Furnished 490000 520 1
## 3 Partly Furnished 610000 1236 1
## 4 Partly Furnished 560000 16 55
## 5 Fully Furnished 460000 613 1
## 6 Partly Furnished 580000 1550 1
Size_TotalSqft = DF$first * DF$second #to multiply length X width in sqft, and store products in new variable "Size_TotalSqft"
DF=cbind(DF,Size_TotalSqft) #to bind new variable "Size_TotalSqft" with DF
DF<-(DF[,-c(6,9,10)]) #Remove variables which are no longer required
head(DF)
## ï..Location Rooms Bathrooms Car.Parks
## 1 Sri Petaling, Kuala Lumpur 3 2 1
## 2 Bangsar South, Kuala Lumpur 1 1 1
## 3 Bukit Jalil, Kuala Lumpur 3+1 2 2
## 4 Kepong, Kuala Lumpur 3 2 NA
## 5 Ampang Hilir, Kuala Lumpur 1 1 1
## 6 Setapak, Kuala Lumpur 4+1 2 2
## Property.Type Furnishing Price_RM
## 1 Apartment (Intermediate) Partly Furnished 385000
## 2 Serviced Residence Fully Furnished 490000
## 3 Condominium (Intermediate) Partly Furnished 610000
## 4 2-sty Terrace/Link House (Intermediate) Partly Furnished 560000
## 5 Serviced Residence Fully Furnished 460000
## 6 Condominium (Corner) Partly Furnished 580000
## Size_TotalSqft
## 1 904
## 2 520
## 3 1236
## 4 880
## 5 613
## 6 1550
Our interest is in the price of property per square feet. Hence, we have created the new variable “Price_persqft_RM” as follow:
Price_persqft_RM = DF$Price_RM / DF$Size_TotalSqft #
DF=cbind(DF,Price_persqft_RM)
head(DF)
## ï..Location Rooms Bathrooms Car.Parks
## 1 Sri Petaling, Kuala Lumpur 3 2 1
## 2 Bangsar South, Kuala Lumpur 1 1 1
## 3 Bukit Jalil, Kuala Lumpur 3+1 2 2
## 4 Kepong, Kuala Lumpur 3 2 NA
## 5 Ampang Hilir, Kuala Lumpur 1 1 1
## 6 Setapak, Kuala Lumpur 4+1 2 2
## Property.Type Furnishing Price_RM
## 1 Apartment (Intermediate) Partly Furnished 385000
## 2 Serviced Residence Fully Furnished 490000
## 3 Condominium (Intermediate) Partly Furnished 610000
## 4 2-sty Terrace/Link House (Intermediate) Partly Furnished 560000
## 5 Serviced Residence Fully Furnished 460000
## 6 Condominium (Corner) Partly Furnished 580000
## Size_TotalSqft Price_persqft_RM
## 1 904 425.8850
## 2 520 942.3077
## 3 1236 493.5275
## 4 880 636.3636
## 5 613 750.4078
## 6 1550 374.1935
DF<-(DF[,-c(7)]) #remove the variable of "Price_RM"
head(DF)
## ï..Location Rooms Bathrooms Car.Parks
## 1 Sri Petaling, Kuala Lumpur 3 2 1
## 2 Bangsar South, Kuala Lumpur 1 1 1
## 3 Bukit Jalil, Kuala Lumpur 3+1 2 2
## 4 Kepong, Kuala Lumpur 3 2 NA
## 5 Ampang Hilir, Kuala Lumpur 1 1 1
## 6 Setapak, Kuala Lumpur 4+1 2 2
## Property.Type Furnishing Size_TotalSqft
## 1 Apartment (Intermediate) Partly Furnished 904
## 2 Serviced Residence Fully Furnished 520
## 3 Condominium (Intermediate) Partly Furnished 1236
## 4 2-sty Terrace/Link House (Intermediate) Partly Furnished 880
## 5 Serviced Residence Fully Furnished 613
## 6 Condominium (Corner) Partly Furnished 1550
## Price_persqft_RM
## 1 425.8850
## 2 942.3077
## 3 493.5275
## 4 636.3636
## 5 750.4078
## 6 374.1935
dim(DF)
## [1] 12806 8
DF<-filter(DF,!is.na(DF$Price_persqft_RM)) #remove missing values, rationale: our dependent variable pertains to price
dim(DF)
## [1] 12789 8
Property Locations:
Here, we have mainly processed the data by classifying the 54 locations into 11 districts.
#Since all the listings are for properties in Kuala Lumpur, we can remove the string ", Kuala Lumpur"
DF$ï..Location <- str_remove_all(DF$ï..Location,", Kuala Lumpur")
head(DF)
## ï..Location Rooms Bathrooms Car.Parks
## 1 Sri Petaling 3 2 1
## 2 Bangsar South 1 1 1
## 3 Bukit Jalil 3+1 2 2
## 4 Kepong 3 2 NA
## 5 Ampang Hilir 1 1 1
## 6 Setapak 4+1 2 2
## Property.Type Furnishing Size_TotalSqft
## 1 Apartment (Intermediate) Partly Furnished 904
## 2 Serviced Residence Fully Furnished 520
## 3 Condominium (Intermediate) Partly Furnished 1236
## 4 2-sty Terrace/Link House (Intermediate) Partly Furnished 880
## 5 Serviced Residence Fully Furnished 613
## 6 Condominium (Corner) Partly Furnished 1550
## Price_persqft_RM
## 1 425.8850
## 2 942.3077
## 3 493.5275
## 4 636.3636
## 5 750.4078
## 6 374.1935
#Classify locations into districts to ease analysis
Locations_Unq<-unique(DF$ï..Location) #to remove duplicated values from Location and store in new vector
write.csv(Locations_Unq, "Locations_Unq.csv") #to export the new vector into csv file for us to classify into districts
#Import csv file containing districts which correspond to each location
Districts<-read.csv(file = 'Districts.csv', header = TRUE)
head(Districts)
## ï..x Districts Bukit.Bintang Titiwangsa Wangsa.Maju Batu Kepong
## 1 Sri Petaling Seputeh 0 0 0 0 0
## 2 Bangsar South Lembah Pantai 0 0 0 0 0
## 3 Bukit Jalil Seputeh 0 0 0 0 0
## 4 Kepong Kepong 0 0 0 0 1
## 5 Ampang Hilir Titiwangsa 0 1 0 0 0
## 6 Setapak Segambut North 0 0 0 0 0
## Segambut.South Segambut.North Lembah.Pantai Seputeh Bandar.Tun.Razak Cheras
## 1 0 0 0 1 0 0
## 2 0 0 1 0 0 0
## 3 0 0 0 1 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 1 0 0 0 0
#Merge the data frame of "Districts" with "DF"
DF<-left_join(DF, Districts, by = c("ï..Location" = "ï..x"),all.x=TRUE)
head(DF)
## ï..Location Rooms Bathrooms Car.Parks
## 1 Sri Petaling 3 2 1
## 2 Bangsar South 1 1 1
## 3 Bukit Jalil 3+1 2 2
## 4 Kepong 3 2 NA
## 5 Ampang Hilir 1 1 1
## 6 Setapak 4+1 2 2
## Property.Type Furnishing Size_TotalSqft
## 1 Apartment (Intermediate) Partly Furnished 904
## 2 Serviced Residence Fully Furnished 520
## 3 Condominium (Intermediate) Partly Furnished 1236
## 4 2-sty Terrace/Link House (Intermediate) Partly Furnished 880
## 5 Serviced Residence Fully Furnished 613
## 6 Condominium (Corner) Partly Furnished 1550
## Price_persqft_RM Districts Bukit.Bintang Titiwangsa Wangsa.Maju Batu
## 1 425.8850 Seputeh 0 0 0 0
## 2 942.3077 Lembah Pantai 0 0 0 0
## 3 493.5275 Seputeh 0 0 0 0
## 4 636.3636 Kepong 0 0 0 0
## 5 750.4078 Titiwangsa 0 1 0 0
## 6 374.1935 Segambut North 0 0 0 0
## Kepong Segambut.South Segambut.North Lembah.Pantai Seputeh Bandar.Tun.Razak
## 1 0 0 0 0 1 0
## 2 0 0 0 1 0 0
## 3 0 0 0 0 1 0
## 4 1 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 1 0 0 0
## Cheras
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
#Remove Locations variable
DF<-(DF[,-c(1)])
head(DF) #instead of having 54 Locations per initial dataset, we now only need to focus on 11 districts
## Rooms Bathrooms Car.Parks Property.Type
## 1 3 2 1 Apartment (Intermediate)
## 2 1 1 1 Serviced Residence
## 3 3+1 2 2 Condominium (Intermediate)
## 4 3 2 NA 2-sty Terrace/Link House (Intermediate)
## 5 1 1 1 Serviced Residence
## 6 4+1 2 2 Condominium (Corner)
## Furnishing Size_TotalSqft Price_persqft_RM Districts Bukit.Bintang
## 1 Partly Furnished 904 425.8850 Seputeh 0
## 2 Fully Furnished 520 942.3077 Lembah Pantai 0
## 3 Partly Furnished 1236 493.5275 Seputeh 0
## 4 Partly Furnished 880 636.3636 Kepong 0
## 5 Fully Furnished 613 750.4078 Titiwangsa 0
## 6 Partly Furnished 1550 374.1935 Segambut North 0
## Titiwangsa Wangsa.Maju Batu Kepong Segambut.South Segambut.North
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 1 0 0
## 5 1 0 0 0 0 0
## 6 0 0 0 0 0 1
## Lembah.Pantai Seputeh Bandar.Tun.Razak Cheras
## 1 0 1 0 0
## 2 1 0 0 0
## 3 0 1 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
To address the missing values
To imputate the missing values for number of bedrooms, bathrooms and car parks, we used the linear regression model.
DF <- DF %>%
separate(Rooms,c("Bedrooms","second"),sep="\\+")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 11159 rows [1, 2,
## 4, 5, 7, 8, 11, 12, 13, 14, 17, 18, 19, 20, 22, 24, 26, 27, 28, 29, ...].
DF$Bedrooms <- replace(DF$Bedrooms, DF$Bedrooms=="Studio", 1)
DF$Bedrooms <- as.numeric(DF$Bedrooms)
DF$second<-replace(DF$second, is.na(DF$second), 0)
DF$second<-replace(DF$second, DF$second == 1, 0.5)
DF$second<-replace(DF$second, DF$second == 2, 1)
DF$second<-as.numeric(DF$second)
DF$Number_Bedrooms <- DF$Bedrooms + DF$second
head(DF)
## Bedrooms second Bathrooms Car.Parks Property.Type
## 1 3 0.0 2 1 Apartment (Intermediate)
## 2 1 0.0 1 1 Serviced Residence
## 3 3 0.5 2 2 Condominium (Intermediate)
## 4 3 0.0 2 NA 2-sty Terrace/Link House (Intermediate)
## 5 1 0.0 1 1 Serviced Residence
## 6 4 0.5 2 2 Condominium (Corner)
## Furnishing Size_TotalSqft Price_persqft_RM Districts Bukit.Bintang
## 1 Partly Furnished 904 425.8850 Seputeh 0
## 2 Fully Furnished 520 942.3077 Lembah Pantai 0
## 3 Partly Furnished 1236 493.5275 Seputeh 0
## 4 Partly Furnished 880 636.3636 Kepong 0
## 5 Fully Furnished 613 750.4078 Titiwangsa 0
## 6 Partly Furnished 1550 374.1935 Segambut North 0
## Titiwangsa Wangsa.Maju Batu Kepong Segambut.South Segambut.North
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 1 0 0
## 5 1 0 0 0 0 0
## 6 0 0 0 0 0 1
## Lembah.Pantai Seputeh Bandar.Tun.Razak Cheras Number_Bedrooms
## 1 0 1 0 0 3.0
## 2 1 0 0 0 1.0
## 3 0 1 0 0 3.5
## 4 0 0 0 0 3.0
## 5 0 0 0 0 1.0
## 6 0 0 0 0 4.5
DF <- DF[,-c(1,2)]
head(DF)
## Bathrooms Car.Parks Property.Type Furnishing
## 1 2 1 Apartment (Intermediate) Partly Furnished
## 2 1 1 Serviced Residence Fully Furnished
## 3 2 2 Condominium (Intermediate) Partly Furnished
## 4 2 NA 2-sty Terrace/Link House (Intermediate) Partly Furnished
## 5 1 1 Serviced Residence Fully Furnished
## 6 2 2 Condominium (Corner) Partly Furnished
## Size_TotalSqft Price_persqft_RM Districts Bukit.Bintang Titiwangsa
## 1 904 425.8850 Seputeh 0 0
## 2 520 942.3077 Lembah Pantai 0 0
## 3 1236 493.5275 Seputeh 0 0
## 4 880 636.3636 Kepong 0 0
## 5 613 750.4078 Titiwangsa 0 1
## 6 1550 374.1935 Segambut North 0 0
## Wangsa.Maju Batu Kepong Segambut.South Segambut.North Lembah.Pantai Seputeh
## 1 0 0 0 0 0 0 1
## 2 0 0 0 0 0 1 0
## 3 0 0 0 0 0 0 1
## 4 0 0 1 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 1 0 0
## Bandar.Tun.Razak Cheras Number_Bedrooms
## 1 0 0 3.0
## 2 0 0 1.0
## 3 0 0 3.5
## 4 0 0 3.0
## 5 0 0 1.0
## 6 0 0 4.5
#For missing values in Number_Bedrooms, we imputated them using simple linear regression method
DF$Number_Bedrooms<-replace(DF$Number_Bedrooms, is.na(DF$Number_Bedrooms), 0)
lm(Number_Bedrooms ~ Size_TotalSqft, DF) #regress y=Number_Bedrooms against x=Size_TotalSqft
##
## Call:
## lm(formula = Number_Bedrooms ~ Size_TotalSqft, data = DF)
##
## Coefficients:
## (Intercept) Size_TotalSqft
## 2.768e+00 3.050e-09
for(i in 1:nrow(DF))
{
if(DF$Number_Bedrooms[i] == 0)
{
DF$Number_Bedrooms[i] = 2.768e+00 + 3.050e-09 *DF$Size_TotalSqft[i]
}
}
#For missing values in Bathrooms, we imputated them using simple linear regression method
DF$Bathrooms <- replace(DF$Bathrooms, is.na(DF$Bathrooms), 0)
lm(Bathrooms ~ Size_TotalSqft, DF) #y=Bathrooms against x=Size_TotalSqft
##
## Call:
## lm(formula = Bathrooms ~ Size_TotalSqft, data = DF)
##
## Coefficients:
## (Intercept) Size_TotalSqft
## 1.967e+00 2.121e-08
for(i in 1:nrow(DF))
{
if(DF$Bathrooms[i] == 0)
{
DF$Bathrooms[i] = 1.967e+00 + 2.121e-08 *DF$Size_TotalSqft[i]
}
}
DF$Car.Parks <- replace(DF$Car.Parks, is.na(DF$Car.Parks), 0)
lm(Car.Parks ~ Size_TotalSqft, DF)
##
## Call:
## lm(formula = Car.Parks ~ Size_TotalSqft, data = DF)
##
## Coefficients:
## (Intercept) Size_TotalSqft
## 1.148e+00 7.843e-08
for(i in 1:nrow(DF))
{
if(DF$Car.Parks[i] == 0)
{
DF$Car.Parks[i] = 1.148e+00 + 7.843e-08 *DF$Size_TotalSqft[i]
}
}
head(DF)
## Bathrooms Car.Parks Property.Type Furnishing
## 1 2 1.000000 Apartment (Intermediate) Partly Furnished
## 2 1 1.000000 Serviced Residence Fully Furnished
## 3 2 2.000000 Condominium (Intermediate) Partly Furnished
## 4 2 1.148069 2-sty Terrace/Link House (Intermediate) Partly Furnished
## 5 1 1.000000 Serviced Residence Fully Furnished
## 6 2 2.000000 Condominium (Corner) Partly Furnished
## Size_TotalSqft Price_persqft_RM Districts Bukit.Bintang Titiwangsa
## 1 904 425.8850 Seputeh 0 0
## 2 520 942.3077 Lembah Pantai 0 0
## 3 1236 493.5275 Seputeh 0 0
## 4 880 636.3636 Kepong 0 0
## 5 613 750.4078 Titiwangsa 0 1
## 6 1550 374.1935 Segambut North 0 0
## Wangsa.Maju Batu Kepong Segambut.South Segambut.North Lembah.Pantai Seputeh
## 1 0 0 0 0 0 0 1
## 2 0 0 0 0 0 1 0
## 3 0 0 0 0 0 0 1
## 4 0 0 1 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 1 0 0
## Bandar.Tun.Razak Cheras Number_Bedrooms
## 1 0 0 3.0
## 2 0 0 1.0
## 3 0 0 3.5
## 4 0 0 3.0
## 5 0 0 1.0
## 6 0 0 4.5
We ranked the furnishing as follow:
DF$Furnishing_Rank <- 1
DF$Furnishing_Rank[DF$Furnishing == "Partly Furnished"] <- 2
DF$Furnishing_Rank[DF$Furnishing == "Fully Furnished"] <- 3
DF <- DF[,-4]
head(DF)
## Bathrooms Car.Parks Property.Type Size_TotalSqft
## 1 2 1.000000 Apartment (Intermediate) 904
## 2 1 1.000000 Serviced Residence 520
## 3 2 2.000000 Condominium (Intermediate) 1236
## 4 2 1.148069 2-sty Terrace/Link House (Intermediate) 880
## 5 1 1.000000 Serviced Residence 613
## 6 2 2.000000 Condominium (Corner) 1550
## Price_persqft_RM Districts Bukit.Bintang Titiwangsa Wangsa.Maju Batu
## 1 425.8850 Seputeh 0 0 0 0
## 2 942.3077 Lembah Pantai 0 0 0 0
## 3 493.5275 Seputeh 0 0 0 0
## 4 636.3636 Kepong 0 0 0 0
## 5 750.4078 Titiwangsa 0 1 0 0
## 6 374.1935 Segambut North 0 0 0 0
## Kepong Segambut.South Segambut.North Lembah.Pantai Seputeh Bandar.Tun.Razak
## 1 0 0 0 0 1 0
## 2 0 0 0 1 0 0
## 3 0 0 0 0 1 0
## 4 1 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 1 0 0 0
## Cheras Number_Bedrooms Furnishing_Rank
## 1 0 3.0 2
## 2 0 1.0 3
## 3 0 3.5 2
## 4 0 3.0 2
## 5 0 1.0 3
## 6 0 4.5 2
For property types, we converted into dummy variables."
DF[grep("Terrace",DF$Property.Type,value = F),"Terrace"] <- as.numeric(1)
DF[grep("Condominium",DF$Property.Type,value = F),"Condominium"] <- as.numeric(1)
DF[grep("Serviced Residence",DF$Property.Type,value = F),"Serviced Residence"] <- as.numeric(1)
DF[grep("Townhouse",DF$Property.Type,value = F),"Townhouse"] <- as.numeric(1)
DF$Terrace <- replace(DF$Terrace, is.na(DF$Terrace), 0)
DF$Condominium <- replace(DF$Condominium, is.na(DF$Condominium), 0)
DF$`Serviced Residence` <- replace(DF$`Serviced Residence`, is.na(DF$`Serviced Residence`), 0)
DF$Townhouse <- replace(DF$Townhouse, is.na(DF$Townhouse), 0)
DF <- DF[,-3]
head(DF)
## Bathrooms Car.Parks Size_TotalSqft Price_persqft_RM Districts
## 1 2 1.000000 904 425.8850 Seputeh
## 2 1 1.000000 520 942.3077 Lembah Pantai
## 3 2 2.000000 1236 493.5275 Seputeh
## 4 2 1.148069 880 636.3636 Kepong
## 5 1 1.000000 613 750.4078 Titiwangsa
## 6 2 2.000000 1550 374.1935 Segambut North
## Bukit.Bintang Titiwangsa Wangsa.Maju Batu Kepong Segambut.South
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 1 0
## 5 0 1 0 0 0 0
## 6 0 0 0 0 0 0
## Segambut.North Lembah.Pantai Seputeh Bandar.Tun.Razak Cheras Number_Bedrooms
## 1 0 0 1 0 0 3.0
## 2 0 1 0 0 0 1.0
## 3 0 0 1 0 0 3.5
## 4 0 0 0 0 0 3.0
## 5 0 0 0 0 0 1.0
## 6 1 0 0 0 0 4.5
## Furnishing_Rank Terrace Condominium Serviced Residence Townhouse
## 1 2 0 0 0 0
## 2 3 0 0 1 0
## 3 2 0 1 0 0
## 4 2 1 0 0 0
## 5 3 0 0 1 0
## 6 2 0 1 0 0
Lastly, we renamed some variables.
names(DF)[names(DF)=="Bathrooms"]<-"Number_Bathrooms"
names(DF)[names(DF)=="Car.Parks"]<-"Number_CarParks"
Putting together all into a cleaned dataset:
sum(is.na(DF))
## [1] 0
str(DF)
## 'data.frame': 12789 obs. of 22 variables:
## $ Number_Bathrooms : num 2 1 2 2 1 2 2 2 1 1 ...
## $ Number_CarParks : num 1 1 2 1.15 1 ...
## $ Size_TotalSqft : num 904 520 1236 880 613 ...
## $ Price_persqft_RM : num 426 942 494 636 750 ...
## $ Districts : chr "Seputeh" "Lembah Pantai" "Seputeh" "Kepong" ...
## $ Bukit.Bintang : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Titiwangsa : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Wangsa.Maju : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Batu : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Kepong : int 0 0 0 1 0 0 0 0 0 0 ...
## $ Segambut.South : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Segambut.North : int 0 0 0 0 0 1 0 0 0 0 ...
## $ Lembah.Pantai : int 0 1 0 0 0 0 0 0 1 1 ...
## $ Seputeh : int 1 0 1 0 0 0 1 1 0 0 ...
## $ Bandar.Tun.Razak : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Cheras : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Number_Bedrooms : num 3 1 3.5 3 1 4.5 3 2 1.5 1.5 ...
## $ Furnishing_Rank : num 2 3 2 2 3 2 2 2 2 3 ...
## $ Terrace : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Condominium : num 0 0 1 0 0 1 1 0 0 0 ...
## $ Serviced Residence: num 0 1 0 0 1 0 0 1 1 1 ...
## $ Townhouse : num 0 0 0 0 0 0 0 0 0 0 ...
head(DF)
## Number_Bathrooms Number_CarParks Size_TotalSqft Price_persqft_RM
## 1 2 1.000000 904 425.8850
## 2 1 1.000000 520 942.3077
## 3 2 2.000000 1236 493.5275
## 4 2 1.148069 880 636.3636
## 5 1 1.000000 613 750.4078
## 6 2 2.000000 1550 374.1935
## Districts Bukit.Bintang Titiwangsa Wangsa.Maju Batu Kepong
## 1 Seputeh 0 0 0 0 0
## 2 Lembah Pantai 0 0 0 0 0
## 3 Seputeh 0 0 0 0 0
## 4 Kepong 0 0 0 0 1
## 5 Titiwangsa 0 1 0 0 0
## 6 Segambut North 0 0 0 0 0
## Segambut.South Segambut.North Lembah.Pantai Seputeh Bandar.Tun.Razak Cheras
## 1 0 0 0 1 0 0
## 2 0 0 1 0 0 0
## 3 0 0 0 1 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 1 0 0 0 0
## Number_Bedrooms Furnishing_Rank Terrace Condominium Serviced Residence
## 1 3.0 2 0 0 0
## 2 1.0 3 0 0 1
## 3 3.5 2 0 1 0
## 4 3.0 2 1 0 0
## 5 1.0 3 0 0 1
## 6 4.5 2 0 1 0
## Townhouse
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
4. Exploratory Data Analysis
First, we detected the outliers in Price per sqft based on Districts. After accessing the values of the outliers, we removed them as the values are too low (ie. below RM 10 per sqft) and hence are assumed to be errors in the initial dataset.
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(lattice)
library(corrplot)
## corrplot 0.84 loaded
#4.1 Detect outliers in Price per sqft based on Districts
boxplot(Price_persqft_RM ~ Districts, DF, xlab = "Districts", ylab = "Price_persqft(RM)") #view initial boxplot
outliers<-boxplot(Price_persqft_RM ~ Districts, DF, xlab = "Districts", ylab = "Price_persqft(RM)")$out #identify outliers and save in new variable "outliers"
dim(DF) #display dimension before removing outliers, ie. 12,789 observations
## [1] 12789 22
DF <- DF[-which(DF$Price_persqft_RM %in% outliers),] #remove outliers from dataset
dim(DF) #after removing outliers, we now have 12,220 observations
## [1] 12220 22
boxplot(Price_persqft_RM ~ Districts, DF, xlab = "Districts", ylab = "Price_persqft(RM)") #to display boxplot after outliers were eliminated
#4.2 Bivariate correlation matrix
head(DF)
## Number_Bathrooms Number_CarParks Size_TotalSqft Price_persqft_RM
## 1 2 1.000000 904 425.8850
## 2 1 1.000000 520 942.3077
## 3 2 2.000000 1236 493.5275
## 4 2 1.148069 880 636.3636
## 5 1 1.000000 613 750.4078
## 6 2 2.000000 1550 374.1935
## Districts Bukit.Bintang Titiwangsa Wangsa.Maju Batu Kepong
## 1 Seputeh 0 0 0 0 0
## 2 Lembah Pantai 0 0 0 0 0
## 3 Seputeh 0 0 0 0 0
## 4 Kepong 0 0 0 0 1
## 5 Titiwangsa 0 1 0 0 0
## 6 Segambut North 0 0 0 0 0
## Segambut.South Segambut.North Lembah.Pantai Seputeh Bandar.Tun.Razak Cheras
## 1 0 0 0 1 0 0
## 2 0 0 1 0 0 0
## 3 0 0 0 1 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 1 0 0 0 0
## Number_Bedrooms Furnishing_Rank Terrace Condominium Serviced Residence
## 1 3.0 2 0 0 0
## 2 1.0 3 0 0 1
## 3 3.5 2 0 1 0
## 4 3.0 2 1 0 0
## 5 1.0 3 0 0 1
## 6 4.5 2 0 1 0
## Townhouse
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
DF[,c(1)]<-as.numeric(DF[,c(1)])
DF[,c(2)]<-as.numeric(DF[,c(2)])
DF[,c(3)]<-as.numeric(DF[,c(3)])
DF[,c(4)]<-as.numeric(DF[,c(4)])
DF[,c(17)]<-as.numeric(DF[,c(17)])
DF[,c(18)]<-as.numeric(DF[,c(18)])
correlations <- cor(DF[,c(1,2,3,17,18,4)], use="everything")
corrplot(correlations, method="circle", type="lower", sig.level = 0.01, insig = "blank")
#4.3 Correlation - Scatter Plot
ggpairs(DF[,c(1,2,3,17,18,4)])+ theme_bw()
#4.4 Frequency of Listings based on Districts
barchart(DF$Districts, #This specifies the dataset and the variable
horizontal = FALSE, #Turn the bars so they are vertical
main = "Frequency of listings based on location", #Give your chart a title
xlab = "Districts", #Label the x axis
ylab = "Frequency", #Label the y axis
col = "darkgreen") #change the color of the bars
prop<-data.frame(prop.table(table(DF$Districts)))
prop
## Var1 Freq
## 1 Bandar Tun Razak 0.04484452
## 2 Batu 0.03527005
## 3 Bukit Bintang 0.08109656
## 4 Cheras 0.14075286
## 5 Kepong 0.06440262
## 6 Lembah Pantai 0.02381342
## 7 Segambut North 0.17847791
## 8 Segambut South 0.11227496
## 9 Seputeh 0.26309329
## 10 Titiwangsa 0.02094926
## 11 Wangsa Maju 0.03502455
5.Regression:
We had split the dataset into two ie. training and test datasets, using 80:20. Next, we built the linar multiple regression model using the training dataset.
#To split dataset into training and test using 80:20
library(lattice)
library(caret)
library(dplyr)
library(klaR)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(e1071)
library(mlbench)
DF_Regression <-(DF[,-c(5,12)])
head(DF_Regression)
## Number_Bathrooms Number_CarParks Size_TotalSqft Price_persqft_RM
## 1 2 1.000000 904 425.8850
## 2 1 1.000000 520 942.3077
## 3 2 2.000000 1236 493.5275
## 4 2 1.148069 880 636.3636
## 5 1 1.000000 613 750.4078
## 6 2 2.000000 1550 374.1935
## Bukit.Bintang Titiwangsa Wangsa.Maju Batu Kepong Segambut.South Lembah.Pantai
## 1 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 1
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 1 0 0
## 5 0 1 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## Seputeh Bandar.Tun.Razak Cheras Number_Bedrooms Furnishing_Rank Terrace
## 1 1 0 0 3.0 2 0
## 2 0 0 0 1.0 3 0
## 3 1 0 0 3.5 2 0
## 4 0 0 0 3.0 2 1
## 5 0 0 0 1.0 3 0
## 6 0 0 0 4.5 2 0
## Condominium Serviced Residence Townhouse
## 1 0 0 0
## 2 0 1 0
## 3 1 0 0
## 4 0 0 0
## 5 0 1 0
## 6 1 0 0
set.seed(3333)
split <- 0.8
trainIndex_regression <- createDataPartition(DF_Regression$Price_persqft_RM, p = split, list= FALSE)
data_train_LM <- DF_Regression[trainIndex_regression, ]
data_test_LM <- DF_Regression[-trainIndex_regression,]
#Build linear multiple regression model
regression <- lm(Price_persqft_RM ~., data = data_train_LM)
summary(regression)
##
## Call:
## lm(formula = Price_persqft_RM ~ ., data = data_train_LM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -740.74 -65.48 -2.10 59.45 921.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 673.177251 8.420525 79.945 < 2e-16 ***
## Number_Bathrooms -22.572967 2.546608 -8.864 < 2e-16 ***
## Number_CarParks 17.148592 1.716103 9.993 < 2e-16 ***
## Size_TotalSqft -0.001780 0.000175 -10.173 < 2e-16 ***
## Bukit.Bintang 128.053340 4.518751 28.338 < 2e-16 ***
## Titiwangsa 40.680972 7.714082 5.274 1.37e-07 ***
## Wangsa.Maju 25.631167 5.952054 4.306 1.68e-05 ***
## Batu 38.966579 5.941624 6.558 5.72e-11 ***
## Kepong -7.833270 4.871376 -1.608 0.1079
## Segambut.South 46.841965 3.921021 11.946 < 2e-16 ***
## Lembah.Pantai 154.078347 7.245911 21.264 < 2e-16 ***
## Seputeh 16.398016 3.217326 5.097 3.52e-07 ***
## Bandar.Tun.Razak 13.270952 5.470262 2.426 0.0153 *
## Cheras 22.739328 3.701070 6.144 8.36e-10 ***
## Number_Bedrooms -92.568454 1.867058 -49.580 < 2e-16 ***
## Furnishing_Rank 11.270033 1.525214 7.389 1.60e-13 ***
## Terrace 98.986902 5.670514 17.456 < 2e-16 ***
## Condominium 38.449318 4.636902 8.292 < 2e-16 ***
## `Serviced Residence` 122.574729 4.943251 24.796 < 2e-16 ***
## Townhouse -26.420706 10.928812 -2.418 0.0156 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 100.7 on 9757 degrees of freedom
## Multiple R-squared: 0.5823, Adjusted R-squared: 0.5815
## F-statistic: 715.8 on 19 and 9757 DF, p-value: < 2.2e-16
pred_regression <- predict(regression, data_test_LM)
error <- data_test_LM$Price_persqft_RM - pred_regression
RMSE <- sqrt(mean(error**2))
RMSE
## [1] 99.8265
From the output above, most of the independent variables are statistically significant. The model is able to explain 58% (R-squared) of the variation in property’s price per square feet.
6.Classification:
#Step 1: Create a additional variable to be classified as overvalued, average or undervalued
DFML <- DF %>%
group_by(Districts) %>%
mutate(location_mean = mean(Price_persqft_RM))
#A property is considered overvalued if the price psf if 15% higher than the district mean price psf and vice versa.
#Subsequently, a property is ranked accordingly with 1 is undervalued, 2 is average valued and 3 is overvalued.
DFML$deviation <- 2
DFML$deviation[DFML$Price_persqft_RM > 1.15*DFML$location_mean] <- 3
DFML$deviation[DFML$Price_persqft_RM < 0.85*DFML$location_mean] <- 1
DFML <- DFML[,-c(3,4,5,12,23)]
#Step 2: Change variables to factor
DFML$Bukit.Bintang <- as.factor(DFML$Bukit.Bintang)
DFML$Titiwangsa <- as.factor(DFML$Titiwangsa)
DFML$Wangsa.Maju <- as.factor(DFML$Wangsa.Maju)
DFML$Batu <- as.factor(DFML$Batu)
DFML$Kepong <- as.factor(DFML$Kepong)
DFML$Segambut.South <- as.factor(DFML$Segambut.South)
DFML$Lembah.Pantai <- as.factor(DFML$Lembah.Pantai)
DFML$Seputeh <- as.factor(DFML$Seputeh)
DFML$Bandar.Tun.Razak <- as.factor(DFML$Bandar.Tun.Razak)
DFML$Cheras <- as.factor(DFML$Cheras)
DFML$Furnishing_Rank <- as.factor(DFML$Furnishing_Rank)
DFML$Terrace <- as.factor(DFML$Terrace)
DFML$Condominium <- as.factor(DFML$Condominium)
DFML$`Serviced Residence` <- as.factor(DFML$`Serviced Residence`)
DFML$Townhouse <- as.factor(DFML$Townhouse)
DFML$deviation <- as.factor(DFML$deviation)
#Step 3: Train the KNN model with 80:20 split for training =:testing and fold set at 5 to avoid overfitting.
set.seed(3333)
split <- 0.8
trainIndex <- createDataPartition(DFML$deviation, p = split, list= FALSE)
data_train <- DFML[trainIndex, ]
## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
data_test <- DFML[-trainIndex,]
trctrl <- trainControl(method = "cv", number = 5)
model <- train(deviation~., method = "knn", data = data_train, metric = "Accuracy", trControl = trctrl)
model
## k-Nearest Neighbors
##
## 9777 samples
## 18 predictor
## 3 classes: '1', '2', '3'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 7822, 7822, 7822, 7821, 7821
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6828276 0.4996129
## 7 0.6824194 0.4984215
## 9 0.6799634 0.4939164
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
#Step 4: Prediction against data_test and evaluate the result using confusion matrix.
predictions <- predict(model, data_test)
confusionMatrix(predictions,data_test$deviation)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 513 219 31
## 2 237 783 182
## 3 16 89 373
##
## Overall Statistics
##
## Accuracy : 0.6832
## 95% CI : (0.6643, 0.7016)
## No Information Rate : 0.4466
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5014
##
## Mcnemar's Test P-Value : 3.763e-08
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.6697 0.7177 0.6365
## Specificity 0.8509 0.6901 0.9435
## Pos Pred Value 0.6723 0.6514 0.7803
## Neg Pred Value 0.8494 0.7518 0.8916
## Prevalence 0.3135 0.4466 0.2399
## Detection Rate 0.2100 0.3205 0.1527
## Detection Prevalence 0.3123 0.4920 0.1957
## Balanced Accuracy 0.7603 0.7039 0.7900
Conclusion
Question 1 From our regression model, the independent variables are able to explain 58% (R-squared) of the variation in property’s price per square feet and most of the independent variables ranging from number of rooms to its district are statistically significant. Note that the number of bathrooms and bedrooms have negative coefficients which reduced the property’s price psf. The rationale is likely that the increase in size of the property outgrow the overall property’s price as developers are able to procure cheaper material cost given the extra quantity. On the districts, properties are generally priced higher than the properties located in North of Segambut, except for Kepong.
Question 2 Our KNN model generally predicted correctly 68.6% of the overall test data. As the p-value [Acc >NIR] is significant, our model is able to add value to the prediction as compared prediction with no information at all. Therefore, it is safe to say that our model is able to correctly predict an overvalued or undervalued property more than half of the time given its district, type of property and additional features as input.