By:
Wendy Chong Pooi Mun, S2019483
Soong Woei Meng, 17147287



Introduction:
The real estate sector is highly subject to fluctuations in prices due to its correlations with numerous variables. Property prices may increase rapidly, or in some cases may also fall quickly, yet the online sites of property listings are not likely to be updated on a frequent basis. When the listed price of a property is below its actual market price by a certain magnitude, we are presented with a profitable investment opportunity.

Contents:

1. Initial Dataset

2. Problem Statement

3. Pre-processing

4. Exploratory Data Analysis
4.1 Identifying Outliers
4.2 Bivariate Correlation Matrix
4.3 Correlation-Scatter Plot
4.4 Frequency of Listings based on Districts

5. Regression

6. Classification

7. Conclusion


1. Initial Dataset:

Title: Property Listings in Kuala Lumpur
Year: July 2019
Purpose of Dataset: To provide data for comparison between properties in Kuala Lumpur
Dimension: Num.Rows:52,604 Num.Columns:8
Source: https://www.kaggle.com/dragonduck/property-listings-in-kuala-lumpur


library(readr)
library(knitr)
DF <- read.csv(file = 'data.csv', header = TRUE) #import data csv
DF <- as.data.frame(DF)
head(DF) 
##                         ï..Location        Price Rooms Bathrooms Car.Parks
## 1                KLCC, Kuala Lumpur RM 1,250,000   2+1         3         2
## 2   Damansara Heights, Kuala Lumpur RM 6,800,000     6         7        NA
## 3             Dutamas, Kuala Lumpur RM 1,030,000     3         4         2
## 4         Bukit Jalil, Kuala Lumpur   RM 900,000   4+1         3         2
## 5 Taman Tun Dr Ismail, Kuala Lumpur RM 5,350,000   4+2         5         4
## 6 Taman Tun Dr Ismail, Kuala Lumpur RM 2,600,000     5         4         4
##          Property.Type                     Size       Furnishing
## 1   Serviced Residence Built-up : 1,335 sq. ft.  Fully Furnished
## 2             Bungalow Land area : 6900 sq. ft. Partly Furnished
## 3 Condominium (Corner) Built-up : 1,875 sq. ft. Partly Furnished
## 4 Condominium (Corner) Built-up : 1,513 sq. ft. Partly Furnished
## 5             Bungalow Land area : 7200 sq. ft. Partly Furnished
## 6  Semi-detached House Land area : 3600 sq. ft. Partly Furnished
dim(DF) #Dimension: 52,604 rows and 8 variables
## [1] 52604     8
str(DF) #To access structure of DF
## 'data.frame':    52604 obs. of  8 variables:
##  $ ï..Location  : chr  "KLCC, Kuala Lumpur" "Damansara Heights, Kuala Lumpur" "Dutamas, Kuala Lumpur" "Bukit Jalil, Kuala Lumpur" ...
##  $ Price        : chr  "RM 1,250,000" "RM 6,800,000" "RM 1,030,000" "RM 900,000" ...
##  $ Rooms        : chr  "2+1" "6" "3" "4+1" ...
##  $ Bathrooms    : int  3 7 4 3 5 4 4 2 3 3 ...
##  $ Car.Parks    : int  2 NA 2 2 4 4 3 1 NA NA ...
##  $ Property.Type: chr  "Serviced Residence" "Bungalow" "Condominium (Corner)" "Condominium (Corner)" ...
##  $ Size         : chr  "Built-up : 1,335 sq. ft." "Land area : 6900 sq. ft." "Built-up : 1,875 sq. ft." "Built-up : 1,513 sq. ft." ...
##  $ Furnishing   : chr  "Fully Furnished" "Partly Furnished" "Partly Furnished" "Partly Furnished" ...
summary(DF) #To access length and class for categorical variables; and display descriptive statistics for numerical variables
##  ï..Location           Price              Rooms             Bathrooms     
##  Length:52604       Length:52604       Length:52604       Min.   : 1.000  
##  Class :character   Class :character   Class :character   1st Qu.: 2.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 3.000  
##                                                           Mean   : 3.083  
##                                                           3rd Qu.: 4.000  
##                                                           Max.   :20.000  
##                                                           NA's   :1832    
##    Car.Parks      Property.Type          Size            Furnishing       
##  Min.   : 1.000   Length:52604       Length:52604       Length:52604      
##  1st Qu.: 1.000   Class :character   Class :character   Class :character  
##  Median : 2.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 2.012                                                           
##  3rd Qu.: 2.000                                                           
##  Max.   :30.000                                                           
##  NA's   :17085

2. Problem Statement:

Questions to be answered
1) How much is a property’s price per square feet influenced by its features (e.g. number of bedrooms, location and type of property)?
2) How to determine if a property is overvalued or undervalued given its features?

To answer question 1, we used the linear multiple regression model. Whilst for question 2, we adopted the k-nearest neighbors algorithm.

Therefore, for variables (e.g. bathrooms and size), we need to remove all strings and coerce the data into numeric data for the model. Some variables such as Furnishing will be ranked accordingly based on market convention that fully furnished often offers better value than partly furnished. Also, dummy variables will be created for the location and property type as they are nominal data. Notably, the locations will be grouped into their respective districts for practical purpose. Finally, new variables will also be created as the dependent variables.

3. Pre-processing:

#load packages to clean dataset
library(tidyr)
library(stringi)
library(stringr)
library(knitr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Price of properties:
After converting strings of the prices into numerics, we have filtered properties with a price range between RM 350,000 and RM 650,000. The rationale is that these mid-range priced properties appeal better to the majority of investors in terms of risk-return appetite.

#Convert strings of "Price" to numerical values and store in new variable "Price_RM"
DF$Price_RM <- str_remove_all(DF$Price,"[a-zA-Z ]")
DF$Price_RM <- as.numeric(str_remove_all(DF$Price_RM,"\\,"))
DF<-(DF[,(-2)])
head(DF)
##                         ï..Location Rooms Bathrooms Car.Parks
## 1                KLCC, Kuala Lumpur   2+1         3         2
## 2   Damansara Heights, Kuala Lumpur     6         7        NA
## 3             Dutamas, Kuala Lumpur     3         4         2
## 4         Bukit Jalil, Kuala Lumpur   4+1         3         2
## 5 Taman Tun Dr Ismail, Kuala Lumpur   4+2         5         4
## 6 Taman Tun Dr Ismail, Kuala Lumpur     5         4         4
##          Property.Type                     Size       Furnishing Price_RM
## 1   Serviced Residence Built-up : 1,335 sq. ft.  Fully Furnished  1250000
## 2             Bungalow Land area : 6900 sq. ft. Partly Furnished  6800000
## 3 Condominium (Corner) Built-up : 1,875 sq. ft. Partly Furnished  1030000
## 4 Condominium (Corner) Built-up : 1,513 sq. ft. Partly Furnished   900000
## 5             Bungalow Land area : 7200 sq. ft. Partly Furnished  5350000
## 6  Semi-detached House Land area : 3600 sq. ft. Partly Furnished  2600000
dim(DF)
## [1] 52604     8
#Select Price_RM ranging between 350,000 and 650,000. Reason: Mid-range properties fit majority investors' appetite.

DF <- filter(DF, !is.na(Price_RM))
DF <- filter(DF, Price_RM > 350000)
DF <- filter(DF, Price_RM < 650000)
dim(DF) #noted observations have reduced to 12,839 from 52,604
## [1] 12839     8

Size of properties:
For the size of properties, in converting to numerics from strings, we have also performed standardisation to the elements displayed by using mainly the function separate().

#Convert strings of "Size" to numerical values and store in new variable "Size_sqft"
DF$Size_sqft <- str_remove_all(DF$Size,"\\,")
DF$Size_sqft <- str_remove_all(DF$Size_sqft,"\\.")
DF$Size_sqft <- str_remove_all(DF$Size_sqft,"\\:")
DF$Size_sqft <- str_remove_all(DF$Size_sqft,"\\-")

DF$Size_sqft <- str_remove_all(DF$Size_sqft,"[a-wA-W ]")
DF$Size_sqft <- str_remove_all(DF$Size_sqft,"[y-zY-Z ]")

#Next, we need to standardise the display formats of "Size_sqft". There are some values presented in text format of length X width, which we will standardise by showing only the numeric output after multiplication.
DF$Size_sqft <- str_replace(DF$Size_sqft,"[x-xX-X ]","X") #to standardise the separator

DF <- DF %>%
  separate(Size_sqft,c("first","second"),sep="X") 
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 12009 rows [1, 2,
## 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, ...].
DF$second <- replace(DF$second, is.na(DF$second), 1)

DF <- DF[!(DF$first=="0"), ] #remove missing values & errors, rationale: our dependent variable pertains to price per square feet
DF <- DF[!(DF$first=="1"), ]
DF <- DF[!(DF$first=="/"), ]
DF <- DF[!(DF$first=="`1760"), ]
DF <- DF[!(DF$first==""), ]

DF <- transform(DF, first = as.numeric(first))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
typeof(DF$first)
## [1] "double"
DF <- DF[- grep("\\+", DF$second),]
DF <- DF[!(DF$second==""), ]
DF <- DF[!(DF$second=="0"), ]
DF <- transform(DF, second = as.numeric(second))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
head(DF)
##                   ï..Location Rooms Bathrooms Car.Parks
## 1  Sri Petaling, Kuala Lumpur     3         2         1
## 2 Bangsar South, Kuala Lumpur     1         1         1
## 3   Bukit Jalil, Kuala Lumpur   3+1         2         2
## 4        Kepong, Kuala Lumpur     3         2        NA
## 5  Ampang Hilir, Kuala Lumpur     1         1         1
## 6       Setapak, Kuala Lumpur   4+1         2         2
##                             Property.Type                      Size
## 1                Apartment (Intermediate)    Built-up : 904 sq. ft.
## 2                      Serviced Residence    Built-up : 520 sq. ft.
## 3              Condominium (Intermediate)  Built-up : 1,236 sq. ft.
## 4 2-sty Terrace/Link House (Intermediate) Land area : 16x55 sq. ft.
## 5                      Serviced Residence    Built-up : 613 sq. ft.
## 6                    Condominium (Corner)  Built-up : 1,550 sq. ft.
##         Furnishing Price_RM first second
## 1 Partly Furnished   385000   904      1
## 2  Fully Furnished   490000   520      1
## 3 Partly Furnished   610000  1236      1
## 4 Partly Furnished   560000    16     55
## 5  Fully Furnished   460000   613      1
## 6 Partly Furnished   580000  1550      1
Size_TotalSqft = DF$first * DF$second #to multiply length X width in sqft, and store products in new variable "Size_TotalSqft"
DF=cbind(DF,Size_TotalSqft) #to bind new variable "Size_TotalSqft" with DF
DF<-(DF[,-c(6,9,10)]) #Remove variables which are no longer required
head(DF)
##                   ï..Location Rooms Bathrooms Car.Parks
## 1  Sri Petaling, Kuala Lumpur     3         2         1
## 2 Bangsar South, Kuala Lumpur     1         1         1
## 3   Bukit Jalil, Kuala Lumpur   3+1         2         2
## 4        Kepong, Kuala Lumpur     3         2        NA
## 5  Ampang Hilir, Kuala Lumpur     1         1         1
## 6       Setapak, Kuala Lumpur   4+1         2         2
##                             Property.Type       Furnishing Price_RM
## 1                Apartment (Intermediate) Partly Furnished   385000
## 2                      Serviced Residence  Fully Furnished   490000
## 3              Condominium (Intermediate) Partly Furnished   610000
## 4 2-sty Terrace/Link House (Intermediate) Partly Furnished   560000
## 5                      Serviced Residence  Fully Furnished   460000
## 6                    Condominium (Corner) Partly Furnished   580000
##   Size_TotalSqft
## 1            904
## 2            520
## 3           1236
## 4            880
## 5            613
## 6           1550

Our interest is in the price of property per square feet. Hence, we have created the new variable “Price_persqft_RM” as follow:

Price_persqft_RM = DF$Price_RM / DF$Size_TotalSqft #
DF=cbind(DF,Price_persqft_RM)
head(DF)
##                   ï..Location Rooms Bathrooms Car.Parks
## 1  Sri Petaling, Kuala Lumpur     3         2         1
## 2 Bangsar South, Kuala Lumpur     1         1         1
## 3   Bukit Jalil, Kuala Lumpur   3+1         2         2
## 4        Kepong, Kuala Lumpur     3         2        NA
## 5  Ampang Hilir, Kuala Lumpur     1         1         1
## 6       Setapak, Kuala Lumpur   4+1         2         2
##                             Property.Type       Furnishing Price_RM
## 1                Apartment (Intermediate) Partly Furnished   385000
## 2                      Serviced Residence  Fully Furnished   490000
## 3              Condominium (Intermediate) Partly Furnished   610000
## 4 2-sty Terrace/Link House (Intermediate) Partly Furnished   560000
## 5                      Serviced Residence  Fully Furnished   460000
## 6                    Condominium (Corner) Partly Furnished   580000
##   Size_TotalSqft Price_persqft_RM
## 1            904         425.8850
## 2            520         942.3077
## 3           1236         493.5275
## 4            880         636.3636
## 5            613         750.4078
## 6           1550         374.1935
DF<-(DF[,-c(7)]) #remove the variable of "Price_RM"
head(DF)
##                   ï..Location Rooms Bathrooms Car.Parks
## 1  Sri Petaling, Kuala Lumpur     3         2         1
## 2 Bangsar South, Kuala Lumpur     1         1         1
## 3   Bukit Jalil, Kuala Lumpur   3+1         2         2
## 4        Kepong, Kuala Lumpur     3         2        NA
## 5  Ampang Hilir, Kuala Lumpur     1         1         1
## 6       Setapak, Kuala Lumpur   4+1         2         2
##                             Property.Type       Furnishing Size_TotalSqft
## 1                Apartment (Intermediate) Partly Furnished            904
## 2                      Serviced Residence  Fully Furnished            520
## 3              Condominium (Intermediate) Partly Furnished           1236
## 4 2-sty Terrace/Link House (Intermediate) Partly Furnished            880
## 5                      Serviced Residence  Fully Furnished            613
## 6                    Condominium (Corner) Partly Furnished           1550
##   Price_persqft_RM
## 1         425.8850
## 2         942.3077
## 3         493.5275
## 4         636.3636
## 5         750.4078
## 6         374.1935
dim(DF)
## [1] 12806     8
DF<-filter(DF,!is.na(DF$Price_persqft_RM)) #remove missing values, rationale: our dependent variable pertains to price
dim(DF)
## [1] 12789     8

Property Locations:
Here, we have mainly processed the data by classifying the 54 locations into 11 districts.

#Since all the listings are for properties in Kuala Lumpur, we can remove the string ", Kuala Lumpur"
DF$ï..Location <- str_remove_all(DF$ï..Location,", Kuala Lumpur")
head(DF)
##     ï..Location Rooms Bathrooms Car.Parks
## 1  Sri Petaling     3         2         1
## 2 Bangsar South     1         1         1
## 3   Bukit Jalil   3+1         2         2
## 4        Kepong     3         2        NA
## 5  Ampang Hilir     1         1         1
## 6       Setapak   4+1         2         2
##                             Property.Type       Furnishing Size_TotalSqft
## 1                Apartment (Intermediate) Partly Furnished            904
## 2                      Serviced Residence  Fully Furnished            520
## 3              Condominium (Intermediate) Partly Furnished           1236
## 4 2-sty Terrace/Link House (Intermediate) Partly Furnished            880
## 5                      Serviced Residence  Fully Furnished            613
## 6                    Condominium (Corner) Partly Furnished           1550
##   Price_persqft_RM
## 1         425.8850
## 2         942.3077
## 3         493.5275
## 4         636.3636
## 5         750.4078
## 6         374.1935
#Classify locations into districts to ease analysis
Locations_Unq<-unique(DF$ï..Location) #to remove duplicated values from Location and store in new vector
write.csv(Locations_Unq, "Locations_Unq.csv") #to export the new vector into csv file for us to classify into districts

#Import csv file containing districts which correspond to each location
Districts<-read.csv(file = 'Districts.csv', header = TRUE)
head(Districts)
##            ï..x      Districts Bukit.Bintang Titiwangsa Wangsa.Maju Batu Kepong
## 1  Sri Petaling        Seputeh             0          0           0    0      0
## 2 Bangsar South  Lembah Pantai             0          0           0    0      0
## 3   Bukit Jalil        Seputeh             0          0           0    0      0
## 4        Kepong         Kepong             0          0           0    0      1
## 5  Ampang Hilir     Titiwangsa             0          1           0    0      0
## 6       Setapak Segambut North             0          0           0    0      0
##   Segambut.South Segambut.North Lembah.Pantai Seputeh Bandar.Tun.Razak Cheras
## 1              0              0             0       1                0      0
## 2              0              0             1       0                0      0
## 3              0              0             0       1                0      0
## 4              0              0             0       0                0      0
## 5              0              0             0       0                0      0
## 6              0              1             0       0                0      0
#Merge the data frame of "Districts" with "DF"
DF<-left_join(DF, Districts, by = c("ï..Location" = "ï..x"),all.x=TRUE)
head(DF)
##     ï..Location Rooms Bathrooms Car.Parks
## 1  Sri Petaling     3         2         1
## 2 Bangsar South     1         1         1
## 3   Bukit Jalil   3+1         2         2
## 4        Kepong     3         2        NA
## 5  Ampang Hilir     1         1         1
## 6       Setapak   4+1         2         2
##                             Property.Type       Furnishing Size_TotalSqft
## 1                Apartment (Intermediate) Partly Furnished            904
## 2                      Serviced Residence  Fully Furnished            520
## 3              Condominium (Intermediate) Partly Furnished           1236
## 4 2-sty Terrace/Link House (Intermediate) Partly Furnished            880
## 5                      Serviced Residence  Fully Furnished            613
## 6                    Condominium (Corner) Partly Furnished           1550
##   Price_persqft_RM      Districts Bukit.Bintang Titiwangsa Wangsa.Maju Batu
## 1         425.8850        Seputeh             0          0           0    0
## 2         942.3077  Lembah Pantai             0          0           0    0
## 3         493.5275        Seputeh             0          0           0    0
## 4         636.3636         Kepong             0          0           0    0
## 5         750.4078     Titiwangsa             0          1           0    0
## 6         374.1935 Segambut North             0          0           0    0
##   Kepong Segambut.South Segambut.North Lembah.Pantai Seputeh Bandar.Tun.Razak
## 1      0              0              0             0       1                0
## 2      0              0              0             1       0                0
## 3      0              0              0             0       1                0
## 4      1              0              0             0       0                0
## 5      0              0              0             0       0                0
## 6      0              0              1             0       0                0
##   Cheras
## 1      0
## 2      0
## 3      0
## 4      0
## 5      0
## 6      0
#Remove Locations variable
DF<-(DF[,-c(1)])
head(DF)   #instead of having 54 Locations per initial dataset, we now only need to focus on 11 districts
##   Rooms Bathrooms Car.Parks                           Property.Type
## 1     3         2         1                Apartment (Intermediate)
## 2     1         1         1                      Serviced Residence
## 3   3+1         2         2              Condominium (Intermediate)
## 4     3         2        NA 2-sty Terrace/Link House (Intermediate)
## 5     1         1         1                      Serviced Residence
## 6   4+1         2         2                    Condominium (Corner)
##         Furnishing Size_TotalSqft Price_persqft_RM      Districts Bukit.Bintang
## 1 Partly Furnished            904         425.8850        Seputeh             0
## 2  Fully Furnished            520         942.3077  Lembah Pantai             0
## 3 Partly Furnished           1236         493.5275        Seputeh             0
## 4 Partly Furnished            880         636.3636         Kepong             0
## 5  Fully Furnished            613         750.4078     Titiwangsa             0
## 6 Partly Furnished           1550         374.1935 Segambut North             0
##   Titiwangsa Wangsa.Maju Batu Kepong Segambut.South Segambut.North
## 1          0           0    0      0              0              0
## 2          0           0    0      0              0              0
## 3          0           0    0      0              0              0
## 4          0           0    0      1              0              0
## 5          1           0    0      0              0              0
## 6          0           0    0      0              0              1
##   Lembah.Pantai Seputeh Bandar.Tun.Razak Cheras
## 1             0       1                0      0
## 2             1       0                0      0
## 3             0       1                0      0
## 4             0       0                0      0
## 5             0       0                0      0
## 6             0       0                0      0

To address the missing values
To imputate the missing values for number of bedrooms, bathrooms and car parks, we used the linear regression model.

DF <- DF %>%
  separate(Rooms,c("Bedrooms","second"),sep="\\+")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 11159 rows [1, 2,
## 4, 5, 7, 8, 11, 12, 13, 14, 17, 18, 19, 20, 22, 24, 26, 27, 28, 29, ...].
DF$Bedrooms <- replace(DF$Bedrooms, DF$Bedrooms=="Studio", 1)
DF$Bedrooms <- as.numeric(DF$Bedrooms)

DF$second<-replace(DF$second, is.na(DF$second), 0)
DF$second<-replace(DF$second, DF$second == 1, 0.5)
DF$second<-replace(DF$second, DF$second == 2, 1)
DF$second<-as.numeric(DF$second)
DF$Number_Bedrooms <- DF$Bedrooms + DF$second

head(DF)
##   Bedrooms second Bathrooms Car.Parks                           Property.Type
## 1        3    0.0         2         1                Apartment (Intermediate)
## 2        1    0.0         1         1                      Serviced Residence
## 3        3    0.5         2         2              Condominium (Intermediate)
## 4        3    0.0         2        NA 2-sty Terrace/Link House (Intermediate)
## 5        1    0.0         1         1                      Serviced Residence
## 6        4    0.5         2         2                    Condominium (Corner)
##         Furnishing Size_TotalSqft Price_persqft_RM      Districts Bukit.Bintang
## 1 Partly Furnished            904         425.8850        Seputeh             0
## 2  Fully Furnished            520         942.3077  Lembah Pantai             0
## 3 Partly Furnished           1236         493.5275        Seputeh             0
## 4 Partly Furnished            880         636.3636         Kepong             0
## 5  Fully Furnished            613         750.4078     Titiwangsa             0
## 6 Partly Furnished           1550         374.1935 Segambut North             0
##   Titiwangsa Wangsa.Maju Batu Kepong Segambut.South Segambut.North
## 1          0           0    0      0              0              0
## 2          0           0    0      0              0              0
## 3          0           0    0      0              0              0
## 4          0           0    0      1              0              0
## 5          1           0    0      0              0              0
## 6          0           0    0      0              0              1
##   Lembah.Pantai Seputeh Bandar.Tun.Razak Cheras Number_Bedrooms
## 1             0       1                0      0             3.0
## 2             1       0                0      0             1.0
## 3             0       1                0      0             3.5
## 4             0       0                0      0             3.0
## 5             0       0                0      0             1.0
## 6             0       0                0      0             4.5
DF <- DF[,-c(1,2)]
head(DF)
##   Bathrooms Car.Parks                           Property.Type       Furnishing
## 1         2         1                Apartment (Intermediate) Partly Furnished
## 2         1         1                      Serviced Residence  Fully Furnished
## 3         2         2              Condominium (Intermediate) Partly Furnished
## 4         2        NA 2-sty Terrace/Link House (Intermediate) Partly Furnished
## 5         1         1                      Serviced Residence  Fully Furnished
## 6         2         2                    Condominium (Corner) Partly Furnished
##   Size_TotalSqft Price_persqft_RM      Districts Bukit.Bintang Titiwangsa
## 1            904         425.8850        Seputeh             0          0
## 2            520         942.3077  Lembah Pantai             0          0
## 3           1236         493.5275        Seputeh             0          0
## 4            880         636.3636         Kepong             0          0
## 5            613         750.4078     Titiwangsa             0          1
## 6           1550         374.1935 Segambut North             0          0
##   Wangsa.Maju Batu Kepong Segambut.South Segambut.North Lembah.Pantai Seputeh
## 1           0    0      0              0              0             0       1
## 2           0    0      0              0              0             1       0
## 3           0    0      0              0              0             0       1
## 4           0    0      1              0              0             0       0
## 5           0    0      0              0              0             0       0
## 6           0    0      0              0              1             0       0
##   Bandar.Tun.Razak Cheras Number_Bedrooms
## 1                0      0             3.0
## 2                0      0             1.0
## 3                0      0             3.5
## 4                0      0             3.0
## 5                0      0             1.0
## 6                0      0             4.5
#For missing values in Number_Bedrooms, we imputated them using simple linear regression method
DF$Number_Bedrooms<-replace(DF$Number_Bedrooms, is.na(DF$Number_Bedrooms), 0)

lm(Number_Bedrooms ~ Size_TotalSqft, DF) #regress y=Number_Bedrooms against x=Size_TotalSqft
## 
## Call:
## lm(formula = Number_Bedrooms ~ Size_TotalSqft, data = DF)
## 
## Coefficients:
##    (Intercept)  Size_TotalSqft  
##      2.768e+00       3.050e-09
for(i in 1:nrow(DF))
{
  if(DF$Number_Bedrooms[i] == 0)
  {
    DF$Number_Bedrooms[i] =  2.768e+00  + 3.050e-09 *DF$Size_TotalSqft[i]
  }
}
#For missing values in Bathrooms, we imputated them using simple linear regression method
DF$Bathrooms <- replace(DF$Bathrooms, is.na(DF$Bathrooms), 0)

lm(Bathrooms ~ Size_TotalSqft, DF) #y=Bathrooms against x=Size_TotalSqft
## 
## Call:
## lm(formula = Bathrooms ~ Size_TotalSqft, data = DF)
## 
## Coefficients:
##    (Intercept)  Size_TotalSqft  
##      1.967e+00       2.121e-08
for(i in 1:nrow(DF))
{
  if(DF$Bathrooms[i] == 0)
  {
    DF$Bathrooms[i] =  1.967e+00  + 2.121e-08 *DF$Size_TotalSqft[i]
  }
}
DF$Car.Parks <- replace(DF$Car.Parks, is.na(DF$Car.Parks), 0)
lm(Car.Parks ~ Size_TotalSqft, DF)
## 
## Call:
## lm(formula = Car.Parks ~ Size_TotalSqft, data = DF)
## 
## Coefficients:
##    (Intercept)  Size_TotalSqft  
##      1.148e+00       7.843e-08
for(i in 1:nrow(DF))
{
  if(DF$Car.Parks[i] == 0)
  {
    DF$Car.Parks[i] =   1.148e+00   + 7.843e-08 *DF$Size_TotalSqft[i]
  }
}
 head(DF)
##   Bathrooms Car.Parks                           Property.Type       Furnishing
## 1         2  1.000000                Apartment (Intermediate) Partly Furnished
## 2         1  1.000000                      Serviced Residence  Fully Furnished
## 3         2  2.000000              Condominium (Intermediate) Partly Furnished
## 4         2  1.148069 2-sty Terrace/Link House (Intermediate) Partly Furnished
## 5         1  1.000000                      Serviced Residence  Fully Furnished
## 6         2  2.000000                    Condominium (Corner) Partly Furnished
##   Size_TotalSqft Price_persqft_RM      Districts Bukit.Bintang Titiwangsa
## 1            904         425.8850        Seputeh             0          0
## 2            520         942.3077  Lembah Pantai             0          0
## 3           1236         493.5275        Seputeh             0          0
## 4            880         636.3636         Kepong             0          0
## 5            613         750.4078     Titiwangsa             0          1
## 6           1550         374.1935 Segambut North             0          0
##   Wangsa.Maju Batu Kepong Segambut.South Segambut.North Lembah.Pantai Seputeh
## 1           0    0      0              0              0             0       1
## 2           0    0      0              0              0             1       0
## 3           0    0      0              0              0             0       1
## 4           0    0      1              0              0             0       0
## 5           0    0      0              0              0             0       0
## 6           0    0      0              0              1             0       0
##   Bandar.Tun.Razak Cheras Number_Bedrooms
## 1                0      0             3.0
## 2                0      0             1.0
## 3                0      0             3.5
## 4                0      0             3.0
## 5                0      0             1.0
## 6                0      0             4.5

We ranked the furnishing as follow:

DF$Furnishing_Rank <- 1
DF$Furnishing_Rank[DF$Furnishing == "Partly Furnished"] <- 2
DF$Furnishing_Rank[DF$Furnishing == "Fully Furnished"] <- 3
DF <- DF[,-4]
head(DF)
##   Bathrooms Car.Parks                           Property.Type Size_TotalSqft
## 1         2  1.000000                Apartment (Intermediate)            904
## 2         1  1.000000                      Serviced Residence            520
## 3         2  2.000000              Condominium (Intermediate)           1236
## 4         2  1.148069 2-sty Terrace/Link House (Intermediate)            880
## 5         1  1.000000                      Serviced Residence            613
## 6         2  2.000000                    Condominium (Corner)           1550
##   Price_persqft_RM      Districts Bukit.Bintang Titiwangsa Wangsa.Maju Batu
## 1         425.8850        Seputeh             0          0           0    0
## 2         942.3077  Lembah Pantai             0          0           0    0
## 3         493.5275        Seputeh             0          0           0    0
## 4         636.3636         Kepong             0          0           0    0
## 5         750.4078     Titiwangsa             0          1           0    0
## 6         374.1935 Segambut North             0          0           0    0
##   Kepong Segambut.South Segambut.North Lembah.Pantai Seputeh Bandar.Tun.Razak
## 1      0              0              0             0       1                0
## 2      0              0              0             1       0                0
## 3      0              0              0             0       1                0
## 4      1              0              0             0       0                0
## 5      0              0              0             0       0                0
## 6      0              0              1             0       0                0
##   Cheras Number_Bedrooms Furnishing_Rank
## 1      0             3.0               2
## 2      0             1.0               3
## 3      0             3.5               2
## 4      0             3.0               2
## 5      0             1.0               3
## 6      0             4.5               2

For property types, we converted into dummy variables."

DF[grep("Terrace",DF$Property.Type,value = F),"Terrace"] <- as.numeric(1)
DF[grep("Condominium",DF$Property.Type,value = F),"Condominium"] <- as.numeric(1)
DF[grep("Serviced Residence",DF$Property.Type,value = F),"Serviced Residence"] <- as.numeric(1)
DF[grep("Townhouse",DF$Property.Type,value = F),"Townhouse"] <- as.numeric(1)

DF$Terrace <- replace(DF$Terrace, is.na(DF$Terrace), 0)
DF$Condominium <- replace(DF$Condominium, is.na(DF$Condominium), 0)
DF$`Serviced Residence` <- replace(DF$`Serviced Residence`, is.na(DF$`Serviced Residence`), 0)
DF$Townhouse <- replace(DF$Townhouse, is.na(DF$Townhouse), 0)

DF <- DF[,-3]
head(DF)
##   Bathrooms Car.Parks Size_TotalSqft Price_persqft_RM      Districts
## 1         2  1.000000            904         425.8850        Seputeh
## 2         1  1.000000            520         942.3077  Lembah Pantai
## 3         2  2.000000           1236         493.5275        Seputeh
## 4         2  1.148069            880         636.3636         Kepong
## 5         1  1.000000            613         750.4078     Titiwangsa
## 6         2  2.000000           1550         374.1935 Segambut North
##   Bukit.Bintang Titiwangsa Wangsa.Maju Batu Kepong Segambut.South
## 1             0          0           0    0      0              0
## 2             0          0           0    0      0              0
## 3             0          0           0    0      0              0
## 4             0          0           0    0      1              0
## 5             0          1           0    0      0              0
## 6             0          0           0    0      0              0
##   Segambut.North Lembah.Pantai Seputeh Bandar.Tun.Razak Cheras Number_Bedrooms
## 1              0             0       1                0      0             3.0
## 2              0             1       0                0      0             1.0
## 3              0             0       1                0      0             3.5
## 4              0             0       0                0      0             3.0
## 5              0             0       0                0      0             1.0
## 6              1             0       0                0      0             4.5
##   Furnishing_Rank Terrace Condominium Serviced Residence Townhouse
## 1               2       0           0                  0         0
## 2               3       0           0                  1         0
## 3               2       0           1                  0         0
## 4               2       1           0                  0         0
## 5               3       0           0                  1         0
## 6               2       0           1                  0         0

Lastly, we renamed some variables.

names(DF)[names(DF)=="Bathrooms"]<-"Number_Bathrooms"
names(DF)[names(DF)=="Car.Parks"]<-"Number_CarParks"

Putting together all into a cleaned dataset:

sum(is.na(DF))
## [1] 0
str(DF)
## 'data.frame':    12789 obs. of  22 variables:
##  $ Number_Bathrooms  : num  2 1 2 2 1 2 2 2 1 1 ...
##  $ Number_CarParks   : num  1 1 2 1.15 1 ...
##  $ Size_TotalSqft    : num  904 520 1236 880 613 ...
##  $ Price_persqft_RM  : num  426 942 494 636 750 ...
##  $ Districts         : chr  "Seputeh" "Lembah Pantai" "Seputeh" "Kepong" ...
##  $ Bukit.Bintang     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Titiwangsa        : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Wangsa.Maju       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Batu              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Kepong            : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ Segambut.South    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Segambut.North    : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Lembah.Pantai     : int  0 1 0 0 0 0 0 0 1 1 ...
##  $ Seputeh           : int  1 0 1 0 0 0 1 1 0 0 ...
##  $ Bandar.Tun.Razak  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Cheras            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Number_Bedrooms   : num  3 1 3.5 3 1 4.5 3 2 1.5 1.5 ...
##  $ Furnishing_Rank   : num  2 3 2 2 3 2 2 2 2 3 ...
##  $ Terrace           : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Condominium       : num  0 0 1 0 0 1 1 0 0 0 ...
##  $ Serviced Residence: num  0 1 0 0 1 0 0 1 1 1 ...
##  $ Townhouse         : num  0 0 0 0 0 0 0 0 0 0 ...
head(DF)
##   Number_Bathrooms Number_CarParks Size_TotalSqft Price_persqft_RM
## 1                2        1.000000            904         425.8850
## 2                1        1.000000            520         942.3077
## 3                2        2.000000           1236         493.5275
## 4                2        1.148069            880         636.3636
## 5                1        1.000000            613         750.4078
## 6                2        2.000000           1550         374.1935
##        Districts Bukit.Bintang Titiwangsa Wangsa.Maju Batu Kepong
## 1        Seputeh             0          0           0    0      0
## 2  Lembah Pantai             0          0           0    0      0
## 3        Seputeh             0          0           0    0      0
## 4         Kepong             0          0           0    0      1
## 5     Titiwangsa             0          1           0    0      0
## 6 Segambut North             0          0           0    0      0
##   Segambut.South Segambut.North Lembah.Pantai Seputeh Bandar.Tun.Razak Cheras
## 1              0              0             0       1                0      0
## 2              0              0             1       0                0      0
## 3              0              0             0       1                0      0
## 4              0              0             0       0                0      0
## 5              0              0             0       0                0      0
## 6              0              1             0       0                0      0
##   Number_Bedrooms Furnishing_Rank Terrace Condominium Serviced Residence
## 1             3.0               2       0           0                  0
## 2             1.0               3       0           0                  1
## 3             3.5               2       0           1                  0
## 4             3.0               2       1           0                  0
## 5             1.0               3       0           0                  1
## 6             4.5               2       0           1                  0
##   Townhouse
## 1         0
## 2         0
## 3         0
## 4         0
## 5         0
## 6         0

4. Exploratory Data Analysis
First, we detected the outliers in Price per sqft based on Districts. After accessing the values of the outliers, we removed them as the values are too low (ie. below RM 10 per sqft) and hence are assumed to be errors in the initial dataset.

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(lattice)
library(corrplot)
## corrplot 0.84 loaded
#4.1 Detect outliers in Price per sqft based on Districts
boxplot(Price_persqft_RM ~ Districts, DF, xlab = "Districts", ylab = "Price_persqft(RM)") #view initial boxplot

outliers<-boxplot(Price_persqft_RM ~ Districts, DF, xlab = "Districts", ylab = "Price_persqft(RM)")$out #identify outliers and save in new variable "outliers"

dim(DF) #display dimension before removing outliers, ie. 12,789 observations
## [1] 12789    22
DF <- DF[-which(DF$Price_persqft_RM %in% outliers),] #remove outliers from dataset
dim(DF) #after removing outliers, we now have 12,220 observations
## [1] 12220    22
boxplot(Price_persqft_RM ~ Districts, DF, xlab = "Districts", ylab = "Price_persqft(RM)") #to display boxplot after outliers were eliminated

#4.2 Bivariate correlation matrix
head(DF)
##   Number_Bathrooms Number_CarParks Size_TotalSqft Price_persqft_RM
## 1                2        1.000000            904         425.8850
## 2                1        1.000000            520         942.3077
## 3                2        2.000000           1236         493.5275
## 4                2        1.148069            880         636.3636
## 5                1        1.000000            613         750.4078
## 6                2        2.000000           1550         374.1935
##        Districts Bukit.Bintang Titiwangsa Wangsa.Maju Batu Kepong
## 1        Seputeh             0          0           0    0      0
## 2  Lembah Pantai             0          0           0    0      0
## 3        Seputeh             0          0           0    0      0
## 4         Kepong             0          0           0    0      1
## 5     Titiwangsa             0          1           0    0      0
## 6 Segambut North             0          0           0    0      0
##   Segambut.South Segambut.North Lembah.Pantai Seputeh Bandar.Tun.Razak Cheras
## 1              0              0             0       1                0      0
## 2              0              0             1       0                0      0
## 3              0              0             0       1                0      0
## 4              0              0             0       0                0      0
## 5              0              0             0       0                0      0
## 6              0              1             0       0                0      0
##   Number_Bedrooms Furnishing_Rank Terrace Condominium Serviced Residence
## 1             3.0               2       0           0                  0
## 2             1.0               3       0           0                  1
## 3             3.5               2       0           1                  0
## 4             3.0               2       1           0                  0
## 5             1.0               3       0           0                  1
## 6             4.5               2       0           1                  0
##   Townhouse
## 1         0
## 2         0
## 3         0
## 4         0
## 5         0
## 6         0
DF[,c(1)]<-as.numeric(DF[,c(1)])
DF[,c(2)]<-as.numeric(DF[,c(2)])
DF[,c(3)]<-as.numeric(DF[,c(3)])
DF[,c(4)]<-as.numeric(DF[,c(4)])
DF[,c(17)]<-as.numeric(DF[,c(17)])
DF[,c(18)]<-as.numeric(DF[,c(18)])
      
correlations <- cor(DF[,c(1,2,3,17,18,4)], use="everything")
corrplot(correlations, method="circle", type="lower",  sig.level = 0.01, insig = "blank")

#4.3 Correlation - Scatter Plot
ggpairs(DF[,c(1,2,3,17,18,4)])+ theme_bw()

#4.4 Frequency of Listings based on Districts
barchart(DF$Districts, #This specifies the dataset and the variable
         horizontal = FALSE, #Turn the bars so they are vertical
         main = "Frequency of listings based on location", #Give your chart a title
         xlab = "Districts", #Label the x axis
         ylab = "Frequency", #Label the y axis 
         col = "darkgreen") #change the color of the bars

prop<-data.frame(prop.table(table(DF$Districts)))
prop
##                Var1       Freq
## 1  Bandar Tun Razak 0.04484452
## 2              Batu 0.03527005
## 3     Bukit Bintang 0.08109656
## 4            Cheras 0.14075286
## 5            Kepong 0.06440262
## 6     Lembah Pantai 0.02381342
## 7    Segambut North 0.17847791
## 8    Segambut South 0.11227496
## 9           Seputeh 0.26309329
## 10       Titiwangsa 0.02094926
## 11      Wangsa Maju 0.03502455

5.Regression:
We had split the dataset into two ie. training and test datasets, using 80:20. Next, we built the linar multiple regression model using the training dataset.

#To split dataset into training and test using 80:20

library(lattice)
library(caret)
library(dplyr)
library(klaR)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(e1071)
library(mlbench)

DF_Regression <-(DF[,-c(5,12)])
head(DF_Regression)
##   Number_Bathrooms Number_CarParks Size_TotalSqft Price_persqft_RM
## 1                2        1.000000            904         425.8850
## 2                1        1.000000            520         942.3077
## 3                2        2.000000           1236         493.5275
## 4                2        1.148069            880         636.3636
## 5                1        1.000000            613         750.4078
## 6                2        2.000000           1550         374.1935
##   Bukit.Bintang Titiwangsa Wangsa.Maju Batu Kepong Segambut.South Lembah.Pantai
## 1             0          0           0    0      0              0             0
## 2             0          0           0    0      0              0             1
## 3             0          0           0    0      0              0             0
## 4             0          0           0    0      1              0             0
## 5             0          1           0    0      0              0             0
## 6             0          0           0    0      0              0             0
##   Seputeh Bandar.Tun.Razak Cheras Number_Bedrooms Furnishing_Rank Terrace
## 1       1                0      0             3.0               2       0
## 2       0                0      0             1.0               3       0
## 3       1                0      0             3.5               2       0
## 4       0                0      0             3.0               2       1
## 5       0                0      0             1.0               3       0
## 6       0                0      0             4.5               2       0
##   Condominium Serviced Residence Townhouse
## 1           0                  0         0
## 2           0                  1         0
## 3           1                  0         0
## 4           0                  0         0
## 5           0                  1         0
## 6           1                  0         0
set.seed(3333)

split <- 0.8

trainIndex_regression <- createDataPartition(DF_Regression$Price_persqft_RM, p = split, list= FALSE)
data_train_LM <- DF_Regression[trainIndex_regression, ]
data_test_LM <- DF_Regression[-trainIndex_regression,]

#Build linear multiple regression model
regression <- lm(Price_persqft_RM ~., data = data_train_LM)
summary(regression)
## 
## Call:
## lm(formula = Price_persqft_RM ~ ., data = data_train_LM)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -740.74  -65.48   -2.10   59.45  921.86 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          673.177251   8.420525  79.945  < 2e-16 ***
## Number_Bathrooms     -22.572967   2.546608  -8.864  < 2e-16 ***
## Number_CarParks       17.148592   1.716103   9.993  < 2e-16 ***
## Size_TotalSqft        -0.001780   0.000175 -10.173  < 2e-16 ***
## Bukit.Bintang        128.053340   4.518751  28.338  < 2e-16 ***
## Titiwangsa            40.680972   7.714082   5.274 1.37e-07 ***
## Wangsa.Maju           25.631167   5.952054   4.306 1.68e-05 ***
## Batu                  38.966579   5.941624   6.558 5.72e-11 ***
## Kepong                -7.833270   4.871376  -1.608   0.1079    
## Segambut.South        46.841965   3.921021  11.946  < 2e-16 ***
## Lembah.Pantai        154.078347   7.245911  21.264  < 2e-16 ***
## Seputeh               16.398016   3.217326   5.097 3.52e-07 ***
## Bandar.Tun.Razak      13.270952   5.470262   2.426   0.0153 *  
## Cheras                22.739328   3.701070   6.144 8.36e-10 ***
## Number_Bedrooms      -92.568454   1.867058 -49.580  < 2e-16 ***
## Furnishing_Rank       11.270033   1.525214   7.389 1.60e-13 ***
## Terrace               98.986902   5.670514  17.456  < 2e-16 ***
## Condominium           38.449318   4.636902   8.292  < 2e-16 ***
## `Serviced Residence` 122.574729   4.943251  24.796  < 2e-16 ***
## Townhouse            -26.420706  10.928812  -2.418   0.0156 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 100.7 on 9757 degrees of freedom
## Multiple R-squared:  0.5823, Adjusted R-squared:  0.5815 
## F-statistic: 715.8 on 19 and 9757 DF,  p-value: < 2.2e-16
pred_regression <- predict(regression, data_test_LM)
error <- data_test_LM$Price_persqft_RM - pred_regression
RMSE <- sqrt(mean(error**2))
RMSE
## [1] 99.8265

From the output above, most of the independent variables are statistically significant. The model is able to explain 58% (R-squared) of the variation in property’s price per square feet.

6.Classification:

#Step 1: Create a additional variable to be classified as overvalued, average or undervalued

DFML <- DF %>%
  group_by(Districts) %>%
  mutate(location_mean = mean(Price_persqft_RM))
#A property is considered overvalued if the price psf if 15% higher than the district mean price psf and vice versa. 
#Subsequently, a property is ranked accordingly with 1 is undervalued, 2 is average valued and 3 is overvalued. 

DFML$deviation <- 2
DFML$deviation[DFML$Price_persqft_RM > 1.15*DFML$location_mean] <- 3
DFML$deviation[DFML$Price_persqft_RM < 0.85*DFML$location_mean] <- 1
DFML <- DFML[,-c(3,4,5,12,23)]

#Step 2: Change variables to factor
DFML$Bukit.Bintang <- as.factor(DFML$Bukit.Bintang)
DFML$Titiwangsa <- as.factor(DFML$Titiwangsa)
DFML$Wangsa.Maju <- as.factor(DFML$Wangsa.Maju)
DFML$Batu <- as.factor(DFML$Batu)
DFML$Kepong <- as.factor(DFML$Kepong)
DFML$Segambut.South <- as.factor(DFML$Segambut.South)
DFML$Lembah.Pantai <- as.factor(DFML$Lembah.Pantai)
DFML$Seputeh <- as.factor(DFML$Seputeh)
DFML$Bandar.Tun.Razak <- as.factor(DFML$Bandar.Tun.Razak)
DFML$Cheras <- as.factor(DFML$Cheras)
DFML$Furnishing_Rank <- as.factor(DFML$Furnishing_Rank)
DFML$Terrace <- as.factor(DFML$Terrace)
DFML$Condominium <- as.factor(DFML$Condominium)
DFML$`Serviced Residence` <- as.factor(DFML$`Serviced Residence`)
DFML$Townhouse <- as.factor(DFML$Townhouse)
DFML$deviation <- as.factor(DFML$deviation)

#Step 3: Train the KNN model with 80:20 split for training =:testing and fold set at 5 to avoid overfitting.
set.seed(3333)

split <- 0.8

trainIndex <- createDataPartition(DFML$deviation, p = split, list= FALSE)
data_train <- DFML[trainIndex, ]
## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
data_test <- DFML[-trainIndex,]

trctrl <- trainControl(method = "cv", number = 5)

model <- train(deviation~., method = "knn", data = data_train, metric = "Accuracy", trControl = trctrl)
model
## k-Nearest Neighbors 
## 
## 9777 samples
##   18 predictor
##    3 classes: '1', '2', '3' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 7822, 7822, 7822, 7821, 7821 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.6828276  0.4996129
##   7  0.6824194  0.4984215
##   9  0.6799634  0.4939164
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
#Step 4: Prediction against data_test and evaluate the result using confusion matrix. 
predictions <- predict(model, data_test)

confusionMatrix(predictions,data_test$deviation)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3
##          1 513 219  31
##          2 237 783 182
##          3  16  89 373
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6832          
##                  95% CI : (0.6643, 0.7016)
##     No Information Rate : 0.4466          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5014          
##                                           
##  Mcnemar's Test P-Value : 3.763e-08       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.6697   0.7177   0.6365
## Specificity            0.8509   0.6901   0.9435
## Pos Pred Value         0.6723   0.6514   0.7803
## Neg Pred Value         0.8494   0.7518   0.8916
## Prevalence             0.3135   0.4466   0.2399
## Detection Rate         0.2100   0.3205   0.1527
## Detection Prevalence   0.3123   0.4920   0.1957
## Balanced Accuracy      0.7603   0.7039   0.7900

Conclusion

Question 1 From our regression model, the independent variables are able to explain 58% (R-squared) of the variation in property’s price per square feet and most of the independent variables ranging from number of rooms to its district are statistically significant. Note that the number of bathrooms and bedrooms have negative coefficients which reduced the property’s price psf. The rationale is likely that the increase in size of the property outgrow the overall property’s price as developers are able to procure cheaper material cost given the extra quantity. On the districts, properties are generally priced higher than the properties located in North of Segambut, except for Kepong.

Question 2 Our KNN model generally predicted correctly 68.6% of the overall test data. As the p-value [Acc >NIR] is significant, our model is able to add value to the prediction as compared prediction with no information at all. Therefore, it is safe to say that our model is able to correctly predict an overvalued or undervalued property more than half of the time given its district, type of property and additional features as input.