Malaysia’s High Rise Residential Unit Price

options(warn = -1)

WQD 7004 Group project by Group 16

2023/2024 Semester 1

Instructor: Dr. Ang Tan Fong

Group 16 Members:

Tun Hao Wong (S23062237)
- Packages Loading
- Overview of Data
- Data Preparation and Data Cleaning
- Exploratory Data Analysis
Amanda Lee See Wai (23053894)
- Exploratory Data Analysis
Kelly Chan Siu Thon (s2153358)
- Regression, Model Evaluation and Insights
Chua Xiao ying (22119324)
- Classification, Model Evaluation and Insights
Lo Vui Kiong (23051586)
- Regression, Model Evaluation and Insights

Summary

This project was completed using a dataset acquired from Kaggle (https://www.kaggle.com/datasets/mcpenguin/raw-malaysian-housing-prices-data/data).

This dataset contains more than four thousand condominium unit listings on the Malaysian housing website mudah.my. The dataset was last scraped and updated in September 2023.

Objective

The two main objectives of this project are:

to predict the condominium prices in Malaysia
to classify the condominiums into three categories, which are affordable condominiums, middle-ranges condominums and high end condominum

Project Steps

Packages Loading
Overview of Data
Data Preparation
Exploratory data analysis
Predictive Modeling
Conclusion & Future Works

0. Packages Loading

Before exploring the data and building the models, we need to load some necessary packages and call the libraries for this analysis.

install.packages(“caret”)

# For data manipulation and visualization
library(dplyr)        # For data manipulation.

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)    # Designed to make it easier to work with dates and times

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(tidymodels)   # For modeling and machine learning using tidy principles

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tibble       3.2.1
## ✔ ggplot2      3.4.4     ✔ tidyr        1.3.0
## ✔ infer        1.0.5     ✔ tune         1.1.2
## ✔ modeldata    1.2.0     ✔ workflows    1.1.3
## ✔ parsnip      1.1.1     ✔ workflowsets 1.0.1
## ✔ purrr        1.0.2     ✔ yardstick    1.2.0
## ✔ recipes      1.0.9

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.

library(caret)        # Streamline the model training process for regression & classification problems

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity

## The following object is masked from 'package:purrr':
## 
##     lift

# library(leaflet)    # Creates an interactive map
library(GGally)       # Extends ggplot2 by adding several functions to reduce the complexity of combining geoms with transformed data.

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(Amelia)       # Used for imputing missing data

## Loading required package: Rcpp

## 
## Attaching package: 'Rcpp'

## The following object is masked from 'package:rsample':
## 
##     populate

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2024 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

library(corrplot)     # Visualizing a correlation matrix in R

## corrplot 0.92 loaded

library(cluster)      # Provides functions for cluster analysis
library(clustertend)  # Provides indices for determining the number of clusters in a dataset

## Package `clustertend` is deprecated.  Use package `hopkins` instead.

library(factoextra)   # Provides a collection of functions to extract clusters and assess clustering tendency

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(caTools)      # Provides various functions for sampling and splitting datasets
library(tm)           # Provides a framework for managing and analyzing textual data

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

library(SnowballC)    # Provides stemming algorithms for text analysis. 
library(wordcloud)    # To create visual representations of word frequencies in a corpus of text

## Loading required package: RColorBrewer

library(ggmap)        # For visualizing spatial data & creating maps by combining ggplot2 & Google Maps

## ℹ Google's Terms of Service: <https://mapsplatform.google.com>
##   Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
##   OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.

library(RColorBrewer) # Package in R provides color palettes designed for thematic mapping and data visualization
library(e1071)        # For support vector machines (SVM) and other statistical learning methods

## 
## Attaching package: 'e1071'

## The following object is masked from 'package:tune':
## 
##     tune

## The following object is masked from 'package:rsample':
## 
##     permutations

## The following object is masked from 'package:parsnip':
## 
##     tune

library(gbm)          # For gradient boosting algorithms, a powerfu l tool in predictive modeling

## Loaded gbm 2.1.9

## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

library(nnet)         # For multinomial logistic regression

1. Overview of Data

1.1 Importing the Dataset

file_path <- "C:\\Users\\User\\OneDrive\\Desktop\\Programming for Data Science\\houses.csv"
df <- read.csv(file_path)

1.2 Overview of Imported Dataset

str(df)

## 'data.frame':    4000 obs. of  32 variables:
##  $ description           : chr  "Iconic Building @ KL SETAPAK\nNew launching & Latest condo !!!!! 🔥\nHouse with luxury hotel concept 😍👑\n💎 Freeh"| __truncated__ "FOR SALE @ RM250,000\nIntroduction:\n~ Pangsapuri Kenanga @ Kampung Lapan\n~ 980 sqft\n~ 3 Bedrooms & 2 Bathroo"| __truncated__ "[Below Market] Sri Lavender Apartment,Tmn Sepakat Indah,100% FULL LOAN\nRM 230,000\n(💥BELOW MARKET VALUE💥)\nMAR"| __truncated__ "Flat Pandan Indah\nJalan Pandan Indah 3/3\nNon Bumi lot, 100% loan\n=====================\n• Walk up flat\n• No"| __truncated__ ...
##  $ Bedroom               : chr  "4" "3" "3" "3" ...
##  $ Bathroom              : chr  "2" "2" "2" "1" ...
##  $ Property.Size         : chr  "1000 sq.ft." "980 sq.ft." "1000 sq.ft." "592 sq.ft." ...
##  $ Nearby.School         : chr  "Sekolah Menengah Pendidikan Khas Cacat Penglihatan" "" "" "" ...
##  $ Nearby.Mall           : chr  "Setapak Central" "" "" "" ...
##  $ Ad.List               : int  98187451 101683090 103792905 103806240 103806234 103739787 103690767 103615852 103615849 102460346 ...
##  $ Category              : chr  "Apartment / Condominium, For sale" "Apartment / Condominium, For sale" "Apartment / Condominium, For sale" "Apartment / Condominium, For sale" ...
##  $ Facilities            : chr  "-" "Parking, Security, Swimming Pool, Playground, Barbeque area, Jogging Track" "Playground, Minimart, Jogging Track, Barbeque area, Parking, Security, Lift" "Parking, Playground, Minimart, Jogging Track" ...
##  $ Building.Name         : chr  "Kenwingston Platz" "Kenanga (Park View Court)" "Sri Lavender Apartment" "Flat Pandan Indah" ...
##  $ Developer             : chr  "Kenwingston Group" "-" "TLS Group" "-" ...
##  $ Tenure.Type           : chr  "Freehold" "Freehold" "Freehold" "Leasehold" ...
##  $ Address               : chr  "Jalan Gombak, Setapak, Kuala Lumpur" "Jalan Kenanga 3/8, Melaka City, Melaka" "Jalan Sepakat Indah 2/1, Taman Sepakat Indah 2, Kajang, Selangor" "jalan pandan indah 3/3, Selangor, Ampang" ...
##  $ Completion.Year       : chr  "-" "-" "2007" "-" ...
##  $ X..of.Floors          : chr  "-" "-" "13" "-" ...
##  $ Total.Units           : chr  "-" "-" "445" "-" ...
##  $ Property.Type         : chr  "Service Residence" "Apartment" "Apartment" "Flat" ...
##  $ Parking.Lot           : chr  "2" "1" "1" "1" ...
##  $ Floor.Range           : chr  "-" "Low" "Medium" "-" ...
##  $ Land.Title            : chr  "Non Bumi Lot" "Non Bumi Lot" "Non Bumi Lot" "Non Bumi Lot" ...
##  $ Firm.Type             : chr  "VE" "E" "-" "E" ...
##  $ Firm.Number           : chr  "30338" "30812" "-" "11584" ...
##  $ REN.Number            : chr  "-" "REN 15862" "-" "REN 16279" ...
##  $ Bus.Stop              : chr  "Bus Stop Starparc Point\nBus Stop Setapak Central\nBus Stop Setapak Sentral (Opp)\nBus Stop Columbia Hospital\n"| __truncated__ "" "" "" ...
##  $ Mall                  : chr  "Setapak Central" "" "" "" ...
##  $ Park                  : chr  "Park at Taman Tasik Danau Kota, Setapak, Kuala Lumpur, Malaysia\nPark at Taman Danau Kota, Setapak, Kuala Lumpu"| __truncated__ "" "" "" ...
##  $ School                : chr  "Sekolah Menengah Pendidikan Khas Cacat Penglihatan\nSekolah Kebangsaan Danau Kota\nSJK (C) Wangsa Maju\nKolej V"| __truncated__ "" "" "" ...
##  $ Hospital              : chr  "Columbia Asia Hospital" "" "" "" ...
##  $ price                 : chr  "RM 340 000" "RM 250 000" "RM 230 000" "RM 158 000" ...
##  $ Highway               : chr  "" "" "SILK Sg Ramal (T) Toll Plaza" "" ...
##  $ Nearby.Railway.Station: chr  "" "" "" "" ...
##  $ Railway.Station       : chr  "" "" "" "" ...

summary(df)

##  description          Bedroom            Bathroom         Property.Size     
##  Length:4000        Length:4000        Length:4000        Length:4000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Nearby.School      Nearby.Mall           Ad.List            Category        
##  Length:4000        Length:4000        Min.   : 30964923   Length:4000       
##  Class :character   Class :character   1st Qu.:102384201   Class :character  
##  Mode  :character   Mode  :character   Median :103350207   Mode  :character  
##                                        Mean   :102443246                     
##                                        3rd Qu.:103782293                     
##                                        Max.   :103806285                     
##   Facilities        Building.Name       Developer         Tenure.Type       
##  Length:4000        Length:4000        Length:4000        Length:4000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Address          Completion.Year    X..of.Floors       Total.Units       
##  Length:4000        Length:4000        Length:4000        Length:4000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Property.Type      Parking.Lot        Floor.Range         Land.Title       
##  Length:4000        Length:4000        Length:4000        Length:4000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   Firm.Type         Firm.Number         REN.Number          Bus.Stop        
##  Length:4000        Length:4000        Length:4000        Length:4000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      Mall               Park              School            Hospital        
##  Length:4000        Length:4000        Length:4000        Length:4000       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     price             Highway          Nearby.Railway.Station
##  Length:4000        Length:4000        Length:4000           
##  Class :character   Class :character   Class :character      
##  Mode  :character   Mode  :character   Mode  :character      
##                                                              
##                                                              
##                                                              
##  Railway.Station   
##  Length:4000       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

The dataset has a total of 4000 rows with 32 variables. Each variable is described as follows:

description: Full description for the unit listing.
Bedroom: Number of bedrooms in the unit.
Bathroom: Number of bathrooms in the unit.
Property Size: Size of the unit.
Nearby School: Nearby School, if any.
Nearby Mall: Nearby mall, if any.
Ad List: The ID of the listing on the website.
Category: The category of the listing. It will most likely be Apartment / Condominium, For Sale.
Facilities: The facilities that the apartment has, in a comma-separated list.
Building Name: Name of the building.
Developer: Developer for the building.
Tenure Type: Type of tenure for the building.
Address: Address of the building.
Completion Year: Completion year of the building. If the building is still under construction, this is listed as -.
Numbers of Floors: Number of floors in the building.
Total Units: Total number of units in the building.
Property Type: Type of property.
Parking Lot: Number of parking lots assigned to the unit, if any.
Floor Range: Floor range for the building.
Land Title: Title given to the land.
Firm Type: Type of firm who posted the listing.
Firm Number: ID of the firm who posted the listing.
REN Number: REN number of the firm who posted the listing.
Bus Stop: Nearby Bus Stop, if any.
Mall: Nearby mall, if any.
Park: Nearby park, if any.
School: Nearby school, if any.
Hospital: Nearby hospital, if any.
price: Price of the unit. Our target variable in objective a)
Highway: Nearby Highway, if any.
Nearby Railway Station: Nearby railway station, if any.
Railway Station: Nearby railway station, if any

2. Data Pre-processing

2.1 Data Cleaning

a) Standardization of Feature Naming

First letter of each feature is always capital letter.
Space between multi-word feature name is represented with _.

names(df)[names(df) == "description"] <- "Description"
names(df)[names(df) == "price"] <- "Price"
names(df) <- gsub("\\.", "_", names(df))

b) Detecting Null Rows

Based on ad list and description (both are mandatory for each unit listing to be unique and valid), there are no null rows detected.

df[df == ""] <- NA
par(mfrow = c(1, 2))


missmap(df[, 1:16], main = "First 16 variables")

missmap(df[, 17:32], main = "Last 16 variables")

df[df == "NA"] <- ""

c) Detecting Duplicated Rows

count_duplicated_rows <- sum(duplicated(df) & !duplicated(df, fromLast = TRUE))

cat("Number of duplicated rows:", count_duplicated_rows, "\n")

## Number of duplicated rows: 185

As can be seen from the output above, 185 duplicate rows were detected in the dataset, which all the features have the same values. Duplicate rows can skew the results of the analysis or model, take up unnecessary space, and slow down processing; therefore, it is beneficial to remove duplicate rows from the dataset to ensure accurate results and efficient processing.

df <- df[!(duplicated(df) & !duplicated(df, fromLast = TRUE)), ]
count_duplicated_rows <- sum(duplicated(df) & !duplicated(df, fromLast = TRUE))

cat("Number of duplicated rows:", count_duplicated_rows, "\n")

## Number of duplicated rows: 0

To ensure all rows are unique, further inspection are executed by checking if there are any rows with the same Ad_List number, which is supposed to be unique for each posting.

duplicated_Ad_List <- df[duplicated(df$Ad_List) | duplicated(df$Ad_List, fromLast = TRUE), ]
num_duplicated_Ad_List <- nrow(duplicated_Ad_List)
cat("Number of duplicated entries in Ad_List:", num_duplicated_Ad_List, "\n")

## Number of duplicated entries in Ad_List: 44

44 Duplicated Ad_List are detected. However, those with same Ad_List have different information in each feature, with one row of information more than the other. This is most likely due to human error when users create the postings on mudah.com Hence, the following are executed:

to create a column called feature_information to find the total number of information available for each row;and
delete the duplicated rows that has the least information

df$feature_information <- rowSums(!is.na(df[, !names(df) %in% "Ad.List"]))
df <- df %>%
  arrange(Ad_List, feature_information)

df <- df[!duplicated(df$Ad_List, fromLast = TRUE), ]
df <- subset(df, select = -feature_information)

2.2 Data Transformation

a) Facilities: The Facilities column is a list and will be transformed using one-hot encoding to analyze the presence/absence of each facility.

df <- df %>%
  separate_rows(Facilities, sep = ", ") %>%
  mutate(value = 1) %>%
  spread(Facilities, value, fill = 0)

To reduce dimensionality for list of facilities:

The column of “-” representing no facilities will be dropped as it can be predicted from the absence of other columns.
The column of 10 will be removed as it is believed to be a typing mistake from one of the facilities list.

df <- df %>% select(-`-`, -`10`)

b) Address: The state information will be extracted from the location column for our properties frequency map visualization.

# List of state names
states <- c("Johor", "Kedah", "Kelantan", "Kuala Lumpur", "Melaka", "Negeri Sembilan", 'Penang', "Pahang",
            "Perak", "Perlis", "Pulau Pinang", "Sabah", "Sarawak", "Selangor", "Terengganu", "Labuan", "Putrajaya")

# Function to get state
get_state <- function(x) {
  if (!is.na(x)) {
    x_list <- strsplit(x, ',')[[1]]
    stripped <- sapply(x_list, function(s) trimws(s))
    common_elements <- intersect(states, stripped)
    
    if (length(common_elements) > 0) {
      return(common_elements[1])
    } else {
      return(NA)
    }
  } else {
    return(NA)
  }
}

df$State <- sapply(df$Address, get_state)

Regarding the CompletionYear variable, converting it into a numeric format is recommended. Additionally, a flag should be added to indicate whether the development of the building has been completed or not.

c) Conversion to Numerical Columns

Several variables should ideally be numeric but are not in our current dataset. A thorough investigation is warranted to understand why these variables were not automatically classified as numeric. Subsequent manipulations or augmentations of these columns are required to ensure they become numeric.

1. Number of Bedroom and Number of Bathroom & Replacing Null values

Following codes that there is a row of data with Number of Bedroom and Bathroom as "-". Using Hot-Deck Imputation, the numm value of Bedroom and Bathroom are replaced with the first row of matched data with the csmae Property Size. This is done as logically every house must have bathroom and bedroom.

unique(df$Bedroom)

## [1] "4"  "3"  "6"  "5"  "1"  "2"  "8"  "10" "-"

unique(df$Bathroom)

## [1] "3" "2" "4" "7" "5" "1" "8" "6" "-"

number_of_rows <- nrow(df[df$Bedroom == "-" | df$Bathroom == "-", ])
cat("Number of rows where Bedroom or Bathroom is equal to '-':", number_of_rows, "\n")

## Number of rows where Bedroom or Bathroom is equal to '-': 1

for (i in 1:nrow(df)) {
  if (df$Bedroom[i] == "-" | df$Bathroom[i] == "-") {
    current_row <- df[i, ]
  
    matching_rows <- df[df$Property_Size == current_row$Property_Size & 
                        !(df$Bedroom == "-" & df$Bathroom == "-"), ]
  
    if (nrow(matching_rows) > 0) {
      matching_row <- matching_rows[1, ]
    
      replace_columns <- c("Bedroom", "Bathroom")
    
      df[i, replace_columns] <- matching_row[replace_columns]
    }
  }
}

print(unique(df$Bedroom))

## [1] "4"  "3"  "6"  "5"  "1"  "2"  "8"  "10"

print(unique(df$Bathroom))

## [1] "3" "2" "4" "7" "5" "1" "8" "6"

number_of_rows <- nrow(df[df$Bedroom == "-" | df$Bathroom == "-", ])
cat("Number of rows where Bedroom or Bathroom is equal to '-':", number_of_rows, "\n")

## Number of rows where Bedroom or Bathroom is equal to '-': 0

df$Bedroom <- as.integer(df$Bedroom)
df$Bathroom <- as.integer(df$Bathroom)

2. Price

df$Price <- gsub("[[:space:],RM]", "", df$Price)
df$Price <- as.numeric(df$Price)
print(head(unique(df$Price), 5))

## [1] 538000 350000 336000 448000 328000

 3. Property Size

df$Property_Size <- gsub("sq.ft.", "", df$Property_Size)

df$Property_Size <- as.integer(df$Property_Size)
print(sort(unique(df$Property_Size)))

##   [1]      1      9    280    295    318    383    388    391    400    404
##  [11]    407    409    410    416    418    420    422    430    434    438
##  [21]    442    447    450    452    454    455    460    463    464    465
##  [31]    466    467    468    476    477    480    484    494    495    497
##  [41]    500    506    510    512    513    514    516    517    519    520
##  [51]    521    522    525    526    527    530    531    533    535    538
##  [61]    540    541    542    547    548    549    550    551    552    553
##  [71]    554    557    560    565    570    571    572    573    575    576
##  [81]    578    579    580    581    590    592    600    603    604    605
##  [91]    606    608    609    610    613    614    615    618    619    620
## [101]    622    624    625    626    628    630    632    633    635    636
## [111]    640    642    645    646    648    649    650    652    654    655
## [121]    656    657    658    660    661    663    665    667    668    670
## [131]    671    675    676    678    679    680    682    685    686    687
## [141]    688    689    692    697    698    700    701    703    705    707
## [151]    708    710    711    712    714    715    717    719    720    721
## [161]    723    724    726    727    728    730    731    732    733    734
## [171]    735    736    737    738    740    742    743    745    746    748
## [181]    749    750    751    753    754    755    756    759    760    762
## [191]    764    765    768    769    770    771    773    774    775    776
## [201]    777    778    779    780    782    783    786    788    789    790
## [211]    791    792    794    795    797    798    799    800    801    802
## [221]    803    804    805    806    807    809    810    811    812    813
## [231]    814    815    816    818    819    820    825    826    828    829
## [241]    830    832    833    834    835    836    837    838    840    841
## [251]    842    843    844    845    847    848    849    850    851    852
## [261]    853    854    855    856    857    858    860    861    862    863
## [271]    864    865    866    867    868    870    871    872    873    874
## [281]    875    876    877    878    880    881    882    883    884    886
## [291]    887    888    890    891    893    894    895    897    899    900
## [301]    901    902    903    904    905    906    907    908    909    910
## [311]    911    912    913    914    915    916    917    918    920    921
## [321]    922    925    926    927    928    929    930    931    932    933
## [331]    934    935    936    938    939    940    941    943    944    945
## [341]    946    947    948    949    950    951    953    954    955    956
## [351]    957    958    960    961    963    964    965    966    967    969
## [361]    970    972    973    974    975    979    980    982    983    984
## [371]    986    987    989    990    991    992    993    994    995    996
## [381]    997    998    999   1000   1001   1002   1003   1004   1005   1006
## [391]   1007   1008   1009   1010   1011   1012   1015   1016   1017   1018
## [401]   1019   1020   1021   1022   1023   1024   1025   1028   1029   1030
## [411]   1033   1034   1035   1036   1037   1038   1039   1040   1041   1043
## [421]   1044   1045   1046   1047   1048   1049   1050   1051   1052   1053
## [431]   1054   1055   1056   1058   1059   1060   1062   1063   1065   1066
## [441]   1068   1069   1072   1073   1074   1075   1076   1077   1079   1080
## [451]   1081   1082   1084   1085   1087   1088   1090   1091   1092   1093
## [461]   1095   1096   1097   1098   1100   1101   1102   1103   1104   1105
## [471]   1106   1108   1109   1110   1111   1112   1113   1114   1115   1116
## [481]   1117   1118   1119   1120   1121   1123   1124   1126   1128   1130
## [491]   1131   1133   1135   1139   1140   1141   1142   1143   1144   1148
## [501]   1149   1150   1151   1152   1153   1155   1156   1159   1160   1161
## [511]   1162   1164   1165   1167   1168   1169   1170   1171   1173   1174
## [521]   1175   1177   1179   1180   1181   1182   1183   1184   1185   1186
## [531]   1188   1190   1191   1195   1196   1197   1198   1200   1202   1204
## [541]   1205   1206   1207   1210   1211   1212   1213   1214   1216   1219
## [551]   1221   1222   1223   1224   1225   1226   1227   1228   1230   1231
## [561]   1232   1233   1234   1237   1238   1239   1240   1243   1244   1246
## [571]   1249   1250   1251   1254   1257   1258   1259   1260   1261   1262
## [581]   1265   1267   1268   1270   1271   1272   1273   1274   1275   1276
## [591]   1277   1280   1281   1282   1283   1285   1288   1292   1294   1295
## [601]   1299   1300   1302   1304   1307   1309   1310   1311   1313   1314
## [611]   1315   1316   1319   1320   1323   1324   1328   1329   1331   1333
## [621]   1334   1335   1338   1340   1343   1345   1346   1349   1350   1352
## [631]   1353   1357   1363   1364   1365   1366   1367   1372   1376   1378
## [641]   1382   1384   1385   1386   1387   1389   1390   1392   1396   1398
## [651]   1399   1400   1403   1405   1406   1409   1410   1411   1417   1419
## [661]   1421   1422   1424   1431   1432   1433   1442   1445   1450   1453
## [671]   1458   1459   1463   1464   1465   1467   1469   1470   1473   1474
## [681]   1475   1477   1480   1483   1484   1485   1487   1488   1490   1492
## [691]   1493   1495   1496   1500   1504   1507   1508   1513   1515   1518
## [701]   1519   1520   1522   1523   1528   1530   1533   1538   1539   1540
## [711]   1547   1550   1556   1557   1561   1566   1567   1570   1579   1582
## [721]   1588   1590   1598   1600   1616   1626   1636   1642   1645   1646
## [731]   1650   1651   1654   1661   1667   1668   1670   1680   1690   1695
## [741]   1700   1701   1708   1710   1711   1722   1742   1750   1755   1765
## [751]   1766   1788   1800   1818   1819   1824   1828   1830   1849   1850
## [761]   1853   1862   1884   1894   1900   1905   1936   1946   1960   1968
## [771]   1982   1986   1991   2000   2020   2044   2045   2071   2080   2088
## [781]   2101   2124   2184   2185   2200   2214   2219   2233   2236   2240
## [791]   2250   2253   2266   2293   2300   2303   2304   2306   2314   2330
## [801]   2333   2335   2337   2368   2400   2415   2476   2509   2510   2524
## [811]   2526   2600   2650   2670   2723   2729   2750   2766   2800   2809
## [821]   2872   2874   2883   2896   2907   2987   3000   3070   3072   3200
## [831]   3350   3701   3735   3878   3900   4177   4300   4370   4500   4800
## [841]   5000   5619   6000   6016   9376   9800  14500  17611 122774

Based on the above, Property size of 1,9 and more than 10000 square feet is unlikely will be treated in outliers detection section later.

 4. Number of Floors

print(unique(df$X__of_Floors))

##  [1] "15"  "-"   "24"  "4"   "30"  "16"  "23"  "40"  "5"   "14"  "21"  "22" 
## [13] "19"  "10"  "20"  "17"  "12"  "27"  "18"  "3"   "13"  "46"  "26"  "36" 
## [25] "29"  "31"  "52"  "42"  "33"  "11"  "37"  "9"   "44"  "28"  "56"  "38" 
## [37] "25"  "8"   "6"   "34"  "32"  "35"  "49"  "2"   "43"  "39"  "45"  "435"
## [49] "7"   "41"  "50"  "58"  "63"  "55"  "500" "47"  "48"  "504" "135" "60" 
## [61] "486"

df$X__of_Floors <- as.integer(df$X__of_Floors)

average_x_of_floors <- as.integer(mean(df$X__of_Floors, na.rm = TRUE))
df$X__of_Floors[is.na(df$X__of_Floors)] <- average_x_of_floors

print(sort(unique(df$X__of_Floors)))

##  [1]   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
## [20]  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39
## [39]  40  41  42  43  44  45  46  47  48  49  50  52  55  56  58  60  63 135 435
## [58] 486 500 504

There is a row with “-” in the “number of floors” column. After changing the “number of floors” column to integer type, the “-” row is replaced with an average value. Based on the above, number of floors of more than 100 is unlikely will be treated in outliers detection section later.

5. Total Units

print(unique(df$Total_Units))

##   [1] "192"  "-"    "538"  "809"  "384"  "635"  "551"  "391"  "75"   "249" 
##  [11] "512"  "2115" "1360" "226"  "139"  "414"  "244"  "360"  "625"  "464" 
##  [21] "808"  "234"  "576"  "816"  "436"  "243"  "606"  "222"  "654"  "500" 
##  [31] "486"  "1235" "455"  "480"  "357"  "960"  "181"  "117"  "730"  "1200"
##  [41] "420"  "392"  "675"  "1120" "160"  "684"  "1343" "519"  "3"    "460" 
##  [51] "150"  "1036" "198"  "107"  "1338" "1100" "1000" "450"  "531"  "260" 
##  [61] "600"  "419"  "359"  "416"  "34"   "1633" "1932" "702"  "550"  "504" 
##  [71] "750"  "241"  "792"  "309"  "20"   "299"  "188"  "306"  "276"  "348" 
##  [81] "238"  "672"  "800"  "760"  "410"  "228"  "510"  "1571" "99"   "376" 
##  [91] "476"  "96"   "433"  "659"  "252"  "819"  "470"  "467"  "836"  "1342"
## [101] "322"  "632"  "128"  "1255" "190"  "539"  "595"  "189"  "396"  "548" 
## [111] "950"  "462"  "621"  "469"  "68"   "332"  "616"  "803"  "120"  "280" 
## [121] "540"  "326"  "295"  "182"  "711"  "1054" "109"  "763"  "112"  "1400"
## [131] "980"  "319"  "136"  "670"  "596"  "1008" "372"  "944"  "54"   "227" 
## [141] "771"  "744"  "787"  "542"  "240"  "900"  "7810" "184"  "312"  "91"  
## [151] "230"  "210"  "398"  "328"  "358"  "94"   "1064" "560"  "1188" "104" 
## [161] "196"  "2003" "1443" "406"  "286"  "175"  "412"  "403"  "218"  "378" 
## [171] "1050" "727"  "200"  "704"  "311"  "350"  "237"  "572"  "202"  "454" 
## [181] "536"  "310"  "553"  "397"  "1843" "89"   "229"  "225"  "706"  "528" 
## [191] "1128" "1168" "336"  "140"  "746"  "156"  "149"  "180"  "300"  "195" 
## [201] "100"  "1196" "712"  "303"  "1002" "3045" "400"  "2136" "1900" "669" 
## [211] "628"  "135"  "250"  "144"  "664"  "891"  "176"  "71"   "36"   "8"   
## [221] "3600" "522"  "52"   "552"  "981"  "82"   "203"  "90"   "588"  "98"  
## [231] "1074" "1220" "529"  "1440" "354"  "736"  "344"  "444"  "59"   "482" 
## [241] "67"   "324"  "1134" "418"  "1737" "435"  "1709" "448"  "608"  "316" 
## [251] "899"  "2155" "208"  "798"  "639"  "368"  "298"  "342"  "256"  "838" 
## [261] "115"  "832"  "122"  "1040" "17"   "134"  "634"  "534"  "474"  "558" 
## [271] "956"  "1012" "912"  "720"  "490"  "417"  "698"  "340"  "700"  "676" 
## [281] "40"   "452"  "521"  "811"  "352"  "766"  "1211" "445"  "946"  "493" 
## [291] "650"  "428"  "290"  "209"  "95"   "780"  "437"  "988"  "754"  "723" 
## [301] "284"  "1274" "148"  "142"  "168"  "592"  "537"  "1240" "1020" "1101"
## [311] "554"  "618"  "526"  "390"  "998"  "1311" "701"  "733"  "363"  "478" 
## [321] "580"  "177"  "205"  "224"  "60"   "186"  "72"   "204"  "147"  "690" 
## [331] "1106" "1115" "1"    "505"  "105"  "375"  "320"  "270"  "236"  "254" 
## [341] "1472" "272"  "496"  "475"  "216"  "1300" "385"  "1675" "265"  "2208"
## [351] "258"  "1339" "1148" "718"  "274"  "308"  "422"  "456"  "434"  "132" 
## [361] "2685" "938"  "80"   "1160" "294"  "76"   "630"  "1456" "304"  "220" 
## [371] "48"   "453"  "473"  "1598" "849"  "288"  "520"  "2647" "79"   "285" 
## [381] "2240" "715"  "1502" "4225" "488"  "382"  "613"  "408"  "438"  "440" 
## [391] "752"  "1378" "708"  "561"  "556"  "246"  "173"  "640"  "1205" "42"  
## [401] "297"  "356"  "932"  "296"  "50"   "525"  "114"  "565"  "1112" "2000"
## [411] "516"  "636"  "685"  "682"  "1632" "174"  "281"  "1013" "461"  "248" 
## [421] "472"  "1332" "84"   "199"  "631"  "292"  "424"  "102"  "239"  "648" 
## [431] "162"  "569"  "477"  "152"  "1929" "1022" "846"  "881"  "346"  "457" 
## [441] "1149" "369"  "1033" "916"  "666"  "1413" "831"  "734"  "973"  "374" 
## [451] "271"  "388"  "564"  "495"  "447"  "77"   "1424" "73"   "660"  "116" 
## [461] "247"  "585"  "1317" "629"  "594"  "756"  "492"  "302"  "1296" "961" 
## [471] "644"  "1052" "245"  "768"  "266"  "508"  "1163" "714"  "242"  "151" 
## [481] "108"  "825"  "443"  "405"  "1326" "65"   "547"  "2296" "215"  "259" 
## [491] "2509" "364"  "498"  "1516" "426"  "155"  "1388" "138"  "3224" "23"  
## [501] "1039" "212"

df$Total_Units <- as.integer(df$Total_Units)

average_Total_Units <- as.integer(mean(df$Total_Units, na.rm = TRUE))
df$Total_Units[is.na(df$Total_Units)] <- average_Total_Units

There is a row with “-” in the “total units” column. After changing the “total units” column to numeric type, the “-” row is replaced with an NA value.

 6. Completion Year

print(unique(df$Completion_Year))

##  [1] "-"    "2020" "2019" "2012" "2014" "2006" "2003" "2017" "2016" "1994"
## [11] "2013" "2018" "2007" "2010" "2005" "2015" "2021" "2000" "2009" "2023"
## [21] "1992" "2008" "1998" "2011" "2022" "2002" "1997" "1999" "2004" "1996"
## [31] "1991" "2025" "1987" "1995" "2024" "2026" "2001" "1990" "1989" "1993"
## [41] "1985"

df$Completion_Year <- as.integer(df$Completion_Year)

average_Completion_Year <- as.integer(mean(df$Completion_Year, na.rm = TRUE))
df$Completion_Year[is.na(df$Completion_Year)] <- average_Completion_Year

print(sort(unique(df$Completion_Year)))

##  [1] 1985 1987 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
## [16] 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
## [31] 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026

7. Parking Lot

print(unique(df$Parking_Lot))

## [1] "2"  "-"  "1"  "3"  "4"  "5"  "10" "6"  "9"

df$Parking_Lot <- gsub("-", '0', df$Parking_Lot)

df$Parking_Lot <- as.integer(df$Parking_Lot)

print(unique(df$Parking_Lot))

## [1]  2  0  1  3  4  5 10  6  9

####d) Detecting and Handling Outliers

columns_of_interest <- c("Bedroom", "Bathroom", "Property_Size","Price", "Completion_Year", "X__of_Floors", "Total_Units", "Parking_Lot")

par(mfrow = c(1, 2)) 

for (col in columns_of_interest) {
  boxplot(df[[col]], main = col, ylab = "Values")
}

Based on observations above, two columns with outliers need to be treated:

Property Size that is 1,9 and more than 10000 square feet in Malaysia is very unlikely.
number of floors that are more than 100 are very unheard of and most likely due to user input errors.

filtered_property_size <- df$Property_Size[df$Property_Size > 10 & df$Property_Size < 10000]

average_property_size <- as.integer(mean(filtered_property_size))

df$Property_Size[df$Property_Size < 10 | df$Property_Size > 10000] <- average_property_size

print(sort(unique(df$Property_Size)))

##   [1]  280  295  318  383  388  391  400  404  407  409  410  416  418  420  422
##  [16]  430  434  438  442  447  450  452  454  455  460  463  464  465  466  467
##  [31]  468  476  477  480  484  494  495  497  500  506  510  512  513  514  516
##  [46]  517  519  520  521  522  525  526  527  530  531  533  535  538  540  541
##  [61]  542  547  548  549  550  551  552  553  554  557  560  565  570  571  572
##  [76]  573  575  576  578  579  580  581  590  592  600  603  604  605  606  608
##  [91]  609  610  613  614  615  618  619  620  622  624  625  626  628  630  632
## [106]  633  635  636  640  642  645  646  648  649  650  652  654  655  656  657
## [121]  658  660  661  663  665  667  668  670  671  675  676  678  679  680  682
## [136]  685  686  687  688  689  692  697  698  700  701  703  705  707  708  710
## [151]  711  712  714  715  717  719  720  721  723  724  726  727  728  730  731
## [166]  732  733  734  735  736  737  738  740  742  743  745  746  748  749  750
## [181]  751  753  754  755  756  759  760  762  764  765  768  769  770  771  773
## [196]  774  775  776  777  778  779  780  782  783  786  788  789  790  791  792
## [211]  794  795  797  798  799  800  801  802  803  804  805  806  807  809  810
## [226]  811  812  813  814  815  816  818  819  820  825  826  828  829  830  832
## [241]  833  834  835  836  837  838  840  841  842  843  844  845  847  848  849
## [256]  850  851  852  853  854  855  856  857  858  860  861  862  863  864  865
## [271]  866  867  868  870  871  872  873  874  875  876  877  878  880  881  882
## [286]  883  884  886  887  888  890  891  893  894  895  897  899  900  901  902
## [301]  903  904  905  906  907  908  909  910  911  912  913  914  915  916  917
## [316]  918  920  921  922  925  926  927  928  929  930  931  932  933  934  935
## [331]  936  938  939  940  941  943  944  945  946  947  948  949  950  951  953
## [346]  954  955  956  957  958  960  961  963  964  965  966  967  969  970  972
## [361]  973  974  975  979  980  982  983  984  986  987  989  990  991  992  993
## [376]  994  995  996  997  998  999 1000 1001 1002 1003 1004 1005 1006 1007 1008
## [391] 1009 1010 1011 1012 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025
## [406] 1028 1029 1030 1033 1034 1035 1036 1037 1038 1039 1040 1041 1043 1044 1045
## [421] 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1058 1059 1060 1062
## [436] 1063 1065 1066 1068 1069 1072 1073 1074 1075 1076 1077 1079 1080 1081 1082
## [451] 1084 1085 1087 1088 1090 1091 1092 1093 1095 1096 1097 1098 1100 1101 1102
## [466] 1103 1104 1105 1106 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118
## [481] 1119 1120 1121 1123 1124 1126 1128 1130 1131 1133 1135 1139 1140 1141 1142
## [496] 1143 1144 1148 1149 1150 1151 1152 1153 1155 1156 1159 1160 1161 1162 1164
## [511] 1165 1167 1168 1169 1170 1171 1173 1174 1175 1177 1179 1180 1181 1182 1183
## [526] 1184 1185 1186 1188 1190 1191 1195 1196 1197 1198 1200 1202 1204 1205 1206
## [541] 1207 1210 1211 1212 1213 1214 1216 1219 1221 1222 1223 1224 1225 1226 1227
## [556] 1228 1230 1231 1232 1233 1234 1237 1238 1239 1240 1243 1244 1246 1249 1250
## [571] 1251 1254 1257 1258 1259 1260 1261 1262 1265 1267 1268 1270 1271 1272 1273
## [586] 1274 1275 1276 1277 1280 1281 1282 1283 1285 1288 1292 1294 1295 1299 1300
## [601] 1302 1304 1307 1309 1310 1311 1313 1314 1315 1316 1319 1320 1323 1324 1328
## [616] 1329 1331 1333 1334 1335 1338 1340 1343 1345 1346 1349 1350 1352 1353 1357
## [631] 1363 1364 1365 1366 1367 1372 1376 1378 1382 1384 1385 1386 1387 1389 1390
## [646] 1392 1396 1398 1399 1400 1403 1405 1406 1409 1410 1411 1417 1419 1421 1422
## [661] 1424 1431 1432 1433 1442 1445 1450 1453 1458 1459 1463 1464 1465 1467 1469
## [676] 1470 1473 1474 1475 1477 1480 1483 1484 1485 1487 1488 1490 1492 1493 1495
## [691] 1496 1500 1504 1507 1508 1513 1515 1518 1519 1520 1522 1523 1528 1530 1533
## [706] 1538 1539 1540 1547 1550 1556 1557 1561 1566 1567 1570 1579 1582 1588 1590
## [721] 1598 1600 1616 1626 1636 1642 1645 1646 1650 1651 1654 1661 1667 1668 1670
## [736] 1680 1690 1695 1700 1701 1708 1710 1711 1722 1742 1750 1755 1765 1766 1788
## [751] 1800 1818 1819 1824 1828 1830 1849 1850 1853 1862 1884 1894 1900 1905 1936
## [766] 1946 1960 1968 1982 1986 1991 2000 2020 2044 2045 2071 2080 2088 2101 2124
## [781] 2184 2185 2200 2214 2219 2233 2236 2240 2250 2253 2266 2293 2300 2303 2304
## [796] 2306 2314 2330 2333 2335 2337 2368 2400 2415 2476 2509 2510 2524 2526 2600
## [811] 2650 2670 2723 2729 2750 2766 2800 2809 2872 2874 2883 2896 2907 2987 3000
## [826] 3070 3072 3200 3350 3701 3735 3878 3900 4177 4300 4370 4500 4800 5000 5619
## [841] 6000 6016 9376 9800

Using mean imputation, outliers for property size (more than 10000 sq ft) are replaced via mean imputed property size.

filtered_floors <- df$X__of_Floors[df$X__of_Floors < 100]

average_floors <- as.integer(mean(filtered_floors))

df$X__of_Floors[df$X__of_Floors > 100] <- average_floors

print(sort(unique(df$X__of_Floors)))

##  [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## [26] 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 52
## [51] 55 56 58 60 63

Using mean imputation, outliers for number of floors (more than 100) are replaced via mean imputed number of floors.

e) Handling of Duplicated Columns

 1. Railway_Station and Nearby_Railway_Station

sum(!is.na(df$Nearby_Railway_Station))

## [1] 345

sum(!is.na(df$Railway_Station))

## [1] 451

df$Railway_Station[is.na(df$Railway_Station) & !is.na(df$Nearby_Railway_Station)] <- df$Nearby_Railway_Station

print(sum(!is.na(df$Railway_Station)))

## [1] 451

df$Nearby_Railway_Station <- NULL

After replacing values from the Nearby_Railway_Station column into the null rows of the Railway_Station column, the count of values in the Railway_Station column did not increase. This suggests that Nearby_Railway_Station is potentially an incomplete duplicate of the Railway_Station column. Duplicated columns provide no new information, may amplify the impact of the same features in modeling, slow down processing, and lead to modeling issues. Therefore, the Nearby_Railway_Station column is removed.

 2. Mall vs Nearby_Mall

sum(!is.na(df$Nearby_Mall))

## [1] 348

sum(!is.na(df$Mall))

## [1] 448

df$Mall[is.na(df$Mall) & !is.na(df$Nearby_Mall)] <- df$Nearby_Mall

print(sum(!is.na(df$Mall)))

## [1] 448

df$Nearby_Mall <- NULL

After replacing values from the Nearby_Mall column into the null rows of the Mall column, the count of values in the Mall column did not increase. This suggests that Nearby_Mall is potentially an incomplete duplicate of the Mall column. Duplicated columns provide no new information, may amplify the impact of the same features in modeling, slow down processing, and lead to modeling issues. Therefore, the Nearby_Mall column is removed.

 3. School vs Nearby_School

# Check and compare null values
sum(!is.na(df$Nearby_School))

## [1] 660

sum(!is.na(df$School))

## [1] 898

# Replace NA in School with values from Nearby_School
df$School[is.na(df$School) & !is.na(df$Nearby_School)] <- df$Nearby_School

# Print sum of non-NA values in School
print(sum(!is.na(df$School)))

## [1] 898

# Removing Nearby_School
df$Nearby_School <- NULL

After replacing values from the Nearby_School column into the null rows of the School column, the count of values in the School column did not increase. This suggests that Nearby_School is potentially an incomplete duplicate of the School column. Duplicated columns provide no new information, may amplify the impact of the same features in modeling, slow down processing, and lead to modeling issues. Therefore, the Nearby_School column is removed.

e) Encoding Categorical Columns

 1. One-Hot Encoding of School, Mall, Bus_Stop, Hospital, Highway, Railway_Station, Park (Availabilty vs Non-Availability)

# Replace NA with 0 and non-NA values with 1 for selected columns
columns_to_replace <- c("School", "Mall", "Bus_Stop", "Hospital", "Highway", "Railway_Station", "Park")

df <- df %>%
  mutate(across(all_of(columns_to_replace), ~replace(., is.na(.), 0))) %>%
  mutate(across(all_of(columns_to_replace), ~replace(., !is.na(.), 1)))

 2. One-Hot Encoding of Multiple categories (Property_Type, Land_Title, Tenure_Type)

# Assuming df is your data frame
df <- cbind(df, model.matrix(~ Property_Type + Land_Title + Tenure_Type - 1, data = df))

f) Geocoding Address for Map Visualization

#google_api<- "AIzaSyAukOytecunFP6VgmrpEQJQUiLl2gQAf_8"
#ggmap::register_google(key = google_api, write = TRUE)
# Geocode addresses to get latitude and longitude
#geocoded <- geocode(df$Address)

# Combine geocoded data with the original data frame
#df <- cbind(df, geocoded)

# Print the modified data frame
#print(df)

Consumption of Google Geocoding API, we obtain the spatial location based on available addresses.

###Saving Cleaned Dataset

write.csv(df, "houses_clean_geocoded.csv", row.names = FALSE)
file_path <- "C:\\Users\\User\\OneDrive\\Desktop\\Programming for Data Science\\houses_clean_geocoded.csv"
df1 <- read.csv(file_path)

3. Exploratory Data Analysis (EDA)

3.1 Word Cloud on Description

# Create a corpus from the Description column
corpus <- Corpus(VectorSource(df1$Description))

### 3. Exploratory Data Analysis (EDA)

#### 3.1 Word Cloud on Description

# Create a corpus from the Description column
corpus <- Corpus(VectorSource(df1$Description))

# Preprocess the text data
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)

# Remove links from text
corpus <- tm_map(corpus, content_transformer(function(x) gsub("http\\S+|www\\S+|https\\S+", "", x)))
# Remove emojis from text
corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, "UTF-8", "ASCII", sub = "")))

# Filter out empty documents
corpus <- tm_map(corpus, function(x) if (length(unlist(strsplit(as.character(x), " "))) > 0) x else removeWords(x, ""))

# Create a term-document matrix
tdm <- TermDocumentMatrix(corpus)

# Convert the term-document matrix to a matrix
m <- as.matrix(tdm)

# Calculate word frequencies
word_freqs <- sort(rowSums(m), decreasing = TRUE)

# Create a data frame with the word frequencies
word_freq_df <- data.frame(word = names(word_freqs), freq = word_freqs)

# Create a word cloud with adjustments
wordcloud(
  words = word_freq_df$word,
  freq = word_freq_df$freq,
  min.freq = 1000,  # Adjust the minimum frequency threshold
  max.words = 200,  # Limit the number of words displayed
  scale = c(3, 1),  # Adjust the scale to control the font size
  colors = brewer.pal(8, "Dark2")
)

Description is mainly used by ad owners to share their contact details as mudah.com for interested buyers to directly reach out to the ad owners rather than communicating at Mudah chat application.
Description is also used to highlight some special selling points of the residential units such as near access to malls and park as well as highlighting the ’freehold” land ownership tenure.

3.2 Univariate Analysis

     a) Availability of Facilities in Condominium

# Assuming your data frame is named df
facility_counts_subset <- colSums(df1[, 28 :42])

# Sort the facility counts in descending order
sorted_facility_counts <- sort(facility_counts_subset, decreasing = TRUE)

# Set the maximum value for the Y-axis to 3000
max_y <- 3000

# Create a bar plot for the subset with adjusted bar width and sorted counts
par(mar=c(6, 3, 4, 1))
barplot(sorted_facility_counts, 
        col = "skyblue", 
        main = "Availability of Facilities",
        ylab = "Count",
        names.arg = names(sorted_facility_counts),
        cex.names = 0.8,  # Adjust font size for axis labels
        las = 2,  # Rotate axis labels by 90 degrees
        width = 0.5,  # Adjust the width of the bars
        ylim = c(0, max_y)  # Set Y-axis limits
)

Top 3 most common facilities in Malaysian condominiums are parking, security, and playgrounds. This observation closely aligns with Malaysians’ way of life. Most Malaysians own cars as their primary mode of transportation, necessitating ample parking space. Security features are paramount to safeguarding the condominium environment and providing peace of mind to residents. Playgrounds allow residents to bring their children for leisure time after work hours within the comfort and safety of their own homes.
Tennis courts, clubhouses, and squash courts are among the least common facilities in Malaysian condominiums. This makes sense for several reasons. Tennis and squash are not the most popular sports among Malaysians. Additionally, due to hectic lifestyles, many Malaysians are unable to regularly utilize clubhouse facilities. These amenities are typically offered by higher-end condominiums due to their cost and maintenance requirements.
1. Top Condominium Developer

# Filter out rows where Developer is not '-'
developer_subset <- subset(df1, Developer != "-")

# Create a bar plot using base R barplot function
developer_counts <- table(developer_subset$Developer)

# Sort developer_counts in descending order
developer_counts_sorted <- developer_counts[order(-developer_counts)]

# Select only the top 10 values
top_developers <- developer_counts_sorted[1:10]

# Set up margins if needed
par(mar=c(10, 4, 1, 2))

# Plot the bar graph for the top 10 values
barplot(top_developers, 
        col = "skyblue", 
        main = "Top 10 Developers of Malaysia Condominiums",
        cex.names = 0.8,  # Adjust font size for axis labels
        las = 2  # Rotate axis labels by 90 degrees
)

c) Completion Year

# Filter out rows where Completion Year is not '-'
year_subset <- subset(df1, Completion_Year != "-")

# Create a table of counts
year_counts <- table(year_subset$Completion_Year)

# Convert to data frame for plotting
year_counts_df <- data.frame(Completion_Year = as.integer(names(year_counts)),
                              Count = as.integer(year_counts))

# Sort by Completion_Year
year_counts_df <- year_counts_df[order(year_counts_df$Completion_Year), ]

# Set up margins if needed
par(mar = c(6, 4, 2, 2))

# Plot the line chart with custom x-axis labels
plot(year_counts_df$Completion_Year, 
     year_counts_df$Count,
     type = "l",  # 'l' for lines
     col = "skyblue",
     xlab = "Completion Year",
     ylab = "Count",
     main = "Count of Malaysia Condominiums by Completion Year",
     cex.axis = 0.8,  # Adjust font size for axis labels
     las = 2,  # Rotate axis labels
     xaxt = "n"  # Don't plot x-axis labels yet
)

# Define custom x-axis labels
custom_labels <- c("1970", "1980", "1990", "2000", "2010", "2020")

# Add custom x-axis labels
axis(1, at = seq(1970, 2020, by = 10), labels = custom_labels)

From the graph, it can be seen that there is significant jump in the number of condominiums completed in 2010-2015. This is in alignment with Malaysia’s property boom period in early 2010s. This can be substantiated with the statistics in https://www.edgeprop.my/content/1901376/residential-property-market-performance-over-20-years, quoting “The most significant yearly change (in property price) is in 2012, with a 17.48% increase”.

3.3 Bivariate Analysis

    a) Top Condominium Listings by States

# Create a bar plot using base R barplot function
state_counts <- table(df1$State)

# Sort state_counts in descending order
state_counts_sorted <- state_counts[order(-state_counts)]

# Set up margins if needed
par(mar=c(6, 4, 2, 2))

# Plot the bar graph
barplot(state_counts_sorted, 
        col = "skyblue", 
        main = "Count of Malaysia Condominiums by State",
        cex.names = 0.8,  # Adjust font size for axis labels
        las = 2  # Rotate axis labels
)

The top three states with the most condominium listings are Selangor, Penang, and Kuala Lumpur. Coincidentally, these are also the top three most densely populated states in Malaysia. This suggests that the higher the population density, the greater the demand for high-rise residential units due to scarcer and more expensive land.
```
  b) Property Price by Number of Bathrooms and Bedrooms
```

# Step 1: Count the three most repeated states from previous analysis
top_states <- df1 %>%
  group_by(State) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:3)

# Step 2: Filter the data for the top three states
filtered_df_Top <- df1 %>% filter(State %in% top_states$State)

# Grouped column plot between top 3 states and average price with respect to number of bedrooms
ggplot(filtered_df_Top, aes(x = Bedroom, y = Price, fill = factor(State))) +
  stat_summary(fun = "mean", geom = "col", position = "dodge") +
  labs(title = "Bedroom vs Average Property Price",
       x = "Bedroom",
       y = "Average Property Price",
       fill = "State") +
  theme_minimal() +
  scale_x_discrete(limits = unique(filtered_df_Top$Bedroom))

# Grouped column plot between top 3 states and average price with respect to number of bathrooms
ggplot(filtered_df_Top, aes(x = Bathroom, y = Price, fill = factor(State))) +
  stat_summary(fun = "mean", geom = "col", position = "dodge") +
  labs(title = "Bathroom vs Average Property Price",
       x = "Bathroom",
       y = "Average property Price",
       fill = "State") +
  theme_minimal() +
  scale_x_discrete(limits = unique(filtered_df_Top$Bathroom))

3.3 Multivariate Analysis:

     a) Correlation Matrix

numeric_cols <- sapply(df1, is.numeric)
numeric_df1 <- df1[, numeric_cols]
# Select numeric and integer columns (excluding constants and "Ad-List")
constant_and_adlist_vars <- c("Ad-List", names(numeric_df1)[sapply(numeric_df1, function(x) length(unique(x)) == 1 | all(x %in% c(0, 1)))])
numeric_df1_no_const_adlist <- numeric_df1[, !(names(numeric_df1) %in% constant_and_adlist_vars)]

# Compute correlation matrix
correlation_matrix <- cor(numeric_df1_no_const_adlist, use = "complete.obs")  # "complete.obs" removes NA values

# Create correlation matrix plot with larger dimensions and bigger font size
library(corrplot)

png(filename = "mycorrplot.png", width = 1200, height = 800)
corrplot(correlation_matrix, method = "color", tl.cex = 1.2, addCoef.col = "black", number.cex = 1.2, tl.srt = 90,
         diag = FALSE, 
         tl.col = "black", tl.pos = "lt", width = 2000, height = 2000)
dev.off()

## png 
##   2

Based on the correlation matrix, the influencing factors on property prices can be ranked in descending order as follows: a) Property Size b) Number of Bathrooms c) Parking Lot d) Number of Bedrooms

      b) Property_size and price with respectively number of bedroom

ggplot(df1, aes(x = Property_Size, y = Price, color = factor(Bedroom))) +
  geom_point() +
  labs(title = "Price vs Property_Size",
       x = "Property_Size",
       y = "Price",
       color = "Bedroom") +
  theme_minimal()

The visualizations reveal a clear positive relationship among property size, the number of bedrooms, and property prices

      c) Property_size and price with respectively number of bedroom & bathroom

# Scatter plot between property_size and price with respectively number of bathroom
ggplot(df1, aes(x = Property_Size, y = Price, color = factor(Bathroom))) +
  geom_point() +
  labs(title = "Price vs Property_Size",
       x = "Property_Size",
       y = "Price",
       color = "Bathroom") +
  theme_minimal()

The visualizations reveal a clear positive relationship among property size, the number of bathrooms, and property prices

3.4 Further Data Pre-processing

3.4.1 Removal of Unnecessary Columns

Description: This column content includes links and emojis that may pose challenges for natural language processing (NLP) models . However, for the scope of this project, this feature will be removed and hence feature engineering on this aspect will not be undertaken.

df1 <- df1[, !grepl("^Description$", names(df1))]

Ad_List (Unique Identifier): Excluding unique identifiers is necessary to avoid overfitting as model might memorize the training data rather than learning underlying patterns

df1 <- df1[, !grepl("^Ad_List", names(df1))]

lon and lat: Excluding high variance columns for regression is necessary to avoid overfitting as model might memorize the training data rather than learning underlying patterns

df1 <- df1[, !(colnames(df1) %in% c("lon", "lat"))]

3.4.2 Transformation of Categorical Variable to Numerical Variable

Freehold or non Freehold

df1$Tenure_Type  <- ifelse(df1$Tenure_Type == "Freehold", 1, 0)

Property Type

df1$Property_Type <- as.factor(df1$Property_Type)
dummy_columns <- model.matrix(~Property_Type - 1, data = df1)
df1 <- cbind(df1, dummy_columns)

Floor Range

df1$Floor_Range <- as.factor(df1$Floor_Range)
dummy_columns <- model.matrix(~Floor_Range - 1, data = df1)
df1 <- cbind(df1, dummy_columns)

Land_Title

df1$Land_Title <- as.factor(df1$Land_Title)
dummy_columns <- model.matrix(~Land_Title - 1, data = df1)
df1 <- subset(df1, select = -Property_TypeApartment )

3.4.3 Standardize Features Naming

# Replace spaces with underscores in column names
colnames(df1) <- gsub(" ", "_", colnames(df1))

4. Regression

4.1 Multi-Linear Regression

The target variable is the “Price” where we aim the predict the house price based on the list of given variables

4.2. Get Numeric Data Only

numeric_cols <- sapply(df, is.numeric)
numeric_df1 <- df[, numeric_cols]

4.3 Split Train Test

set.seed(42)  # for reproducibility
index <- createDataPartition(numeric_df1$Price, p = 0.8, list = FALSE)
trainData <- numeric_df1[index, ]
testData <- numeric_df1[-index, ]

4.4 Defining Evaluation Metrics Function

#Defining  Evaluation metrics in a Function
evaluate_regression <- function(predictions, actual_values) {
  n <- length(actual_values)  # number of observations
  k <- length(predictions)    # number of predictors

  # R-squared
  rsquared <- cor(predictions, actual_values)^2

  # Mean Squared Error (MSE)
  mse <- mean((predictions - actual_values)^2)

  # Root Mean Squared Error (RMSE)
  rmse <- sqrt(mse)

  # Mean Absolute Error (MAE)
  mae <- mean(abs(predictions - actual_values))

  # Adjusted R-squared
  adj_r_squared <- 1 - ((1 - rsquared) * (n - 1) / (n - k - 1))

  # Return a list of metrics
  result <- list(
    rsquared = rsquared,
    mse = mse,
    rmse = rmse,
    mae = mae,
    adj_r_squared = adj_r_squared
  )

  return(result)
}

# Function to evaluate regression models on test and train data
evaluate_regression_models <- function(models_list, trainData, testData) {
  # Initialize an empty data frame to store results
  results_df <- data.frame()

  # Loop through each model
  for (model_name in names(models_list)) {
    # Get the model
    model <- models_list[[model_name]]

    # Make predictions on training data
    train_predictions <- as.vector(predict(model, trainData))
    # Get actual values for training data
    train_actual_values <- trainData$Price
    # Evaluate the model on training data
    train_results <- evaluate_regression(train_predictions, train_actual_values)

    # Make predictions on test data
    test_predictions <- as.vector(predict(model, testData))
    # Get actual values for test data
    test_actual_values <- testData$Price
    # Evaluate the model on test data
    test_results <- evaluate_regression(test_predictions, test_actual_values)

    # Add model results to the data frame
    results_df <- rbind(results_df, cbind(Model = model_name, Dataset = "Train", t(sapply(train_results, function(x) if (is.numeric(x)) round(x, 2) else x))))
    results_df <- rbind(results_df, cbind(Model = model_name, Dataset = "Test", t(sapply(test_results, function(x) if (is.numeric(x)) round(x, 2) else x))))
  }

  return(results_df)
}

4.5 Training & Evaluation of the Baseline Models

# Function to fit regression models
fit_regression_models <- function(train_data) {
  # Linear Regression
  lm_model <- lm(Price ~ ., data = trainData)

  # SVM Regression
  svm_model <- svm(Price ~ ., data = trainData)

  # GBM Regression
  gbm_model <- gbm(Price ~ ., data = trainData, distribution = "gaussian", n.trees = 100, interaction.depth = 4, shrinkage = 0.01)

  # Store models in a list
  models_list <- list(
    linear_regression = lm_model,
    svm_regression = svm_model,
    gbm_regression = gbm_model
  )

  return(models_list)
}

# Fit regression models on training data
regression_models_train <- fit_regression_models(trainData)

# Evaluate regression models on both training and test data
results_df <- evaluate_regression_models(regression_models_train, trainData, testData)

## Using 100 trees...
## 
## Using 100 trees...

# Print the results
print(results_df)

##               Model Dataset rsquared            mse      rmse       mae
## 1 linear_regression   Train     0.65 40384696968.94 200959.44 111483.03
## 2 linear_regression    Test     0.59 33894681701.83 184105.08 114317.09
## 3    svm_regression   Train     0.73 36702847966.97 191579.87  74262.34
## 4    svm_regression    Test     0.66 29667257952.38 172241.86  91635.13
## 5    gbm_regression   Train     0.67 52999248239.39 230215.66 131754.36
## 6    gbm_regression    Test     0.55 45361298149.97 212981.92 131281.63
##   adj_r_squared
## 1       1064.27
## 2        311.81
## 3        831.64
## 4        256.82
## 5       1005.72
## 6        337.67

4.7 Hyperparamter Tuning the Models

# Function to fit regression models with hyperparameter tuning
fit_regression_models_tuned <- function(train_data) {
  # Linear Regression
  lm_model <- lm(Price ~ ., data = train_data)

  # Suppress warnings for SVM hyperparameter tuning
  suppressWarnings({
    svm_tune_result <- tune(svm, Price ~ ., data = train_data,
                            ranges = list(epsilon = seq(0.1, 0.9, 0.2), cost = 2^(2:8)))
  })

  # Extract the best model from the tune result
  best_svm_model <- svm_tune_result$best.model

  # Suppress warnings for GBM
  suppressWarnings({
    gbm_model <- gbm(Price ~ ., data = train_data, distribution = "gaussian", n.trees = 100,
                     interaction.depth = 4, shrinkage = 0.01)
  })

  # Store models in a list
  models_list <- list(
    linear_regression = lm_model,
    svm_regression = best_svm_model,
    gbm_regression = gbm_model
  )

  return(models_list)
}

# Fit regression models with hyperparameter tuning on training data

regression_models_train_tuned <- fit_regression_models_tuned(trainData)

results_df_tuned <- evaluate_regression_models(regression_models_train_tuned, trainData, testData)

## Using 100 trees...
## 
## Using 100 trees...

print(results_df_tuned)

##               Model Dataset rsquared            mse      rmse       mae
## 1 linear_regression   Train     0.65 40384696968.94 200959.44 111483.03
## 2 linear_regression    Test     0.59 33894681701.83 184105.08 114317.09
## 3    svm_regression   Train     0.92 10090155093.91 100449.76  47042.87
## 4    svm_regression    Test     0.66 27849793336.09 166882.57   95101.9
## 5    gbm_regression   Train     0.67 53795473989.95 231938.51 131963.86
## 6    gbm_regression    Test     0.55 45697128038.49 213768.87 131443.46
##   adj_r_squared
## 1       1064.27
## 2        311.81
## 3        243.75
## 4        256.46
## 5       1007.45
## 6        342.66

4.1.6 Model Evaluation

To determine the best-performing model, we consider several metrics across both training and test datasets. The chosen metrics include R-squared, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Adjusted R-squared.

Based on the tuned models, the linear regression model outperforms both SVM regression and GBM regression in terms of balanced R-squared and Adjusted R-squared in both train and test dataset. It has the highest R-squared value (0.65) in train dataset and R-squared value (0.59) in test dataset

Linear regression model has the highest Adjusted R-squared (1064.27) in train dataset indicating strong predictive performance on the training set.

While the GBM regression model performs well on the training dataset, it shows a slightly lower R-squared value on the test dataset compared to linear regression. However, it still outperforms SVM regression in terms of R-squared and Adjusted R-squared.

Based on the evaluation metrics, the linear regression model is considered the best-performing model for this dataset. It demonstrates superior predictive capabilities, especially on the training dataset.

options(warn = 0)

5. Classfication

5.1. Classifying the condominiums based on Price Range

The condominiums are classified into three ranges based on the following:

affordable Condominiums (<RM250000)
Mid-range condominiums (RM250,001-RM500,000)
High-range condominiums (>RM500,001)

df2 <- data.frame(df1)

# Create a new column Price_Category
df2$Price_Category <- as.numeric(cut(df2$Price,
                                      breaks = c(-Inf, 250000, 500000, Inf),
                                      labels = c("1", "2", "3"),
                                      include.lowest = TRUE, right = FALSE))


# View the number of rows in each Price_Category
price_category_counts <- table(df2$Price_Category)

# Create a bar chart with y-axis limit set to 2000
barplot(price_category_counts, main = "Number of Rows in Each Price Category",
        xlab = "Price Category", ylab = "Number of Rows", col = "skyblue", border = "black",
        ylim = c(0, 2500), names.arg = c("Affordable", "Mid-Range", "High-Range"))

# Add text labels for each bar
text(x = barplot(price_category_counts, plot = FALSE), y = price_category_counts + 50, 
     labels = price_category_counts, col = "black", cex = 1.2, pos = 3)

From the graph above, it can be seen that medium category has a much higher count than the rest two categories, This means we have an imbalanced dataset and it might affect the classification models such that may become biased towards the majority class.

5.2 Split Train Test

df2 <- df2[, !colnames(df2) %in% c("Price")]
df2 <- df2[, sapply(df2, is.numeric)]
set.seed(42)
# Generate random indices for train and test sets
index <- sample(nrow(df2), 0.8 * nrow(df2))
trainData <- df2[index, ]
testData <- df2[-index, ]

5.3 Defining Evaluation Metrics Function

# Function to calculate evaluation metrics
calculate_metrics <- function(y_true, y_pred) {
  confusion_matrix <- table(y_true, y_pred)

  # Accuracy
  accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

  # Precision, Recall, F1-Score for each category
  precision <- diag(confusion_matrix) / rowSums(confusion_matrix)
  recall <- diag(confusion_matrix) / colSums(confusion_matrix)
  f1_score <- 2 * (precision * recall) / (precision + recall)

  # Create a data frame to store the results
  metrics_df <- data.frame(
    Accuracy = accuracy,
    Precision_Level_1 = precision[1],
    Precision_Level_2 = precision[2],
    Precision_Level_3 = precision[3],
    Recall_Level_1 = recall[1],
    Recall_Level_2 = recall[2],
    Recall_Level_3 = recall[3],
    F1_Score_Level_1 = f1_score[1],
    F1_Score_Level_2 = f1_score[2],
    F1_Score_Level_3 = f1_score[3]
  )

  return(metrics_df)
}

# Function to evaluate classification models on test and train data
evaluate_classification_models <- function(models_list, trainData, testData) {
  # Initialize an empty data frame to store results
  results_df <- data.frame()

  # Loop through each model
  for (model_name in names(models_list)) {
    # Get the model
    model <- models_list[[model_name]]

    # Make predictions on training data
    train_predictions <- predict(model, newdata = trainData, type = "class")
    # Get actual values for training data
    train_actual_values <- trainData$Price_Category
    # Evaluate the model on training data
    train_results <- calculate_metrics(train_actual_values, train_predictions)

    # Make predictions on test data
    test_predictions <- predict(model, newdata = testData, type = "class")
    # Get actual values for test data
    test_actual_values <- testData$Price_Category
    # Evaluate the model on test data
    test_results <- calculate_metrics(test_actual_values, test_predictions)

    # Add model results to the data frame
    results_df <- rbind(results_df, cbind(Model = model_name, Dataset = "Train", train_results))
    results_df <- rbind(results_df, cbind(Model = model_name, Dataset = "Test", test_results))
  }

  return(results_df)
}

5.4 Training & Evaluation of Model

# Fit a multinomial logistic regression model
multinom_model <- multinom(Price_Category ~ ., data = trainData)

## # weights:  126 (82 variable)
## initial  value 3333.189684 
## iter  10 value 2332.484443
## iter  20 value 2100.393169
## iter  30 value 1903.301975
## iter  40 value 1733.496101
## iter  50 value 1692.078640
## iter  60 value 1683.681846
## iter  70 value 1679.621698
## iter  80 value 1679.465345
## iter  90 value 1679.436559
## iter 100 value 1679.426556
## final  value 1679.426556 
## stopped after 100 iterations

# Create a list of models
models_list <- list(Multinomial_Logistic = multinom_model)

# Evaluate all models on test data
evaluation_results <- evaluate_classification_models(models_list, trainData, testData)

# Print the result
print(evaluation_results)

##                   Model Dataset  Accuracy Precision_Level_1 Precision_Level_2
## 1  Multinomial_Logistic   Train 0.7686223         0.6741259         0.8521082
## 11 Multinomial_Logistic    Test 0.7483531         0.7105263         0.8179487
##    Precision_Level_3 Recall_Level_1 Recall_Level_2 Recall_Level_3
## 1          0.6794521      0.7875817      0.7468285      0.8144499
## 11         0.6368715      0.7988166      0.7299771      0.7450980
##    F1_Score_Level_1 F1_Score_Level_2 F1_Score_Level_3
## 1         0.7264506        0.7960024        0.7408514
## 11        0.7520891        0.7714631        0.6867470

The multinomial logistic regression model performed reasonably well on both the training and test datasets. The model achieved an accuracy of approximately 76.70% on the training set and 73.91% on the test set. In terms of precision, recall, and F1-score for each class, the results indicate that the model demonstrated good performance across multiple evaluation metrics. However, there is a slight drop in performance when transitioning from the training to the test dataset, suggesting some generalization challenges that could be further explored and addressed for improved model robustness.

6. Conclusion & Future Works

In this project, two primary objectives were addressed: predicting condominium prices through regression and classifying condominiums into three categories (affordable, middle-range, and high-end) using classification models.

Regression models, including linear regression, SVM regression, and GBM regression, were evaluated for their predictive performance. The linear regression model emerged as the best-performing model with the highest R-squared (0.59) and Adjusted R-squared (1061.71) values, indicating strong predictive capabilities, especially on the training set. Although the GBM regression model performed well on the training set, it exhibited a slightly lower R-squared value on the test set compared to linear regression.

For classification tasks, the multinomial logistic regression model demonstrated satisfactory performance, achieving an accuracy of approximately 76.70% on the training set and 73.91% on the test set. Despite its overall good performance, there was a minor drop in metrics when transitioning from training to test datasets, indicating potential generalization challenges that require further investigation and refinement.

To address class imbalance, the use of SMOTE (Synthetic Minority Over-sampling Technique) was suggested. Implementing SMOTE can help balance class distribution, potentially enhancing the model’s ability to capture patterns in minority classes. Future work should focus on optimizing SMOTE parameters, exploring additional features, and refining model hyperparameters for improved predictive capabilities. The identified areas for improvement, coupled with ongoing research and model refinement, will contribute to the development of a robust and reliable predictive model for condominium prices in Malaysia.