Assignment 1

Introduction

Real Estate Market Analysis for Unfurnished 1 Bedroom Property in Kuala Lumpur

For this assignment, we will be exploring the real estate market in Kuala Lumpur. As such, we will be using a dataset on property listings in Kuala Lumpur obtained from Kaggle via the following link : https://www.kaggle.com/dragonduck/property-listings-in-kuala-lumpur This dataset represents property listings across Kuala Lumpur with eight attributes as follows :

1. Location - Area where property is located within Kuala Lumpur
2. Price - Property market price in Ringgit Malaysia
3. Rooms - Number of rooms for a given property
4. Bathrooms - Number of bathrooms for a given property
5. Car.Parks - Number of car parks provided in a shared parking area, porch or garage
6. Property.Type - Type of property (i.e Condominium, Soho)
7. Size - Size in square feet of a given property
8. Furnishing - Indicates wheter a given property is furnished or unfurnished.

There are 53874 observations in this dataset. For simplicity, we will only select 1 bedroom, unfurnished property as a subset of our larger data. We will also ignore the different size types between Land Area and Built-up.

Packages Info

Installing Tidyverse

For our analysis, we will install tidyverse which is a collection of packages used for data analysis from https://www.tidyverse.org/packages/. Tidyverse consists of the following packages : ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr and forcats.We will mainly focus on tidyr, dplyr and ggplot2.

#install.packages("tidyverse")
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Tidyverse Packages

tidyr : used to create tidy data
diplyr : used to manipulate data
ggplot2 : used to visualize data

RegEx

Additionally, we will also explore the use of RegEx to help clean the data. The RegEx that we will be using is \(.?\) * where,

Symbols *? finds 0 or more spaces before (and after) the parentheses.
Symbols \ C and \ ) acts as an escape for ( and ) since they are special symbols in a RegEx

Data Preparation

Importing Dataset and Subsetting Data

First, we should set up a working directory for our data using “setwd”. To check if we are using the right directory, we can use the function “getwd”.

setwd("/Users/ainaanajihah/Documents/Masters/7004-Programming in Data Science/Assignment 1")
getwd()

## [1] "/Users/ainaanajihah/Documents/Masters/7004-Programming in Data Science/Assignment 1"

Next, we will load our dataset which is in .csv file format, using the function “read.csv”. This dataset will be stored in dataframe format in R.We will also extract a subset of the whole dataset by filtering out properties with 1 bedroom and no furnishing.

kl<-read.csv("data_kaggle_property_kl.csv", header = TRUE, sep = ",")
kl_df<-filter(kl,Rooms=="1" & Furnishing =="Unfurnished")
head(kl_df)

##                      Location      Price Rooms Bathrooms Car.Parks
## 1 Bukit Bintang, Kuala Lumpur RM750,000      1         1         1
## 2 Bukit Bintang, Kuala Lumpur RM800,000      1         1         1
## 3   KL Eco City, Kuala Lumpur RM835,000      1         1         1
## 4       KL City, Kuala Lumpur RM810,000      1         1        NA
## 5 Bukit Bintang, Kuala Lumpur RM765,000      1         1         1
## 6    Mont Kiara, Kuala Lumpur RM570,000      1         1        NA
##                       Property.Type                   Size  Furnishing
## 1                       Condominium Built-up : 538 sq. ft. Unfurnished
## 2                       Condominium Built-up : 538 sq. ft. Unfurnished
## 3                       Condominium Built-up : 700 sq. ft. Unfurnished
## 4 Serviced Residence (Intermediate) Built-up : 732 sq. ft. Unfurnished
## 5                       Condominium Built-up : 538 sq. ft. Unfurnished
## 6                Serviced Residence Built-up : 678 sq. ft. Unfurnished

Analyze the basic structure of KL Property dataset.

str(kl_df)

## 'data.frame':    50 obs. of  8 variables:
##  $ Location     : chr  "Bukit Bintang, Kuala Lumpur" "Bukit Bintang, Kuala Lumpur" "KL Eco City, Kuala Lumpur" "KL City, Kuala Lumpur" ...
##  $ Price        : chr  "RM750,000 " "RM800,000 " "RM835,000 " "RM810,000 " ...
##  $ Rooms        : chr  "1" "1" "1" "1" ...
##  $ Bathrooms    : int  1 1 1 1 1 1 1 2 1 1 ...
##  $ Car.Parks    : int  1 1 1 NA 1 NA NA 1 NA 1 ...
##  $ Property.Type: chr  "Condominium" "Condominium" "Condominium" "Serviced Residence (Intermediate)" ...
##  $ Size         : chr  "Built-up : 538 sq. ft." "Built-up : 538 sq. ft." "Built-up : 700 sq. ft." "Built-up : 732 sq. ft." ...
##  $ Furnishing   : chr  "Unfurnished" "Unfurnished" "Unfurnished" "Unfurnished" ...

Data Cleaning and Formatting

From this analysis, we see that the data is in two main format which are character and integer. We would like to clean our data in two parts as follows.

Character : For “Location”, Property Type“,”Furnishing“, we would want to leave it as character format. The two variable that we would like to clean are”Location" and “Property Type”
- Location : Since all of our properties are in Kuala Lumpur, we would like to remove “,Kuala Lumpur” so that our data would only show the areas within KL.
- Property Type : For this, we would want to simplify property types and remove all items in the parentheses. RegEx can be used to capture items within parentheses.
Integer : For “Rooms”, “Bathrooms” and “Car Parks”, we would also leave it as it is. The two variable that we would want in numerical formats are “Price” and “Size”.

kl_df1<-kl_df %>% 
  mutate(across(all_of("Location"), ~gsub(", Kuala Lumpur", "",.))) %>% 
  mutate(across(all_of("Property.Type"), ~gsub(" *\\(.*?\\) *", "",.))) %>% 
  mutate(across(starts_with("Price"), ~gsub("[RM,]", "",.) %>% as.numeric)) %>%  
  separate(.,col="Size",into = c("Size.Type","Size"),sep=":") %>%
  mutate(across(all_of("Size"), ~gsub(" sq. ft.", "",.) %>% as.numeric))

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [37, 40,
## 50].

head(kl_df1)

##        Location  Price Rooms Bathrooms Car.Parks      Property.Type Size.Type
## 1 Bukit Bintang 750000     1         1         1        Condominium Built-up 
## 2 Bukit Bintang 800000     1         1         1        Condominium Built-up 
## 3   KL Eco City 835000     1         1         1        Condominium Built-up 
## 4       KL City 810000     1         1        NA Serviced Residence Built-up 
## 5 Bukit Bintang 765000     1         1         1        Condominium Built-up 
## 6    Mont Kiara 570000     1         1        NA Serviced Residence Built-up 
##   Size  Furnishing
## 1  538 Unfurnished
## 2  538 Unfurnished
## 3  700 Unfurnished
## 4  732 Unfurnished
## 5  538 Unfurnished
## 6  678 Unfurnished

Simplifying Dataset

Now we have “Price” and “Size” both in numerical format.Next, we can simplify our table by removing columns that do not carry meaningful values. Since we know that our data represents all unfurnished one bedroom apartments across KL, the column “Rooms” and “Furnishing” can be removed from our dataset.We can also ignore Size Type as stated in “Introduction” tab. Additionally, we would also want to remove “Car Parks” for this analysis since our focus is mainly on property price, size and location.

kl_df1<-subset(kl_df1, select =-c(Rooms,Furnishing,Car.Parks,Size.Type))
head(kl_df1)

##        Location  Price Bathrooms      Property.Type Size
## 1 Bukit Bintang 750000         1        Condominium  538
## 2 Bukit Bintang 800000         1        Condominium  538
## 3   KL Eco City 835000         1        Condominium  700
## 4       KL City 810000         1 Serviced Residence  732
## 5 Bukit Bintang 765000         1        Condominium  538
## 6    Mont Kiara 570000         1 Serviced Residence  678

Removing Missing Data

Our new dataset now has only 5 variables, all in the right format. Finally, let’s remove missing values from our dataset.

colSums(is.na(kl_df1))

##      Location         Price     Bathrooms Property.Type          Size 
##             0             0             1             0             3

kl_df1<-kl_df1[complete.cases(kl_df1),]
head(kl_df1)

##        Location  Price Bathrooms      Property.Type Size
## 1 Bukit Bintang 750000         1        Condominium  538
## 2 Bukit Bintang 800000         1        Condominium  538
## 3   KL Eco City 835000         1        Condominium  700
## 4       KL City 810000         1 Serviced Residence  732
## 5 Bukit Bintang 765000         1        Condominium  538
## 6    Mont Kiara 570000         1 Serviced Residence  678

str(kl_df1)

## 'data.frame':    46 obs. of  5 variables:
##  $ Location     : chr  "Bukit Bintang" "Bukit Bintang" "KL Eco City" "KL City" ...
##  $ Price        : num  750000 800000 835000 810000 765000 570000 324000 530000 770000 470000 ...
##  $ Bathrooms    : int  1 1 1 1 1 1 1 2 1 1 ...
##  $ Property.Type: chr  "Condominium" "Condominium" "Condominium" "Serviced Residence" ...
##  $ Size         : num  538 538 700 732 538 678 500 715 1300 705 ...

We are left with a dataset consisting of 5 variables and 50 observations, ready for data analysis!

Data Analysis

Data Summary

Let’s look at the overall summary of the dataset using the the function summary()

summary(kl_df1)

##    Location             Price           Bathrooms     Property.Type     
##  Length:46          Min.   : 200000   Min.   :1.000   Length:46         
##  Class :character   1st Qu.: 400000   1st Qu.:1.000   Class :character  
##  Mode  :character   Median : 562500   Median :1.000   Mode  :character  
##                     Mean   : 577978   Mean   :1.217                     
##                     3rd Qu.: 750000   3rd Qu.:1.000                     
##                     Max.   :1050000   Max.   :2.000                     
##       Size       
##  Min.   : 430.0  
##  1st Qu.: 619.5  
##  Median : 710.0  
##  Mean   : 710.7  
##  3rd Qu.: 767.8  
##  Max.   :1300.0

Numerical attributes gave us the mean and Tukey5-number summary. From this quick analysis, we see that the that the data distributions for Price and Size are almost perfect normal distributions since their mean and median and relatively close to each other. Now lets plot histograms for both attributes.

Histogram | Visualization for One Variable - continuous x

ggplot(data=kl_df1, aes(Price)) + 
  geom_histogram()+theme_gray()+
  labs(title="Histogram for Price of Studio Apartment in KL")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=kl_df1, aes(Size)) + 
  geom_histogram()+theme_gray()+
  labs(title="Histogram for Size of Studio Apartment in KL")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histograms confirms our theory on property Size. However, the histogram reveals a bimodal data distribution for property Price. This is why it is important to visualize our data distribution. Now we would like to explore the relationship between price and area to see if there is any linearity.

Scatter Plot with Smoother | Visualization for Two Variables - continuous x, continuous y

ggplot(data=kl_df1, aes(Price,Size)) + 
  geom_jitter()+
  geom_smooth(method=loess)+
  theme_gray()+
  labs(title="Correlation between property price and area")

## `geom_smooth()` using formula 'y ~ x'

From this graph, we see that while there is some positive relationship between property price and area, the relationship is non-linear. The smoother line and band shows that for most of the property, its price increase with an increase in size, although it is not linear and is not applicable for some properties.

Next, lets analyse the correlation between median price and location to see the “hotspots” in Kuala Lumpur with high property price.

Bar chart | Visualization for Two Variables - discrete x, continuous y

ggplot(data=kl_df1, aes(Location,Price)) + 
  geom_bar(fun = "median", stat = "summary") +
  theme_gray()+
  theme(axis.text.x = element_text(angle = 90))+
  labs(title="Correlation between property median price and location")

From the bar charts, we can infer that properties in KLCC is the most expensive in Kuala Lumpur followed by KL Eco City and KL City. The least expensive places for properties are Sri Petaling and Sri Hartamas.Finally, lets look at the correlation between property type and location.

Count Chart | Visualization for Two Variables - discrete x, discrete y

ggplot(data=kl_df1, aes(Property.Type,Location)) + 
  geom_count()+
  theme_gray()+
  labs(title="Correlation between property type and location")

Overall, the graph tells us that most 1 bedroom unfurnished properties in Kuala Lumpur are Serviced Residences.Our hotspots area have almost excusively Serviced Residence types. Another interesting finding is that Bukit Bintang has only Condominiums for unfurnished 1 bedroom property.

References

References

S, J. (2019, July 4). Property Listings in Kuala Lumpur. Kaggle. https://www.kaggle.com/dragonduck/property-listings-in-kuala-lumpur.
Tidyverse packages. Tidyverse. (n.d.). https://www.tidyverse.org/packages/.
Rstudio. (n.d.). rstudio/cheatsheets. GitHub. https://github.com/rstudio/cheatsheets.