Real Estate Market Analysis for Unfurnished 1 Bedroom Property in Kuala Lumpur
For this assignment, we will be exploring the real estate market in Kuala Lumpur. As such, we will be using a dataset on property listings in Kuala Lumpur obtained from Kaggle via the following link : https://www.kaggle.com/dragonduck/property-listings-in-kuala-lumpur This dataset represents property listings across Kuala Lumpur with eight attributes as follows :
1. Location - Area where property is located within Kuala Lumpur
2. Price - Property market price in Ringgit Malaysia
3. Rooms - Number of rooms for a given property
4. Bathrooms - Number of bathrooms for a given property
5. Car.Parks - Number of car parks provided in a shared parking area, porch or garage
6. Property.Type - Type of property (i.e Condominium, Soho)
7. Size - Size in square feet of a given property
8. Furnishing - Indicates wheter a given property is furnished or unfurnished.
There are 53874 observations in this dataset. For simplicity, we will only select 1 bedroom, unfurnished property as a subset of our larger data. We will also ignore the different size types between Land Area and Built-up.
Installing Tidyverse
For our analysis, we will install tidyverse which is a collection of packages used for data analysis from https://www.tidyverse.org/packages/. Tidyverse consists of the following packages : ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr and forcats.We will mainly focus on tidyr, dplyr and ggplot2.
#install.packages("tidyverse")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Tidyverse Packages
RegEx
Additionally, we will also explore the use of RegEx to help clean the data. The RegEx that we will be using is \(.?\) * where,
_
Importing Dataset and Subsetting Data
First, we should set up a working directory for our data using “setwd”. To check if we are using the right directory, we can use the function “getwd”.
setwd("/Users/ainaanajihah/Documents/Masters/7004-Programming in Data Science/Assignment 1")
getwd()
## [1] "/Users/ainaanajihah/Documents/Masters/7004-Programming in Data Science/Assignment 1"
Next, we will load our dataset which is in .csv file format, using the function “read.csv”. This dataset will be stored in dataframe format in R.We will also extract a subset of the whole dataset by filtering out properties with 1 bedroom and no furnishing.
kl<-read.csv("data_kaggle_property_kl.csv", header = TRUE, sep = ",")
kl_df<-filter(kl,Rooms=="1" & Furnishing =="Unfurnished")
head(kl_df)
## Location Price Rooms Bathrooms Car.Parks
## 1 Bukit Bintang, Kuala Lumpur RM750,000 1 1 1
## 2 Bukit Bintang, Kuala Lumpur RM800,000 1 1 1
## 3 KL Eco City, Kuala Lumpur RM835,000 1 1 1
## 4 KL City, Kuala Lumpur RM810,000 1 1 NA
## 5 Bukit Bintang, Kuala Lumpur RM765,000 1 1 1
## 6 Mont Kiara, Kuala Lumpur RM570,000 1 1 NA
## Property.Type Size Furnishing
## 1 Condominium Built-up : 538 sq. ft. Unfurnished
## 2 Condominium Built-up : 538 sq. ft. Unfurnished
## 3 Condominium Built-up : 700 sq. ft. Unfurnished
## 4 Serviced Residence (Intermediate) Built-up : 732 sq. ft. Unfurnished
## 5 Condominium Built-up : 538 sq. ft. Unfurnished
## 6 Serviced Residence Built-up : 678 sq. ft. Unfurnished
Analyze the basic structure of KL Property dataset.
str(kl_df)
## 'data.frame': 50 obs. of 8 variables:
## $ Location : chr "Bukit Bintang, Kuala Lumpur" "Bukit Bintang, Kuala Lumpur" "KL Eco City, Kuala Lumpur" "KL City, Kuala Lumpur" ...
## $ Price : chr "RM750,000 " "RM800,000 " "RM835,000 " "RM810,000 " ...
## $ Rooms : chr "1" "1" "1" "1" ...
## $ Bathrooms : int 1 1 1 1 1 1 1 2 1 1 ...
## $ Car.Parks : int 1 1 1 NA 1 NA NA 1 NA 1 ...
## $ Property.Type: chr "Condominium" "Condominium" "Condominium" "Serviced Residence (Intermediate)" ...
## $ Size : chr "Built-up : 538 sq. ft." "Built-up : 538 sq. ft." "Built-up : 700 sq. ft." "Built-up : 732 sq. ft." ...
## $ Furnishing : chr "Unfurnished" "Unfurnished" "Unfurnished" "Unfurnished" ...
Data Cleaning and Formatting
From this analysis, we see that the data is in two main format which are character and integer. We would like to clean our data in two parts as follows.
kl_df1<-kl_df %>%
mutate(across(all_of("Location"), ~gsub(", Kuala Lumpur", "",.))) %>%
mutate(across(all_of("Property.Type"), ~gsub(" *\\(.*?\\) *", "",.))) %>%
mutate(across(starts_with("Price"), ~gsub("[RM,]", "",.) %>% as.numeric)) %>%
separate(.,col="Size",into = c("Size.Type","Size"),sep=":") %>%
mutate(across(all_of("Size"), ~gsub(" sq. ft.", "",.) %>% as.numeric))
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [37, 40,
## 50].
head(kl_df1)
## Location Price Rooms Bathrooms Car.Parks Property.Type Size.Type
## 1 Bukit Bintang 750000 1 1 1 Condominium Built-up
## 2 Bukit Bintang 800000 1 1 1 Condominium Built-up
## 3 KL Eco City 835000 1 1 1 Condominium Built-up
## 4 KL City 810000 1 1 NA Serviced Residence Built-up
## 5 Bukit Bintang 765000 1 1 1 Condominium Built-up
## 6 Mont Kiara 570000 1 1 NA Serviced Residence Built-up
## Size Furnishing
## 1 538 Unfurnished
## 2 538 Unfurnished
## 3 700 Unfurnished
## 4 732 Unfurnished
## 5 538 Unfurnished
## 6 678 Unfurnished
Simplifying Dataset
Now we have “Price” and “Size” both in numerical format.Next, we can simplify our table by removing columns that do not carry meaningful values. Since we know that our data represents all unfurnished one bedroom apartments across KL, the column “Rooms” and “Furnishing” can be removed from our dataset.We can also ignore Size Type as stated in “Introduction” tab. Additionally, we would also want to remove “Car Parks” for this analysis since our focus is mainly on property price, size and location.
kl_df1<-subset(kl_df1, select =-c(Rooms,Furnishing,Car.Parks,Size.Type))
head(kl_df1)
## Location Price Bathrooms Property.Type Size
## 1 Bukit Bintang 750000 1 Condominium 538
## 2 Bukit Bintang 800000 1 Condominium 538
## 3 KL Eco City 835000 1 Condominium 700
## 4 KL City 810000 1 Serviced Residence 732
## 5 Bukit Bintang 765000 1 Condominium 538
## 6 Mont Kiara 570000 1 Serviced Residence 678
Removing Missing Data
Our new dataset now has only 5 variables, all in the right format. Finally, let’s remove missing values from our dataset.
colSums(is.na(kl_df1))
## Location Price Bathrooms Property.Type Size
## 0 0 1 0 3
kl_df1<-kl_df1[complete.cases(kl_df1),]
head(kl_df1)
## Location Price Bathrooms Property.Type Size
## 1 Bukit Bintang 750000 1 Condominium 538
## 2 Bukit Bintang 800000 1 Condominium 538
## 3 KL Eco City 835000 1 Condominium 700
## 4 KL City 810000 1 Serviced Residence 732
## 5 Bukit Bintang 765000 1 Condominium 538
## 6 Mont Kiara 570000 1 Serviced Residence 678
str(kl_df1)
## 'data.frame': 46 obs. of 5 variables:
## $ Location : chr "Bukit Bintang" "Bukit Bintang" "KL Eco City" "KL City" ...
## $ Price : num 750000 800000 835000 810000 765000 570000 324000 530000 770000 470000 ...
## $ Bathrooms : int 1 1 1 1 1 1 1 2 1 1 ...
## $ Property.Type: chr "Condominium" "Condominium" "Condominium" "Serviced Residence" ...
## $ Size : num 538 538 700 732 538 678 500 715 1300 705 ...
We are left with a dataset consisting of 5 variables and 50 observations, ready for data analysis!
_
Data Summary
Let’s look at the overall summary of the dataset using the the function summary()
summary(kl_df1)
## Location Price Bathrooms Property.Type
## Length:46 Min. : 200000 Min. :1.000 Length:46
## Class :character 1st Qu.: 400000 1st Qu.:1.000 Class :character
## Mode :character Median : 562500 Median :1.000 Mode :character
## Mean : 577978 Mean :1.217
## 3rd Qu.: 750000 3rd Qu.:1.000
## Max. :1050000 Max. :2.000
## Size
## Min. : 430.0
## 1st Qu.: 619.5
## Median : 710.0
## Mean : 710.7
## 3rd Qu.: 767.8
## Max. :1300.0
Numerical attributes gave us the mean and Tukey5-number summary. From this quick analysis, we see that the that the data distributions for Price and Size are almost perfect normal distributions since their mean and median and relatively close to each other. Now lets plot histograms for both attributes.
Histogram | Visualization for One Variable - continuous x
ggplot(data=kl_df1, aes(Price)) +
geom_histogram()+theme_gray()+
labs(title="Histogram for Price of Studio Apartment in KL")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=kl_df1, aes(Size)) +
geom_histogram()+theme_gray()+
labs(title="Histogram for Size of Studio Apartment in KL")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histograms confirms our theory on property Size. However, the histogram reveals a bimodal data distribution for property Price. This is why it is important to visualize our data distribution. Now we would like to explore the relationship between price and area to see if there is any linearity.
Scatter Plot with Smoother | Visualization for Two Variables - continuous x, continuous y
ggplot(data=kl_df1, aes(Price,Size)) +
geom_jitter()+
geom_smooth(method=loess)+
theme_gray()+
labs(title="Correlation between property price and area")
## `geom_smooth()` using formula 'y ~ x'
From this graph, we see that while there is some positive relationship between property price and area, the relationship is non-linear. The smoother line and band shows that for most of the property, its price increase with an increase in size, although it is not linear and is not applicable for some properties.
Next, lets analyse the correlation between median price and location to see the “hotspots” in Kuala Lumpur with high property price.
Bar chart | Visualization for Two Variables - discrete x, continuous y
ggplot(data=kl_df1, aes(Location,Price)) +
geom_bar(fun = "median", stat = "summary") +
theme_gray()+
theme(axis.text.x = element_text(angle = 90))+
labs(title="Correlation between property median price and location")
From the bar charts, we can infer that properties in KLCC is the most expensive in Kuala Lumpur followed by KL Eco City and KL City. The least expensive places for properties are Sri Petaling and Sri Hartamas.Finally, lets look at the correlation between property type and location.
Count Chart | Visualization for Two Variables - discrete x, discrete y
ggplot(data=kl_df1, aes(Property.Type,Location)) +
geom_count()+
theme_gray()+
labs(title="Correlation between property type and location")
Overall, the graph tells us that most 1 bedroom unfurnished properties in Kuala Lumpur are Serviced Residences.Our hotspots area have almost excusively Serviced Residence types. Another interesting finding is that Bukit Bintang has only Condominiums for unfurnished 1 bedroom property.
References
S, J. (2019, July 4). Property Listings in Kuala Lumpur. Kaggle. https://www.kaggle.com/dragonduck/property-listings-in-kuala-lumpur.
Tidyverse packages. Tidyverse. (n.d.). https://www.tidyverse.org/packages/.
Rstudio. (n.d.). rstudio/cheatsheets. GitHub. https://github.com/rstudio/cheatsheets.