Assignment 1

##Introduction {.tabset.tabset-pills}

For this assignment, we will be exploring the real estate market in Kuala Lumpur. We will use a dataset of property listings in Kuala Lumpur obtained from Kaggle (https://www.kaggle.com/dragonduck/property-listings-in-kuala-lumpur). The dataset represents property listings across Kuala Lumpur with eight attributes, as follows:

1. Location - Area where property is located within Kuala Lumpur
2. Price - Property market price in Ringgit Malaysia
3. Rooms - Number of rooms for a given property
4. Bathrooms - Number of bathrooms for a given property
5. Car.Parks - Number of car parks provided in a shared parking area, porch or garage
6. Property.Type - Type of property (e.g. Condominium, Soho)
7. Size - Size in square feet of a given property
8. Furnishing - Indicates whether a given property is furnished or unfurnished.

There are 53,883 observations in this dataset. For simplicity, we will select only 1-bedroom, unfurnished properties as a subset of the larger data. We will also ignore the distinction between the two size types, Land Area and Built-up.

##Packages Info {.tabset.tabset-pills}

We use the tidyr, dplyr and ggplot2 packages, which are part of the tidyverse: https://www.tidyverse.org/packages/

#install.packages("tidyverse")
library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

The regular expression used later to strip parenthesised text from “Property Type” comes from a Stack Overflow answer (https://stackoverflow.com/questions/13529360/replace-text-within-parenthesis-in-r):

gsub(" *\\(.*?\\) *", "", x)

In the answer's example, this returns [1] "Keep me. Again keep me. Again again keep me." It works like this:

- ` *` matches zero or more spaces before (and after) the parentheses.
- `(` and `)` are special symbols in a regex, so they need to be escaped, i.e. written as `\\(` and `\\)` in an R string.
- `.*?` is a wildcard that matches any characters, where the `?` makes the match non-greedy. This is necessary because regex matching is greedy by default. In other words, without the `?` the match would start at the first opening parenthesis and end at the last closing parenthesis.
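
The difference between greedy and non-greedy matching described above can be sketched on a made-up string. The input below is hypothetical and is not taken from the dataset; it only illustrates how the two patterns behave in base R.

```r
# Hypothetical input string, used only for illustration
x <- "Condominium (Corner), Serviced Residence (Intermediate)"

# Non-greedy (.*?): each parenthesised group is matched and removed on its own
non_greedy <- gsub(" *\\(.*?\\) *", "", x)
non_greedy
# [1] "Condominium, Serviced Residence"

# Greedy (.*): a single match runs from the first "(" to the last ")"
greedy <- gsub(" *\\(.*\\) *", "", x)
greedy
# [1] "Condominium"
```

Note that the pattern also removes the spaces around each parenthesised group, so text that directly follows a closing parenthesis is joined to the text before the opening one.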

##Data Preparation {.tabset.tabset-pills}

First, we should set up a working directory for our data using “setwd”. To check if we are using the right directory, we can use the function “getwd”.

setwd("/Users/ainaanajihah/Documents/Masters/7004-Programming in Data Science/Assignment 1")
getwd()

## [1] "/Users/ainaanajihah/Documents/Masters/7004-Programming in Data Science/Assignment 1"

Next, we load our dataset, which is in .csv format, using the function “read.csv”. The result is stored as a data frame in R. We then use “filter” to keep only the listings with one room that are unfurnished.

kl<-read.csv("data_kaggle_property_kl.csv", header = TRUE, sep = ",")
kl_df<-filter(kl,Rooms=="1" & Furnishing =="Unfurnished")
head(kl_df)

##                      Location      Price Rooms Bathrooms Car.Parks
## 1 Bukit Bintang, Kuala Lumpur RM750,000      1         1         1
## 2 Bukit Bintang, Kuala Lumpur RM800,000      1         1         1
## 3   KL Eco City, Kuala Lumpur RM835,000      1         1         1
## 4       KL City, Kuala Lumpur RM810,000      1         1        NA
## 5 Bukit Bintang, Kuala Lumpur RM765,000      1         1         1
## 6    Mont Kiara, Kuala Lumpur RM570,000      1         1        NA
##                       Property.Type                   Size  Furnishing
## 1                       Condominium Built-up : 538 sq. ft. Unfurnished
## 2                       Condominium Built-up : 538 sq. ft. Unfurnished
## 3                       Condominium Built-up : 700 sq. ft. Unfurnished
## 4 Serviced Residence (Intermediate) Built-up : 732 sq. ft. Unfurnished
## 5                       Condominium Built-up : 538 sq. ft. Unfurnished
## 6                Serviced Residence Built-up : 678 sq. ft. Unfurnished

Next, we inspect the basic structure of the filtered KL property dataset using “str”.

str(kl_df)

## 'data.frame':    50 obs. of  8 variables:
##  $ Location     : chr  "Bukit Bintang, Kuala Lumpur" "Bukit Bintang, Kuala Lumpur" "KL Eco City, Kuala Lumpur" "KL City, Kuala Lumpur" ...
##  $ Price        : chr  "RM750,000 " "RM800,000 " "RM835,000 " "RM810,000 " ...
##  $ Rooms        : chr  "1" "1" "1" "1" ...
##  $ Bathrooms    : int  1 1 1 1 1 1 1 2 1 1 ...
##  $ Car.Parks    : int  1 1 1 NA 1 NA NA 1 NA 1 ...
##  $ Property.Type: chr  "Condominium" "Condominium" "Condominium" "Serviced Residence (Intermediate)" ...
##  $ Size         : chr  "Built-up : 538 sq. ft." "Built-up : 538 sq. ft." "Built-up : 700 sq. ft." "Built-up : 732 sq. ft." ...
##  $ Furnishing   : chr  "Unfurnished" "Unfurnished" "Unfurnished" "Unfurnished" ...

From this output, we see that the variables come in two main formats, character and integer. We will clean the data in two parts, as follows.

Character : “Location”, “Property Type” and “Furnishing” will stay in character format. The two variables that we would like to clean are “Location” and “Property Type”.

- Location : Since all of our properties are in Kuala Lumpur, we remove “, Kuala Lumpur” so that the column shows only the area within KL.
- Property Type : We simplify the property types by removing everything in parentheses, so that “Serviced Residence (Intermediate)” becomes “Serviced Residence”. A regular expression can be used to match the parenthesised text.

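Integer : “Bathrooms” and “Car Parks” are already integers and are left as they are. “Rooms” is stored as character, but it is constant in this subset, so we also leave it as it is. The two variables that we want in numerical format are “Price” and “Size”, both of which are currently stored as character.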

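kl_df1<-kl_df %>% 
  # keep only the area name by dropping the ", Kuala Lumpur" suffix
  mutate(across(all_of("Location"), ~gsub(", Kuala Lumpur", "",.))) %>% 
  # remove any parenthesised qualifier, e.g. " (Intermediate)"
  mutate(across(all_of("Property.Type"), ~gsub(" *\\(.*?\\) *", "",.))) %>% 
  # strip "R", "M" and "," from Price, then convert to numeric
  mutate(across(starts_with("Price"), ~gsub("[RM,]", "",.) %>% as.numeric)) %>%  
  # split Size at ":" into the size type and the size itself
  separate(.,col="Size",into = c("Size.Type","Size"),sep=":") %>%
  # drop the " sq. ft." unit and convert Size to numeric
  mutate(across(all_of("Size"), ~gsub(" sq. ft.", "",.) %>% as.numeric))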

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [37, 40,
## 50].

head(kl_df1)

##        Location  Price Rooms Bathrooms Car.Parks      Property.Type Size.Type
## 1 Bukit Bintang 750000     1         1         1        Condominium Built-up 
## 2 Bukit Bintang 800000     1         1         1        Condominium Built-up 
## 3   KL Eco City 835000     1         1         1        Condominium Built-up 
## 4       KL City 810000     1         1        NA Serviced Residence Built-up 
## 5 Bukit Bintang 765000     1         1         1        Condominium Built-up 
## 6    Mont Kiara 570000     1         1        NA Serviced Residence Built-up 
##   Size  Furnishing
## 1  538 Unfurnished
## 2  538 Unfurnished
## 3  700 Unfurnished
## 4  732 Unfurnished
## 5  538 Unfurnished
## 6  678 Unfurnished
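
To see what the Price and Size conversions do to individual values, here is a minimal base R sketch that applies the same substitutions to strings copied from the str() output above. It uses strsplit() instead of separate(), so it is an equivalent illustration rather than the exact pipeline.

```r
# Raw strings as they appear in the str() output (note the trailing space in Price)
price_raw <- c("RM750,000 ", "RM800,000 ")
size_raw  <- "Built-up : 538 sq. ft."

# Remove the characters "R", "M" and ",", then convert; as.numeric ignores the trailing space
price_num <- as.numeric(gsub("[RM,]", "", price_raw))
price_num

# Take the part after ":", drop the unit, then convert
size_part <- strsplit(size_raw, ":", fixed = TRUE)[[1]][2]
size_num  <- as.numeric(gsub(" sq. ft.", "", size_part, fixed = TRUE))
size_num
```

Here fixed = TRUE treats " sq. ft." as a literal string. In the pipeline above the dots are regex wildcards, which gives the same result on this data but would also match other characters in those positions.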

Now we have “Price” and “Size” both in numerical format. Next, we can simplify our table by removing columns that do not carry meaningful values. Since our data contains only unfurnished one-bedroom properties across KL, the columns “Rooms” and “Furnishing” are constant and can be removed. We can also drop “Size Type”, as stated in the “Introduction” tab. Additionally, we remove “Car Parks” for this analysis, since our focus is mainly on property price, size and location.

kl_df1<-subset(kl_df1, select =-c(Rooms,Furnishing,Car.Parks,Size.Type))
head(kl_df1)

##        Location  Price Bathrooms      Property.Type Size
## 1 Bukit Bintang 750000         1        Condominium  538
## 2 Bukit Bintang 800000         1        Condominium  538
## 3   KL Eco City 835000         1        Condominium  700
## 4       KL City 810000         1 Serviced Residence  732
## 5 Bukit Bintang 765000         1        Condominium  538
## 6    Mont Kiara 570000         1 Serviced Residence  678

Our new dataset now has only 5 variables, all in the right format. Finally, let’s count the missing values in each column and remove the rows that contain them.

colSums(is.na(kl_df1))

##      Location         Price     Bathrooms Property.Type          Size 
##             0             0             1             0             3

kl_df1<-kl_df1[complete.cases(kl_df1),]
head(kl_df1)

##        Location  Price Bathrooms      Property.Type Size
## 1 Bukit Bintang 750000         1        Condominium  538
## 2 Bukit Bintang 800000         1        Condominium  538
## 3   KL Eco City 835000         1        Condominium  700
## 4       KL City 810000         1 Serviced Residence  732
## 5 Bukit Bintang 765000         1        Condominium  538
## 6    Mont Kiara 570000         1 Serviced Residence  678

str(kl_df1)

## 'data.frame':    46 obs. of  5 variables:
##  $ Location     : chr  "Bukit Bintang" "Bukit Bintang" "KL Eco City" "KL City" ...
##  $ Price        : num  750000 800000 835000 810000 765000 570000 324000 530000 770000 470000 ...
##  $ Bathrooms    : int  1 1 1 1 1 1 1 2 1 1 ...
##  $ Property.Type: chr  "Condominium" "Condominium" "Condominium" "Serviced Residence" ...
##  $ Size         : num  538 538 700 732 538 678 500 715 1300 705 ...
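
The missing-value step above can be sketched on a toy data frame. The values below are made up for illustration; only the two functions, colSums(is.na()) and complete.cases(), match what is used on the real data.

```r
# Toy data frame with one NA in each column, in different rows
toy <- data.frame(
  Bathrooms = c(1L, NA, 1L, 2L),
  Size      = c(538, 700, NA, 715)
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
na_counts

# Keep only the rows without any missing value
toy_clean <- toy[complete.cases(toy), ]
nrow(toy_clean)
```

A row is dropped if any of its columns is NA, so the number of rows removed equals the total NA count only when every NA sits in a different row.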

We are left with a dataset consisting of 5 variables and 46 observations (4 of the original 50 rows contained a missing value), ready for data analysis!

##Data Analysis {.tabset.tabset-pills}

Let’s look at the overall summary of the dataset using the function summary().

summary(kl_df1)

##    Location             Price           Bathrooms     Property.Type     
##  Length:46          Min.   : 200000   Min.   :1.000   Length:46         
##  Class :character   1st Qu.: 400000   1st Qu.:1.000   Class :character  
##  Mode  :character   Median : 562500   Median :1.000   Mode  :character  
##                     Mean   : 577978   Mean   :1.217                     
##                     3rd Qu.: 750000   3rd Qu.:1.000                     
##                     Max.   :1050000   Max.   :2.000                     
##       Size       
##  Min.   : 430.0  
##  1st Qu.: 619.5  
##  Median : 710.0  
##  Mean   : 710.7  
##  3rd Qu.: 767.8  
##  Max.   :1300.0
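
A side note on the quartiles reported here: for numeric columns, summary() computes them with quantile(), which interpolates between data points and can differ slightly from Tukey's hinges as returned by fivenum(). A small sketch on the integers 1 to 10 (not the property data) shows the difference.

```r
x <- 1:10

# Tukey's five-number summary (min, lower hinge, median, upper hinge, max)
tukey <- fivenum(x)
tukey
# [1]  1.0  3.0  5.5  8.0 10.0

# Quantile-based values, as used by summary()
quart <- quantile(x, c(0, 0.25, 0.5, 0.75, 1))
unname(quart)
# [1]  1.00  3.25  5.50  7.75 10.00
```

For larger samples the two usually agree closely, so the distinction rarely changes the interpretation.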

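For the numerical attributes, summary() reports the mean alongside a five-number summary (minimum, first quartile, median, third quartile and maximum). For both Price and Size the mean is close to the median (577,978 vs 562,500 for Price and 710.7 vs 710.0 for Size), which suggests that both distributions are roughly symmetric and may be close to normal. Now let’s plot histograms for both attributes to check.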

ggplot(data=kl_df1, aes(Price)) + 
  geom_histogram()+theme_gray()+
  labs(title="Histogram for Price of Studio Apartment in KL")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=kl_df1, aes(Size)) + 
  geom_histogram()+theme_gray()+
  labs(title="Histogram for Size of Studio Apartment in KL")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
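
The message above appears because geom_histogram() falls back to 30 bins when no bin size is given. A minimal sketch of setting the bin width explicitly is shown below. The prices are made up and the width of RM 50,000 is an assumption chosen only for illustration, not a recommended value for this dataset.

```r
library(ggplot2)

# Made-up prices, for illustration only
toy <- data.frame(
  Price = c(200000, 324000, 400000, 470000, 530000,
            570000, 750000, 765000, 800000, 835000)
)

# binwidth is given in the units of the x variable, so each bar spans RM 50,000
p <- ggplot(data = toy, aes(Price)) +
  geom_histogram(binwidth = 50000) +
  theme_gray() +
  labs(title = "Histogram for Price (toy data)")
p
```

With an explicit binwidth the message is no longer printed. Because the shape of a histogram can change with the bin size, it is worth trying a few widths before reading too much into features such as a second peak.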

The histogram for Size supports our guess about property Size. However, the histogram for Price reveals a bimodal distribution, which the summary statistics alone did not show. This is why it is important to visualize the data distribution. Next, we explore the relationship between price and size (area) to see whether it is linear.

ggplot(data=kl_df1, aes(Price,Size)) + 
  geom_jitter()+
  geom_smooth(method=loess)+
  theme_gray()+
  labs(title="Correlation between property price and area")

## `geom_smooth()` using formula 'y ~ x'
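
Alongside the plot, the strength of a linear relationship can be summarised numerically with cor() and lm(). The sketch below uses only the first six rows shown in the head() output above, so its result is not representative of the full 46-row dataset; on the real data the same calls would be made on kl_df1.

```r
# First six rows of the cleaned data, copied from the head() output
toy <- data.frame(
  Price = c(750000, 800000, 835000, 810000, 765000, 570000),
  Size  = c(538, 538, 700, 732, 538, 678)
)

# Pearson correlation coefficient, between -1 and 1
r <- cor(toy$Price, toy$Size)
r

# Straight-line fit with Size on the y-axis, matching aes(Price, Size)
fit <- lm(Size ~ Price, data = toy)
coef(fit)
```

A coefficient near 0 would suggest little linear association, in which case the loess curve in the plot is the more informative view.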

ggplot(data=kl_df1, aes(Price,Location)) + 
  geom_col()+
  theme_gray()+
  labs(title="Total listed price by location")

ggplot(data=kl_df1, aes(Property.Type,Location)) + 
  geom_count()+
  labs(title="Number of listings by property type and location")

##Function Index {.tabset.tabset-pills}

Ainaa Najihah Abdul Rahim

4/18/2021