0.1 Introduction



In Singapore, residential properties fall into two board categories: public housing (HDB flats) and private housing. There are numbers of differences between these two types. One of them is the eligibility of buying HDB flat. According to HDB website, applicants must meet various eligibility requirements, including but limited to nationality, income ceiling, property ownership. Even HDB resale flat also have quite a lot of requirements.

Compared to public housing, buying private housing do not have many restrictions. As the result, although the performance of private and public housing interacts with each other, they do have different traits.

Strait Times news “Singapore private home prices inch up 2.7% for 2019” on 24 Jan 2020 link states that “The private housing market went in two directions last year - while prices inched north, the number of sales took a sharp turn south.” In this assignment, we will focus on Singapore private housing resale market, analysing the resale price and volume performance.

0.1.1 Description of dataset


The dataset covers all resale transactions from 1 Jan 2017 to 31 Dec 2019, which is downloaded from Realis database. It contains 20 columns with 37,694 rows. Below table lists the column names and description of the variables.

Column_names Descriptions
Project Name Name of the property project
Address Detail address of the property
No. of Units Number of units sold in the transaction
Area (sqm) Property area in square meter
Type of Area Type of the area, landed or Strata
Transacted Price Total transaction price
Nett Price Null value
Unit Price ($ psm) Unit price per square meter
Unit Price ($ psf) Unit price per square feet
Sale Date Date of the transaction (from 1 Jan 17 to 31 Dec 19)
Property Type Type of the property, condominium, apartment, EC, terrace house or semi-detached house
Tenure Tenure of the property, 99 years or freehold
Completion Date Year of the property built
Type of Sale Type of sale, here is all resale
Purchaser Address Indicator The purchaser stays in HDB or condo or apartment etc
Postal District Post code of the property
Postal Sector Post code of the property
Postal Code Post code of the property
Planning Region Property location, by planning region, e.g. Central area
Planning Area Property location, by planning area, e.g. Geylang


0.1.2 Objective and problem statement

This assignment will focus on Singapore private housing resale market from 1 Jan 2017 to 31 Dec 2019 (3 years), and analysis the trend of transaction price and transaction volume.

0.1.3 Challenges faced and solution

Although the dataset downloaded is quite clean, it still has some problems such as missing data, useless variables, outliers etc. Below table lists down the problems faced and proposed solution.

Problem_faced Description Propsed_solution
Variable type Noted that some variables are not in correct data type, e.g. “Sale Date” should be in date format instead of character. To use proper function, such as as.numeric, to covert the variables into correct data type.
Outlier It is noted that the dataset contains some abnormal transactions. For example, one single transaction had 560 units, or total transaction price was as high as 980 million. These transactions are not reflectable. To exclude outliers based on 3IQR method. Refer to Wikipedia link, outlier is defined as observations that fall below Q1 - 1.5 IQR or above Q3 + 1.5 IQR.
Missing value Notice variables - “Nett Price”, “Completion Date” - contain missing value. To exclude missing value.
New variables to be derived In original data set, some variables are not useful for the analysis, such as “Project Name”, “Address”. Further, in order to perform certain analysis, some new variables need to be derived. To derive new variables: sale_month; sale_year; Property_age_when_sold

0.1.4 Proposed sketched design





0.2 Step by step data visualization

0.2.2 Import data set

The data set is in csv format, we will use read_csv to import it.

data <- read_csv('realis_resale_17_19.csv')

0.2.3 Data preparation

0.2.3.1 Overview of data by histogram

g_unit <- ggplot(data, aes(x= `No. of Units`)) +
  geom_histogram(bins=40, color="black", fill="light blue")

g_area <- ggplot(data, aes(x= `Area (sqm)`)) +
  geom_histogram(bins=40, color="black", fill="light blue")

g_price <- ggplot(data, aes(x= `Transacted Price ($)`)) +
  geom_histogram(bins=40, color="black", fill="light blue")


g_unitprice <- ggplot(data, aes(x= `Unit Price ($ psm)`)) +
  geom_histogram(bins=40, color="black", fill="light blue")

ggarrange(g_unit, g_area, g_price, g_unitprice, ncol = 2, nrow = 2)


0.2.3.2 Overview of data statistics summary

##### Glimpse the data set #####
glimpse(data)
## Rows: 37,693
## Columns: 20
## $ `Project Name`                <chr> "SOUTH BEACH RESIDENCES", "SHELFORD 2...
## $ Address                       <chr> "28 Beach Road  #26-06", "23A Shelfor...
## $ `No. of Units`                <dbl> 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1...
## $ `Area (sqm)`                  <dbl> 162, 155, 105, 117, 61, 788, 142, 204...
## $ `Type of Area`                <chr> "Strata", "Strata", "Strata", "Strata...
## $ `Transacted Price ($)`        <dbl> 5953500, 2250000, 1430000, 968000, 11...
## $ `Nett Price($)`               <chr> "-", "-", "-", "-", "-", "-", "-", "-...
## $ `Unit Price ($ psm)`          <dbl> 36750, 14516, 13619, 8274, 19344, 117...
## $ `Unit Price ($ psf)`          <dbl> 3414, 1349, 1265, 769, 1797, 1094, 88...
## $ `Sale Date`                   <chr> "30-Jun-19", "28-Jun-19", "28-Jun-19"...
## $ `Property Type`               <chr> "Apartment", "Apartment", "Apartment"...
## $ Tenure                        <chr> "99 Yrs From 10/12/2007", "Freehold",...
## $ `Completion Date`             <chr> "2016", "2011", "2001", "2000", "2008...
## $ `Type of Sale`                <chr> "Resale", "Resale", "Resale", "Resale...
## $ `Purchaser Address Indicator` <chr> "N.A", "Private", "Private", "HDB", "...
## $ `Postal District`             <dbl> 7, 11, 7, 19, 1, 9, 15, 23, 8, 15, 19...
## $ `Postal Sector`               <dbl> 18, 28, 18, 54, 1, 22, 42, 65, 21, 43...
## $ `Postal Code`                 <dbl> 189762, 286655, 187948, 545119, 18987...
## $ `Planning Region`             <chr> "Central Region", "Central Region", "...
## $ `Planning Area`               <chr> "Downtown Core", "Bukit Timah", "Roch...
##### View summary statistics #####
summary(data)
##  Project Name         Address           No. of Units       Area (sqm)     
##  Length:37693       Length:37693       Min.   :  1.000   Min.   :   24.0  
##  Class :character   Class :character   1st Qu.:  1.000   1st Qu.:   89.0  
##  Mode  :character   Mode  :character   Median :  1.000   Median :  116.0  
##                                        Mean   :  1.166   Mean   :  172.4  
##                                        3rd Qu.:  1.000   3rd Qu.:  158.0  
##                                        Max.   :560.000   Max.   :87986.0  
##  Type of Area       Transacted Price ($) Nett Price($)      Unit Price ($ psm)
##  Length:37693       Min.   :   400000    Length:37693       Min.   : 1503     
##  Class :character   1st Qu.:  1013000    Class :character   1st Qu.:10174     
##  Mode  :character   Median :  1420000    Mode  :character   Median :12754     
##                     Mean   :  2505211                       Mean   :14018     
##                     3rd Qu.:  2250000                       3rd Qu.:16503     
##                     Max.   :980000000                       Max.   :54363     
##  Unit Price ($ psf)  Sale Date         Property Type         Tenure         
##  Min.   : 140       Length:37693       Length:37693       Length:37693      
##  1st Qu.: 945       Class :character   Class :character   Class :character  
##  Median :1185       Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1302                                                               
##  3rd Qu.:1533                                                               
##  Max.   :5050                                                               
##  Completion Date    Type of Sale       Purchaser Address Indicator
##  Length:37693       Length:37693       Length:37693               
##  Class :character   Class :character   Class :character           
##  Mode  :character   Mode  :character   Mode  :character           
##                                                                   
##                                                                   
##                                                                   
##  Postal District Postal Sector    Postal Code     Planning Region   
##  Min.   : 1.00   Min.   : 1.00   Min.   : 18965   Length:37693      
##  1st Qu.:10.00   1st Qu.:26.00   1st Qu.:268157   Class :character  
##  Median :15.00   Median :44.00   Median :449305   Mode  :character  
##  Mean   :15.17   Mean   :43.13   Mean   :439412                     
##  3rd Qu.:19.00   3rd Qu.:56.00   3rd Qu.:567746                     
##  Max.   :28.00   Max.   :82.00   Max.   :829750                     
##  Planning Area     
##  Length:37693      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 


From the summary and respective plots above, it can notice that “No. of units”, “Area” and “Transaction price” is highly right-skewed, with maximum number of 560, 87,986 and 98mil respectively. The extrem high amount can be treat as outliers and should be removed. Further variables “Sale Date” and “Completion Date” are in character type, which is not correct based on the nature of these variables.

0.2.3.3 Dealing with Outliers

As mentioned in section 0.1.3, 3 IQR method will be used to identify outliers. The outliers are identified based on 3 variables - “No. of units”, “Area” and “Transacted Price”.

First IQR value for each variables and upper limit (Q3 +1.5IQR) will be calculated. Then filter function will be applied to exclude those higher than upper limit calculated.

##### Calculate IQR and upper limit for outliers #####
IQR_unit_price = IQR(data$`Unit Price ($ psm)`)
IQR_area = IQR(data$`Area (sqm)`)
IQR_price = IQR(data$`Transacted Price ($)`)

unit_price_upper = quantile(data$`Unit Price ($ psm)`,probs = 0.75)+1.5*IQR_unit_price
area_upper = quantile(data$`Area (sqm)`,probs = 0.75)+1.5*IQR_area
price_upper = quantile(data$`Transacted Price ($)`,probs = 0.75)+1.5*IQR_price

##### Filter out outliers #####
data_filtered <- data %>%
  filter(`No. of Units`==1) %>%
  filter(`Completion Date`!= 'Unknown')%>%
  filter((`Area (sqm)`<=area_upper) & 
           (`Transacted Price ($)`<=price_upper) &
           (`Unit Price ($ psm)`<=unit_price_upper))

Review of summary statistics again

summary(data_filtered)
##  Project Name         Address           No. of Units   Area (sqm)   
##  Length:31243       Length:31243       Min.   :1     Min.   : 24.0  
##  Class :character   Class :character   1st Qu.:1     1st Qu.: 85.0  
##  Mode  :character   Mode  :character   Median :1     Median :111.0  
##                                        Mean   :1     Mean   :113.7  
##                                        3rd Qu.:1     3rd Qu.:135.0  
##                                        Max.   :1     Max.   :261.0  
##  Type of Area       Transacted Price ($) Nett Price($)      Unit Price ($ psm)
##  Length:31243       Min.   : 400000      Length:31243       Min.   : 4391     
##  Class :character   1st Qu.: 956944      Class :character   1st Qu.:10120     
##  Mode  :character   Median :1280000      Mode  :character   Median :12500     
##                     Mean   :1469333                         Mean   :13271     
##                     3rd Qu.:1780000                         3rd Qu.:15876     
##                     Max.   :4100888                         Max.   :25982     
##  Unit Price ($ psf)  Sale Date         Property Type         Tenure         
##  Min.   : 408       Length:31243       Length:31243       Length:31243      
##  1st Qu.: 940       Class :character   Class :character   Class :character  
##  Median :1161       Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1233                                                               
##  3rd Qu.:1475                                                               
##  Max.   :2414                                                               
##  Completion Date    Type of Sale       Purchaser Address Indicator
##  Length:31243       Length:31243       Length:31243               
##  Class :character   Class :character   Class :character           
##  Mode  :character   Mode  :character   Mode  :character           
##                                                                   
##                                                                   
##                                                                   
##  Postal District Postal Sector   Postal Code     Planning Region   
##  Min.   : 1.00   Min.   : 1.0   Min.   : 18965   Length:31243      
##  1st Qu.:11.00   1st Qu.:29.0   1st Qu.:298090   Class :character  
##  Median :16.00   Median :46.0   Median :465946   Mode  :character  
##  Mean   :15.42   Mean   :44.1   Mean   :449023                     
##  3rd Qu.:19.00   3rd Qu.:57.0   3rd Qu.:573910                     
##  Max.   :28.00   Max.   :82.0   Max.   :828843                     
##  Planning Area     
##  Length:31243      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Plot histogram again

g_area_1 <- ggplot(data_filtered, aes(x= `Area (sqm)`)) +
  geom_histogram(bins=40, color="black", fill="light blue")

g_price_1 <- ggplot(data_filtered, aes(x= `Transacted Price ($)`)) +
  geom_histogram(bins=40, color="black", fill="light blue")

g_unitprice_1 <- ggplot(data_filtered, aes(x= `Unit Price ($ psm)`)) +
  geom_histogram(bins=40, color="black", fill="light blue")

ggarrange(g_area_1, g_price_1, g_unitprice_1, ncol = 2, nrow = 2)


From the distribution of these continuous variables, it can be notice that the spread of these variables (standard deviation) is smaller and it becomes closer to normal distribution.

0.2.3.4 Extract new variables together with change data type

As mentioned, variables “Sale Date” and “Completion Date” are not in correct data type. They should be converted into date type and numeric type respectively.

In addition, in order to perform analysis, several new variables - “sale_year”, “sale_month”, “age_when_sold” will be derived. Variable “age_when_sold” will further be binned into four levels for easier analysis.

##### Convert "sale date" data type and Extract sales month and year #####
data_filtered<- data_filtered %>%
  mutate(sale_date = as.POSIXct(`Sale Date`, format = '%d-%b-%y'))%>%
  mutate(sale_month = month(sale_date)) %>%
  mutate(sale_year = year(sale_date)) %>%
  mutate(sale_period = paste(sale_year, sale_month, sep = '-'))


##### Convert "Completion Date" data type and Extract age_when_sold and bin it######
data_filtered <- data_filtered %>%
  mutate(age_when_sold = sale_year - as.numeric(`Completion Date`)) %>%
  mutate(age_binned = cut(age_when_sold, breaks = c(-Inf, 5, 10, 20, Inf), labels = c('<5', '5-10', '10-20', '>20')))


0.2.3.5 Edit text

In order to make the graph neater, we remove “Region” after each planning region, e.g. change “Central Region” to “Central”.

##### Edit text #####
data_filtered <- data_filtered %>%
  mutate(planning_region = str_replace(`Planning Region`, " Region",""))


0.2.4 Exploratory data analysis

We will explore the data from two major dimensions - Volume and Unit price. For standardise purpose, only “Unit price (psm)” will be used.

0.2.4.1 Exploratory of volume

This section will explore the resale volume along with sale date. In addition, we will use pie chart to see the proportion distribution by property type and planning region.

In order to build graph mentioned above, the data set need to be summarized by number of units as a whole, by property type and by planning region respectively. Further, ggplot2 package will be used to build the graphs. After that, all graph will be consolidated in one big graph through ggarrange.

##### No. of unit sold ~ sale date  #####
df2 <- data_filtered %>%
  group_by(sale_year, sale_month) %>%
  summarize(no_units = sum(`No. of Units`))%>%
  mutate(sale_period = paste(sale_year, sale_month, sep = '-'))

df2 <- rowid_to_column(df2, 'id')

avg=round(mean(df2$no_units))

g1<- ggplot(df2, aes(x = reorder(sale_period,id), y=no_units))+
  geom_line(group = 1)+
  geom_hline(aes(yintercept = avg), color = 'red')+
  geom_text(aes(34,avg,label = paste('Mean: ',avg), vjust = -1, label.size = 0.15))+
  xlab('Period')+
  ylab('No of unit sold')+
  labs(fill = 'Year')+
  coord_cartesian(ylim = c(300,1600))+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, size = 7),
        axis.text.y = element_text(size = 8),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 9))

##### No. of unit sold ~ sale date (by property type)  #####
df7 <- data_filtered %>%
  group_by(sale_year, sale_month, `Property Type`) %>%
  summarize(no_units = sum(`No. of Units`))%>%
  mutate(sale_period = paste(sale_year, sale_month, sep = '-'))


df7_1 <- df7 %>%
  filter(`Property Type` == 'Apartment')
df7_1 <- rowid_to_column(df7_1, 'id')

df7_2 <- df7 %>%
  filter(`Property Type` == 'Condominium')
df7_2 <- rowid_to_column(df7_2, 'id')

df7_3 <- df7 %>%
  filter(`Property Type` == 'Executive Condominium')
df7_3 <- rowid_to_column(df7_3, 'id')

df7_4 <- df7 %>%
  filter(`Property Type` == 'Semi-Detached House')
df7_4 <- rowid_to_column(df7_4, 'id')

df7_5 <- df7 %>%
  filter(`Property Type` == 'Terrace House')
df7_5 <- rowid_to_column(df7_5, 'id')

g10<- ggplot()+
  geom_line(data = df7_1, aes(x = reorder(sale_period, id), y=no_units, color = `Property Type`),group = 1)+
  geom_line(data = df7_2, aes(x = reorder(sale_period, id), y=no_units, color = `Property Type`),group = 1)+
  geom_line(data = df7_3, aes(x = reorder(sale_period, id), y=no_units, color = `Property Type`),group = 1)+
  geom_line(data = df7_4, aes(x = reorder(sale_period, id), y=no_units, color = `Property Type`),group = 1)+
  geom_line(data = df7_5, aes(x = reorder(sale_period, id), y=no_units, color = `Property Type`),group = 1)+
  xlab('Period')+
  ylab('No of unit sold')+
  coord_cartesian(ylim = c(0,1100))+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, vjust = 0.5, size = 7),
        axis.text.y = element_text(size = 8),
        axis.title.y = element_text(size = 9),
        axis.title.x = element_blank(),
        legend.text = element_text(size = 7),
        legend.position=c(0.5,0.85),
        legend.direction = "horizontal",
        legend.title = element_blank())



##### No. of unit sold ~ Planning region  #####
blank_theme <- theme_minimal()+
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.border = element_blank(),
    panel.grid=element_blank(),
    axis.ticks = element_blank(),
    plot.title=element_text(size=9,face = 'bold'),
    legend.text = element_text(size=7),
    legend.title = element_blank()
  )

df4<-data_filtered %>%
  group_by(`Planning Region`) %>%
  summarise(no_unit = n())

g2<- ggplot(df4, aes(x="", y=no_unit, fill = `Planning Region`))+
  geom_bar(width = 1, stat = 'identity')+
  ggtitle('By Planning Region')+
  coord_polar("y",start = 0)+
  scale_fill_manual(values=c("#CCE5FF", "#E5FFCC", "#EEE8AA", "#E0FFFF","#E6E6FA")) +
  blank_theme +
  theme(axis.text.x=element_blank())+
  geom_text(aes(label = scales::percent(no_unit/sum(no_unit))), position = position_stack(vjust = 0.5), size=2.5)


##### No. of unit sold ~ Property Type #####
df5<-data_filtered %>%
  group_by(`Property Type`) %>%
  summarise(no_unit = n()) %>%
  mutate(grp = c('Apartment','Condominium','Others','Others','Others'))

df6 <- aggregate(df5$no_unit, by = list(df5$`grp`),FUN = sum)

g3<-ggplot(df6, aes(x="", y=x, fill = Group.1))+
  geom_bar(width = 1, stat = 'identity')+
  labs(fill = 'Property Type')+
  ggtitle('By Property Type')+
  coord_polar("y",start = 0)+
  scale_fill_manual(values=c("#FFDAB9", "#E6E6FA", "#DCDCDC")) +
  blank_theme +
  theme(axis.text.x=element_blank())+
  geom_text(aes(label = scales::percent(x/sum(x))), position = position_stack(vjust = 0.5), size=2.5)


##### Arrange three plots togehter #####
figure2 <- ggarrange(g1,g10,
                    ggarrange(g2,g3, ncol = 2), nrow=3, heights = c(2.5, 3, 3))
annotate_figure(figure2,
                top = text_grob('Number of unit sold',
                                face = 'bold', size = 13),
                bottom = text_grob('Data is extracted from Realis database\n from 1 Jan 2017 to 31 Dec 2019',
                                   face = 'italic', color = '#404040', size = 7, hjust = 1, x=1 ))



0.2.4.2 Exploratory of unit price

In this section, we will explore the resale unit price by several factors, such as property type, planning region.

First, 4 graphs will be created (unit price by planning region, by property type, by property age and by sale year) through ggplot2 function. Then these graphs will use ggarrange to consolidate into one graph.

##### Unit price ~ Planning region #####
g4<- ggplot(data_filtered, aes(x=planning_region, y = `Unit Price ($ psm)`))+
  geom_boxplot()+
  geom_violin(color = '#FFCCCC', fill ='#FFCCCC', alpha = 0.5 )+
  stat_summary(geom = 'point', fun = 'mean', color='red')+
  ggtitle('by Planning Region')+
  xlab('Planning Region')+
  ylab('Unit price($psm)')+
  theme(plot.title = element_text(size = 10),
        axis.text.x = element_text(vjust = 0.5, size = 8),
        axis.text.y = element_text(size = 8),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 9),
        panel.background = element_rect(fill='#F6F6F6'),
        panel.grid.major = element_line(size = 0.25, linetype = 'solid', colour = 'white'),
        panel.grid.minor = element_line(size = 0.25, linetype = 'dashed', colour = 'white'))

##### Unit price ~ Property type #####

g5<- ggplot(data_filtered, aes(x=`Property Type`, y = `Unit Price ($ psm)`))+
  geom_boxplot()+
  geom_violin(color = '#FFCCCC', fill ='#FFCCCC', alpha = 0.5 )+
  stat_summary(geom = 'point', fun = 'mean', color='red')+
  ggtitle('by Property type')+
  xlab('Property type')+
  scale_x_discrete(labels = c('Apartment', 'Condo', 'EC', 'Semi-Detached\nHouse', 'Terrace\nHouse'))+
  theme(plot.title = element_text(size = 10),
        panel.background = element_rect(fill='#F6F6F6'),
        axis.text.x = element_text(vjust = 0.5, size = 7),
        axis.text.y = element_text(size = 8),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 9),
        panel.grid.major = element_line(size = 0.25, linetype = 'solid', colour = 'white'),
        panel.grid.minor = element_line(size = 0.25, linetype = 'dashed', colour = 'white'))

##### Unit price ~ age_when_sold #####
g6<- ggplot(data_filtered, aes(x=`age_binned`, y = `Unit Price ($ psm)`))+
  geom_boxplot()+
  geom_violin(color = '#FFCCCC', fill ='#FFCCCC', alpha = 0.5 )+
  stat_summary(geom = 'point', fun = 'mean', color='red')+
  ggtitle('by Property age')+
  xlab('Property age')+
  ylab('Unit price($psm)')+
  theme(plot.title = element_text(size = 10),
        axis.text.y = element_text(size = 8),
        axis.text.x = element_text(size = 8),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 9),
        panel.background = element_rect(fill='#F6F6F6'),
        panel.grid.major = element_line(size = 0.25, linetype = 'solid', colour = 'white'),
        panel.grid.minor = element_line(size = 0.25, linetype = 'dashed', colour = 'white'))


##### Unit price ~ sale_year #####
g7<- ggplot(data_filtered, aes(x=as.factor(`sale_year`), y = `Unit Price ($ psm)`))+
  geom_boxplot()+
  geom_violin(color = '#FFCCCC', fill ='#FFCCCC', alpha = 0.5 )+
  stat_summary(geom = 'point', fun = 'mean', color='red')+
  ggtitle('by Sale Year')+
  xlab('Year')+
  ylab('Unit price($psm)')+
  theme(plot.title = element_text(size = 10),
        panel.background = element_rect(fill='#F6F6F6'),
        axis.text.y = element_text(size = 8),
        axis.text.x = element_text(size = 8),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 9),
        panel.grid.major = element_line(size = 0.25, linetype = 'solid', colour = 'white'),
        panel.grid.minor = element_line(size = 0.25, linetype = 'dashed', colour = 'white'))

##### Arrange plots #####
figure <- ggarrange(g4, g5, g6, g7, nrow = 2, ncol = 2, 
                    labels = c('i)', 'ii)', 'iii)', 'iv)'),
                    font.label = list(size = 10, face = 'italic'))
annotate_figure(figure,
                top = text_grob('Property unit price ($psm) by various factors',
                                face = 'bold', size = 13),
                bottom = text_grob('Data is extracted from Realis database\nfrom 1 Jan 2017 to 31 Dec 2019',
                                   face = 'italic', color = '#404040', size = 7, hjust = 1, x=1))



0.2.5 Statistical test


From the exploration performed above, it seems that unit price of resale property are different in each planning region. Also it is various across property type and property age. However, it seems that unit price does not differ a lot across 3 years.

To statistically prove this observation, ANOVA or Kruskal Wallis test will be performed based on whether data is normally distributed.

Thus, we will first test whether unit price is normally distributed. If yes, ANOVA test will be used. Otherwise, Kruskal Wallis test will be used.

0.2.5.1 Normality test

To perform normality test, we will plot the unit price, and compare with normal distribution line. Then we will plot qq-plot and see whether the line is approximately in straight line bases.

##### Normal distribution test #####
m = mean(data_filtered$`Unit Price ($ psm)`)
std = sd(data_filtered$`Unit Price ($ psm)`)

g8<-ggplot(data_filtered, aes(`Unit Price ($ psm)`))+
  geom_histogram(aes(y=..density..), fill = '#B0C4DE', color = '#808080')+
  stat_function(fun = dnorm, args = list(mean = m, sd = std), col="dark blue", size = .8)+
  ggtitle('Distribution of unit price')+
  theme(plot.title = element_text(size = 10, face = 'bold'),
        panel.background = element_rect(fill='#F6F6F6'),
        panel.grid.major = element_line(size = 0.25, linetype = 'solid', colour = 'white'),
        panel.grid.minor = element_line(size = 0.25, linetype = 'dashed', colour = 'white'),
        axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 8),
        axis.title.x = element_text(size = 9))

##### qq Plot #####

g9<-ggplot(data_filtered, aes(sample = `Unit Price ($ psm)`))+
  stat_qq()+
  stat_qq_line()+
  ggtitle('qq Plot')+
  theme(plot.title = element_text(size = 10, face = 'bold'),
        panel.background = element_rect(fill='#F6F6F6'),
        panel.grid.major = element_line(size = 0.25, linetype = 'solid', colour = 'white'),
        panel.grid.minor = element_line(size = 0.25, linetype = 'dashed', colour = 'white'),
        axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank())

##### Summary table #####

max = max(data_filtered$`Unit Price ($ psm)`)
min = min(data_filtered$`Unit Price ($ psm)`)
median = median(data_filtered$`Unit Price ($ psm)`)
mean = round(mean(data_filtered$`Unit Price ($ psm)`))
std = round(sd(data_filtered$`Unit Price ($ psm)`))
skw = round(e1071::skewness(data_filtered$`Unit Price ($ psm)`),2)


list = c('Maximum', max, 'Minimum', min, 'Median', median, 'Mean', mean, 'SD', std, 'Skewness', skw)
matrix = matrix(list, ncol=4, byrow = T)
tab <- ggtexttable(matrix,theme = ttheme("lBlue", base_size = 9))

tab<- tab %>%
  tab_add_title('Statistics summary', face = 'bold', size = 10)


##### Arrange three plots (normality) #####
figure3 <- ggarrange(g8,
                     ggarrange(g9,tab, ncol = 2), nrow=2)
annotate_figure(figure3,
                top = text_grob('Unit price normality visulization',
                                face = 'bold', size = 13),
                bottom = text_grob('Data is extracted from Realis database\n from 1 Jan 2017 to 31 Dec 2019',
                                   face = 'italic', color = '#404040', size = 7, hjust = 1, x=1))



Based on the comparison of distribution of unit price with normal distribution curve, we can find that the distribution is not very fit. QQ-plot and statistics summary table also show that the distribution seems is not normally distributed.

To further confirm the observation, shapiro.test will be performed. Do take note that due to limitation of shapiro.test which can only take maximum 5,000 data, we will randomly sample the dataset before perform the testing.

##### Shapiro test #####

df_nor<- sample(data_filtered$`Unit Price ($ psm)`, 5000)
shapiro.test(df_nor)
## 
##  Shapiro-Wilk normality test
## 
## data:  df_nor
## W = 0.95823, p-value < 2.2e-16


The shapiro test result shows that p-value is less than 0.05, which means unit price is not normally distributed. Thus Kruskal Wallis test will be used for further testing.

0.2.5.2 Kruskal Wallis test

Kruskal Wallis test is performed to test the hypothesis as indicated below.

1) Unit price ~ Planning region
H0: Unit price is same across all planning region.
H1: Unit price is not same across all planning region.

##### Unit price ~ Planning region #####
kruskal.test(`Unit Price ($ psm)` ~ `Planning Region`, data = data_filtered)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Unit Price ($ psm) by Planning Region
## Kruskal-Wallis chi-squared = 12766, df = 4, p-value < 2.2e-16
pairwise.wilcox.test(data_filtered$`Unit Price ($ psm)`, data_filtered$`Planning Region`,
                     p.adjust.method = "BH")
## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  data_filtered$`Unit Price ($ psm)` and data_filtered$`Planning Region` 
## 
##                   Central Region East Region North East Region North Region
## East Region       <2e-16         -           -                 -           
## North East Region <2e-16         <2e-16      -                 -           
## North Region      <2e-16         <2e-16      <2e-16            -           
## West Region       <2e-16         0.0029      <2e-16            <2e-16      
## 
## P value adjustment method: BH


The result shows that p-value is less than 0.05, thus we reject null hypothesis and can conclude that unit price is not same across planning region.

Further the pairwise comparison shows that there is no planning region’s unit price is statistically same with others as all p-value are less than 0.05.

2) Unit price ~ Property Type
H0: Unit price is same across all property type.
H1: Unit price is not same across all property type.

##### Unit price ~ Property type #####
kruskal.test(`Unit Price ($ psm)` ~ `Property Type`, data = data_filtered)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Unit Price ($ psm) by Property Type
## Kruskal-Wallis chi-squared = 3573.9, df = 4, p-value < 2.2e-16
pairwise.wilcox.test(data_filtered$`Unit Price ($ psm)`, data_filtered$`Property Type`,
                     p.adjust.method = "BH")
## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  data_filtered$`Unit Price ($ psm)` and data_filtered$`Property Type` 
## 
##                       Apartment Condominium Executive Condominium
## Condominium           < 2e-16   -           -                    
## Executive Condominium < 2e-16   < 2e-16     -                    
## Semi-Detached House   1.1e-10   0.57056     < 2e-16              
## Terrace House         1.8e-13   1.7e-09     < 2e-16              
##                       Semi-Detached House
## Condominium           -                  
## Executive Condominium -                  
## Semi-Detached House   -                  
## Terrace House         0.00065            
## 
## P value adjustment method: BH


The result shows that p-value is less than 0.05, thus we reject null hypothesis and can conclude that unit price is not same across property type.

Further the pairwise comparison shows that there is no one type’s unit price is statistically same with others as all p-value are less than 0.05.

3) Unit price ~ Property Age Bin
H0: Unit price is same across all property age bin.
H1: Unit price is not same across all property age bin.

##### Unit price ~ Property age bin #####
kruskal.test(`Unit Price ($ psm)` ~ `age_binned`, data = data_filtered)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Unit Price ($ psm) by age_binned
## Kruskal-Wallis chi-squared = 4678.3, df = 3, p-value < 2.2e-16
pairwise.wilcox.test(data_filtered$`Unit Price ($ psm)`, data_filtered$`age_binned`,
                     p.adjust.method = "BH")
## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  data_filtered$`Unit Price ($ psm)` and data_filtered$age_binned 
## 
##       <5      5-10    10-20  
## 5-10  5.8e-11 -       -      
## 10-20 < 2e-16 < 2e-16 -      
## >20   < 2e-16 < 2e-16 < 2e-16
## 
## P value adjustment method: BH


The result shows that p-value is less than 0.05, thus we reject null hypothesis and can conclude that unit price is not same across age bin.

Further the pairwise comparison shows that there is no one age bin has statistically same unit price with others as all p-value are less than 0.05.

4) Unit price ~ Sale year
H0: Unit price is same across all property sale_year.
H1: Unit price is not same across all property sale_year.

##### Unit price ~ sale_year #####
data_filtered$sale_year <- as.factor(data_filtered$sale_year)
kruskal.test(`Unit Price ($ psm)` ~ `sale_year`, data = data_filtered)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Unit Price ($ psm) by sale_year
## Kruskal-Wallis chi-squared = 66.086, df = 2, p-value = 4.464e-15
pairwise.wilcox.test(data_filtered$`Unit Price ($ psm)`, data_filtered$`sale_year`,
                     p.adjust.method = "BH")
## 
##  Pairwise comparisons using Wilcoxon rank sum test 
## 
## data:  data_filtered$`Unit Price ($ psm)` and data_filtered$sale_year 
## 
##      2017    2018
## 2018 1.8e-11 -   
## 2019 5.7e-12 0.36
## 
## P value adjustment method: BH


The result shows that p-value is less than 0.05, thus we reject null hypothesis and can conclude that unit price is not same across property sale year.

Further the pairwise comparison shows that unit price of 2017 is statistically different with unit price of 2018 and 2019. However, unit price of 2018 and 2019 is statistically same as p-value is 0.36 which is higher than 0.05.

0.3 Final data visulization

To summarize data visulization, below is two main set of graph - transaction volume and transaction unit price.

1) Transaction volume


2) Transaction unit price


3) Description and observations

It can be noticed that number of transactions increased from 400 units in Jan 2017 to around 1,500 in Apr 2018, followed by sharp decrease from May 2018 to Jul 2018. After that the transaction volume is fluctuated between 600 to 800 units per month. If look through each property type, it can notice that the fluctuation is mainly caused by apartment and condominium. Executive condominium, semi-detached house and terrace house do not have much changes.

Further, it can notice that one-third of transaction are condominium and one-fourth are apartment. Nearly half of the transactions are property located in central area (46%), followed by east region (20%) and north-east region (16%).

Based on the historical transaction data from 2017 to 2019, we can observe that unit price is various by different planning region, property type and property age when sold. From the box plot and violin plot built, we can notice that:

  1. Property in north region have lowest mean and median unit price compared with others.
  2. Unit price in east and north region are right-skewed with long tail of higher price. While unit price in the other regions are roughly symmetric.
  3. Unit price for executive condominium has the smallest standard deviation and lowest mean and median value.
  4. Compared with condominium, apartment have higher unit price on general.
  5. It cannot say that the order the property, the lower the price. On average, property with age between 10 to 20 years has lowest unit price. When the property is more than 20 years old, the unit price might increase.
  6. From the graph, it seems unit price in 2017, 2018 and 2019 are almost same. However, based on the statistic test performed, it shows that unit price in 2017 does not statistically same with price in 2018 and 2019.