In Singapore, residential properties fall into two board categories: public housing (HDB flats) and private housing. There are numbers of differences between these two types. One of them is the eligibility of buying HDB flat. According to HDB website, applicants must meet various eligibility requirements, including but limited to nationality, income ceiling, property ownership. Even HDB resale flat also have quite a lot of requirements.
Compared to public housing, buying private housing do not have many restrictions. As the result, although the performance of private and public housing interacts with each other, they do have different traits.
Strait Times news “Singapore private home prices inch up 2.7% for 2019” on 24 Jan 2020 link states that “The private housing market went in two directions last year - while prices inched north, the number of sales took a sharp turn south.” In this assignment, we will focus on Singapore private housing resale market, analysing the resale price and volume performance.
The dataset covers all resale transactions from 1 Jan 2017 to 31 Dec 2019, which is downloaded from Realis database. It contains 20 columns with 37,694 rows. Below table lists the column names and description of the variables.
| Column_names | Descriptions |
|---|---|
| Project Name | Name of the property project |
| Address | Detail address of the property |
| No. of Units | Number of units sold in the transaction |
| Area (sqm) | Property area in square meter |
| Type of Area | Type of the area, landed or Strata |
| Transacted Price | Total transaction price |
| Nett Price | Null value |
| Unit Price ($ psm) | Unit price per square meter |
| Unit Price ($ psf) | Unit price per square feet |
| Sale Date | Date of the transaction (from 1 Jan 17 to 31 Dec 19) |
| Property Type | Type of the property, condominium, apartment, EC, terrace house or semi-detached house |
| Tenure | Tenure of the property, 99 years or freehold |
| Completion Date | Year of the property built |
| Type of Sale | Type of sale, here is all resale |
| Purchaser Address Indicator | The purchaser stays in HDB or condo or apartment etc |
| Postal District | Post code of the property |
| Postal Sector | Post code of the property |
| Postal Code | Post code of the property |
| Planning Region | Property location, by planning region, e.g. Central area |
| Planning Area | Property location, by planning area, e.g. Geylang |
This assignment will focus on Singapore private housing resale market from 1 Jan 2017 to 31 Dec 2019 (3 years), and analysis the trend of transaction price and transaction volume.
| Problem_faced | Description | Propsed_solution |
|---|---|---|
| Variable type | Noted that some variables are not in correct data type, e.g. “Sale Date” should be in date format instead of character. | To use proper function, such as as.numeric, to covert the variables into correct data type. |
| Outlier | It is noted that the dataset contains some abnormal transactions. For example, one single transaction had 560 units, or total transaction price was as high as 980 million. These transactions are not reflectable. | To exclude outliers based on 3IQR method. Refer to Wikipedia link, outlier is defined as observations that fall below Q1 - 1.5 IQR or above Q3 + 1.5 IQR. |
| Missing value | Notice variables - “Nett Price”, “Completion Date” - contain missing value. | To exclude missing value. |
| New variables to be derived | In original data set, some variables are not useful for the analysis, such as “Project Name”, “Address”. Further, in order to perform certain analysis, some new variables need to be derived. | To derive new variables: sale_month; sale_year; Property_age_when_sold |
The data set is in csv format, we will use read_csv to import it.
data <- read_csv('realis_resale_17_19.csv')
g_unit <- ggplot(data, aes(x= `No. of Units`)) +
geom_histogram(bins=40, color="black", fill="light blue")
g_area <- ggplot(data, aes(x= `Area (sqm)`)) +
geom_histogram(bins=40, color="black", fill="light blue")
g_price <- ggplot(data, aes(x= `Transacted Price ($)`)) +
geom_histogram(bins=40, color="black", fill="light blue")
g_unitprice <- ggplot(data, aes(x= `Unit Price ($ psm)`)) +
geom_histogram(bins=40, color="black", fill="light blue")
ggarrange(g_unit, g_area, g_price, g_unitprice, ncol = 2, nrow = 2)
##### Glimpse the data set #####
glimpse(data)
## Rows: 37,693
## Columns: 20
## $ `Project Name` <chr> "SOUTH BEACH RESIDENCES", "SHELFORD 2...
## $ Address <chr> "28 Beach Road #26-06", "23A Shelfor...
## $ `No. of Units` <dbl> 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1...
## $ `Area (sqm)` <dbl> 162, 155, 105, 117, 61, 788, 142, 204...
## $ `Type of Area` <chr> "Strata", "Strata", "Strata", "Strata...
## $ `Transacted Price ($)` <dbl> 5953500, 2250000, 1430000, 968000, 11...
## $ `Nett Price($)` <chr> "-", "-", "-", "-", "-", "-", "-", "-...
## $ `Unit Price ($ psm)` <dbl> 36750, 14516, 13619, 8274, 19344, 117...
## $ `Unit Price ($ psf)` <dbl> 3414, 1349, 1265, 769, 1797, 1094, 88...
## $ `Sale Date` <chr> "30-Jun-19", "28-Jun-19", "28-Jun-19"...
## $ `Property Type` <chr> "Apartment", "Apartment", "Apartment"...
## $ Tenure <chr> "99 Yrs From 10/12/2007", "Freehold",...
## $ `Completion Date` <chr> "2016", "2011", "2001", "2000", "2008...
## $ `Type of Sale` <chr> "Resale", "Resale", "Resale", "Resale...
## $ `Purchaser Address Indicator` <chr> "N.A", "Private", "Private", "HDB", "...
## $ `Postal District` <dbl> 7, 11, 7, 19, 1, 9, 15, 23, 8, 15, 19...
## $ `Postal Sector` <dbl> 18, 28, 18, 54, 1, 22, 42, 65, 21, 43...
## $ `Postal Code` <dbl> 189762, 286655, 187948, 545119, 18987...
## $ `Planning Region` <chr> "Central Region", "Central Region", "...
## $ `Planning Area` <chr> "Downtown Core", "Bukit Timah", "Roch...
##### View summary statistics #####
summary(data)
## Project Name Address No. of Units Area (sqm)
## Length:37693 Length:37693 Min. : 1.000 Min. : 24.0
## Class :character Class :character 1st Qu.: 1.000 1st Qu.: 89.0
## Mode :character Mode :character Median : 1.000 Median : 116.0
## Mean : 1.166 Mean : 172.4
## 3rd Qu.: 1.000 3rd Qu.: 158.0
## Max. :560.000 Max. :87986.0
## Type of Area Transacted Price ($) Nett Price($) Unit Price ($ psm)
## Length:37693 Min. : 400000 Length:37693 Min. : 1503
## Class :character 1st Qu.: 1013000 Class :character 1st Qu.:10174
## Mode :character Median : 1420000 Mode :character Median :12754
## Mean : 2505211 Mean :14018
## 3rd Qu.: 2250000 3rd Qu.:16503
## Max. :980000000 Max. :54363
## Unit Price ($ psf) Sale Date Property Type Tenure
## Min. : 140 Length:37693 Length:37693 Length:37693
## 1st Qu.: 945 Class :character Class :character Class :character
## Median :1185 Mode :character Mode :character Mode :character
## Mean :1302
## 3rd Qu.:1533
## Max. :5050
## Completion Date Type of Sale Purchaser Address Indicator
## Length:37693 Length:37693 Length:37693
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Postal District Postal Sector Postal Code Planning Region
## Min. : 1.00 Min. : 1.00 Min. : 18965 Length:37693
## 1st Qu.:10.00 1st Qu.:26.00 1st Qu.:268157 Class :character
## Median :15.00 Median :44.00 Median :449305 Mode :character
## Mean :15.17 Mean :43.13 Mean :439412
## 3rd Qu.:19.00 3rd Qu.:56.00 3rd Qu.:567746
## Max. :28.00 Max. :82.00 Max. :829750
## Planning Area
## Length:37693
## Class :character
## Mode :character
##
##
##
From the summary and respective plots above, it can notice that “No. of units”, “Area” and “Transaction price” is highly right-skewed, with maximum number of 560, 87,986 and 98mil respectively. The extrem high amount can be treat as outliers and should be removed. Further variables “Sale Date” and “Completion Date” are in character type, which is not correct based on the nature of these variables.
As mentioned in section 0.1.3, 3 IQR method will be used to identify outliers. The outliers are identified based on 3 variables - “No. of units”, “Area” and “Transacted Price”.
First IQR value for each variables and upper limit (Q3 +1.5IQR) will be calculated. Then filter function will be applied to exclude those higher than upper limit calculated.
##### Calculate IQR and upper limit for outliers #####
IQR_unit_price = IQR(data$`Unit Price ($ psm)`)
IQR_area = IQR(data$`Area (sqm)`)
IQR_price = IQR(data$`Transacted Price ($)`)
unit_price_upper = quantile(data$`Unit Price ($ psm)`,probs = 0.75)+1.5*IQR_unit_price
area_upper = quantile(data$`Area (sqm)`,probs = 0.75)+1.5*IQR_area
price_upper = quantile(data$`Transacted Price ($)`,probs = 0.75)+1.5*IQR_price
##### Filter out outliers #####
data_filtered <- data %>%
filter(`No. of Units`==1) %>%
filter(`Completion Date`!= 'Unknown')%>%
filter((`Area (sqm)`<=area_upper) &
(`Transacted Price ($)`<=price_upper) &
(`Unit Price ($ psm)`<=unit_price_upper))
Review of summary statistics again
summary(data_filtered)
## Project Name Address No. of Units Area (sqm)
## Length:31243 Length:31243 Min. :1 Min. : 24.0
## Class :character Class :character 1st Qu.:1 1st Qu.: 85.0
## Mode :character Mode :character Median :1 Median :111.0
## Mean :1 Mean :113.7
## 3rd Qu.:1 3rd Qu.:135.0
## Max. :1 Max. :261.0
## Type of Area Transacted Price ($) Nett Price($) Unit Price ($ psm)
## Length:31243 Min. : 400000 Length:31243 Min. : 4391
## Class :character 1st Qu.: 956944 Class :character 1st Qu.:10120
## Mode :character Median :1280000 Mode :character Median :12500
## Mean :1469333 Mean :13271
## 3rd Qu.:1780000 3rd Qu.:15876
## Max. :4100888 Max. :25982
## Unit Price ($ psf) Sale Date Property Type Tenure
## Min. : 408 Length:31243 Length:31243 Length:31243
## 1st Qu.: 940 Class :character Class :character Class :character
## Median :1161 Mode :character Mode :character Mode :character
## Mean :1233
## 3rd Qu.:1475
## Max. :2414
## Completion Date Type of Sale Purchaser Address Indicator
## Length:31243 Length:31243 Length:31243
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Postal District Postal Sector Postal Code Planning Region
## Min. : 1.00 Min. : 1.0 Min. : 18965 Length:31243
## 1st Qu.:11.00 1st Qu.:29.0 1st Qu.:298090 Class :character
## Median :16.00 Median :46.0 Median :465946 Mode :character
## Mean :15.42 Mean :44.1 Mean :449023
## 3rd Qu.:19.00 3rd Qu.:57.0 3rd Qu.:573910
## Max. :28.00 Max. :82.0 Max. :828843
## Planning Area
## Length:31243
## Class :character
## Mode :character
##
##
##
Plot histogram again
g_area_1 <- ggplot(data_filtered, aes(x= `Area (sqm)`)) +
geom_histogram(bins=40, color="black", fill="light blue")
g_price_1 <- ggplot(data_filtered, aes(x= `Transacted Price ($)`)) +
geom_histogram(bins=40, color="black", fill="light blue")
g_unitprice_1 <- ggplot(data_filtered, aes(x= `Unit Price ($ psm)`)) +
geom_histogram(bins=40, color="black", fill="light blue")
ggarrange(g_area_1, g_price_1, g_unitprice_1, ncol = 2, nrow = 2)
From the distribution of these continuous variables, it can be notice that the spread of these variables (standard deviation) is smaller and it becomes closer to normal distribution.
As mentioned, variables “Sale Date” and “Completion Date” are not in correct data type. They should be converted into date type and numeric type respectively.
In addition, in order to perform analysis, several new variables - “sale_year”, “sale_month”, “age_when_sold” will be derived. Variable “age_when_sold” will further be binned into four levels for easier analysis.
##### Convert "sale date" data type and Extract sales month and year #####
data_filtered<- data_filtered %>%
mutate(sale_date = as.POSIXct(`Sale Date`, format = '%d-%b-%y'))%>%
mutate(sale_month = month(sale_date)) %>%
mutate(sale_year = year(sale_date)) %>%
mutate(sale_period = paste(sale_year, sale_month, sep = '-'))
##### Convert "Completion Date" data type and Extract age_when_sold and bin it######
data_filtered <- data_filtered %>%
mutate(age_when_sold = sale_year - as.numeric(`Completion Date`)) %>%
mutate(age_binned = cut(age_when_sold, breaks = c(-Inf, 5, 10, 20, Inf), labels = c('<5', '5-10', '10-20', '>20')))
In order to make the graph neater, we remove “Region” after each planning region, e.g. change “Central Region” to “Central”.
##### Edit text #####
data_filtered <- data_filtered %>%
mutate(planning_region = str_replace(`Planning Region`, " Region",""))
We will explore the data from two major dimensions - Volume and Unit price. For standardise purpose, only “Unit price (psm)” will be used.
This section will explore the resale volume along with sale date. In addition, we will use pie chart to see the proportion distribution by property type and planning region.
In order to build graph mentioned above, the data set need to be summarized by number of units as a whole, by property type and by planning region respectively. Further, ggplot2 package will be used to build the graphs. After that, all graph will be consolidated in one big graph through ggarrange.
##### No. of unit sold ~ sale date #####
df2 <- data_filtered %>%
group_by(sale_year, sale_month) %>%
summarize(no_units = sum(`No. of Units`))%>%
mutate(sale_period = paste(sale_year, sale_month, sep = '-'))
df2 <- rowid_to_column(df2, 'id')
avg=round(mean(df2$no_units))
g1<- ggplot(df2, aes(x = reorder(sale_period,id), y=no_units))+
geom_line(group = 1)+
geom_hline(aes(yintercept = avg), color = 'red')+
geom_text(aes(34,avg,label = paste('Mean: ',avg), vjust = -1, label.size = 0.15))+
xlab('Period')+
ylab('No of unit sold')+
labs(fill = 'Year')+
coord_cartesian(ylim = c(300,1600))+
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, size = 7),
axis.text.y = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 9))
##### No. of unit sold ~ sale date (by property type) #####
df7 <- data_filtered %>%
group_by(sale_year, sale_month, `Property Type`) %>%
summarize(no_units = sum(`No. of Units`))%>%
mutate(sale_period = paste(sale_year, sale_month, sep = '-'))
df7_1 <- df7 %>%
filter(`Property Type` == 'Apartment')
df7_1 <- rowid_to_column(df7_1, 'id')
df7_2 <- df7 %>%
filter(`Property Type` == 'Condominium')
df7_2 <- rowid_to_column(df7_2, 'id')
df7_3 <- df7 %>%
filter(`Property Type` == 'Executive Condominium')
df7_3 <- rowid_to_column(df7_3, 'id')
df7_4 <- df7 %>%
filter(`Property Type` == 'Semi-Detached House')
df7_4 <- rowid_to_column(df7_4, 'id')
df7_5 <- df7 %>%
filter(`Property Type` == 'Terrace House')
df7_5 <- rowid_to_column(df7_5, 'id')
g10<- ggplot()+
geom_line(data = df7_1, aes(x = reorder(sale_period, id), y=no_units, color = `Property Type`),group = 1)+
geom_line(data = df7_2, aes(x = reorder(sale_period, id), y=no_units, color = `Property Type`),group = 1)+
geom_line(data = df7_3, aes(x = reorder(sale_period, id), y=no_units, color = `Property Type`),group = 1)+
geom_line(data = df7_4, aes(x = reorder(sale_period, id), y=no_units, color = `Property Type`),group = 1)+
geom_line(data = df7_5, aes(x = reorder(sale_period, id), y=no_units, color = `Property Type`),group = 1)+
xlab('Period')+
ylab('No of unit sold')+
coord_cartesian(ylim = c(0,1100))+
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, size = 7),
axis.text.y = element_text(size = 8),
axis.title.y = element_text(size = 9),
axis.title.x = element_blank(),
legend.text = element_text(size = 7),
legend.position=c(0.5,0.85),
legend.direction = "horizontal",
legend.title = element_blank())
##### No. of unit sold ~ Planning region #####
blank_theme <- theme_minimal()+
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid=element_blank(),
axis.ticks = element_blank(),
plot.title=element_text(size=9,face = 'bold'),
legend.text = element_text(size=7),
legend.title = element_blank()
)
df4<-data_filtered %>%
group_by(`Planning Region`) %>%
summarise(no_unit = n())
g2<- ggplot(df4, aes(x="", y=no_unit, fill = `Planning Region`))+
geom_bar(width = 1, stat = 'identity')+
ggtitle('By Planning Region')+
coord_polar("y",start = 0)+
scale_fill_manual(values=c("#CCE5FF", "#E5FFCC", "#EEE8AA", "#E0FFFF","#E6E6FA")) +
blank_theme +
theme(axis.text.x=element_blank())+
geom_text(aes(label = scales::percent(no_unit/sum(no_unit))), position = position_stack(vjust = 0.5), size=2.5)
##### No. of unit sold ~ Property Type #####
df5<-data_filtered %>%
group_by(`Property Type`) %>%
summarise(no_unit = n()) %>%
mutate(grp = c('Apartment','Condominium','Others','Others','Others'))
df6 <- aggregate(df5$no_unit, by = list(df5$`grp`),FUN = sum)
g3<-ggplot(df6, aes(x="", y=x, fill = Group.1))+
geom_bar(width = 1, stat = 'identity')+
labs(fill = 'Property Type')+
ggtitle('By Property Type')+
coord_polar("y",start = 0)+
scale_fill_manual(values=c("#FFDAB9", "#E6E6FA", "#DCDCDC")) +
blank_theme +
theme(axis.text.x=element_blank())+
geom_text(aes(label = scales::percent(x/sum(x))), position = position_stack(vjust = 0.5), size=2.5)
##### Arrange three plots togehter #####
figure2 <- ggarrange(g1,g10,
ggarrange(g2,g3, ncol = 2), nrow=3, heights = c(2.5, 3, 3))
annotate_figure(figure2,
top = text_grob('Number of unit sold',
face = 'bold', size = 13),
bottom = text_grob('Data is extracted from Realis database\n from 1 Jan 2017 to 31 Dec 2019',
face = 'italic', color = '#404040', size = 7, hjust = 1, x=1 ))
In this section, we will explore the resale unit price by several factors, such as property type, planning region.
First, 4 graphs will be created (unit price by planning region, by property type, by property age and by sale year) through ggplot2 function. Then these graphs will use ggarrange to consolidate into one graph.
##### Unit price ~ Planning region #####
g4<- ggplot(data_filtered, aes(x=planning_region, y = `Unit Price ($ psm)`))+
geom_boxplot()+
geom_violin(color = '#FFCCCC', fill ='#FFCCCC', alpha = 0.5 )+
stat_summary(geom = 'point', fun = 'mean', color='red')+
ggtitle('by Planning Region')+
xlab('Planning Region')+
ylab('Unit price($psm)')+
theme(plot.title = element_text(size = 10),
axis.text.x = element_text(vjust = 0.5, size = 8),
axis.text.y = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 9),
panel.background = element_rect(fill='#F6F6F6'),
panel.grid.major = element_line(size = 0.25, linetype = 'solid', colour = 'white'),
panel.grid.minor = element_line(size = 0.25, linetype = 'dashed', colour = 'white'))
##### Unit price ~ Property type #####
g5<- ggplot(data_filtered, aes(x=`Property Type`, y = `Unit Price ($ psm)`))+
geom_boxplot()+
geom_violin(color = '#FFCCCC', fill ='#FFCCCC', alpha = 0.5 )+
stat_summary(geom = 'point', fun = 'mean', color='red')+
ggtitle('by Property type')+
xlab('Property type')+
scale_x_discrete(labels = c('Apartment', 'Condo', 'EC', 'Semi-Detached\nHouse', 'Terrace\nHouse'))+
theme(plot.title = element_text(size = 10),
panel.background = element_rect(fill='#F6F6F6'),
axis.text.x = element_text(vjust = 0.5, size = 7),
axis.text.y = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 9),
panel.grid.major = element_line(size = 0.25, linetype = 'solid', colour = 'white'),
panel.grid.minor = element_line(size = 0.25, linetype = 'dashed', colour = 'white'))
##### Unit price ~ age_when_sold #####
g6<- ggplot(data_filtered, aes(x=`age_binned`, y = `Unit Price ($ psm)`))+
geom_boxplot()+
geom_violin(color = '#FFCCCC', fill ='#FFCCCC', alpha = 0.5 )+
stat_summary(geom = 'point', fun = 'mean', color='red')+
ggtitle('by Property age')+
xlab('Property age')+
ylab('Unit price($psm)')+
theme(plot.title = element_text(size = 10),
axis.text.y = element_text(size = 8),
axis.text.x = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 9),
panel.background = element_rect(fill='#F6F6F6'),
panel.grid.major = element_line(size = 0.25, linetype = 'solid', colour = 'white'),
panel.grid.minor = element_line(size = 0.25, linetype = 'dashed', colour = 'white'))
##### Unit price ~ sale_year #####
g7<- ggplot(data_filtered, aes(x=as.factor(`sale_year`), y = `Unit Price ($ psm)`))+
geom_boxplot()+
geom_violin(color = '#FFCCCC', fill ='#FFCCCC', alpha = 0.5 )+
stat_summary(geom = 'point', fun = 'mean', color='red')+
ggtitle('by Sale Year')+
xlab('Year')+
ylab('Unit price($psm)')+
theme(plot.title = element_text(size = 10),
panel.background = element_rect(fill='#F6F6F6'),
axis.text.y = element_text(size = 8),
axis.text.x = element_text(size = 8),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 9),
panel.grid.major = element_line(size = 0.25, linetype = 'solid', colour = 'white'),
panel.grid.minor = element_line(size = 0.25, linetype = 'dashed', colour = 'white'))
##### Arrange plots #####
figure <- ggarrange(g4, g5, g6, g7, nrow = 2, ncol = 2,
labels = c('i)', 'ii)', 'iii)', 'iv)'),
font.label = list(size = 10, face = 'italic'))
annotate_figure(figure,
top = text_grob('Property unit price ($psm) by various factors',
face = 'bold', size = 13),
bottom = text_grob('Data is extracted from Realis database\nfrom 1 Jan 2017 to 31 Dec 2019',
face = 'italic', color = '#404040', size = 7, hjust = 1, x=1))
From the exploration performed above, it seems that unit price of resale property are different in each planning region. Also it is various across property type and property age. However, it seems that unit price does not differ a lot across 3 years.
To statistically prove this observation, ANOVA or Kruskal Wallis test will be performed based on whether data is normally distributed.
Thus, we will first test whether unit price is normally distributed. If yes, ANOVA test will be used. Otherwise, Kruskal Wallis test will be used.
To perform normality test, we will plot the unit price, and compare with normal distribution line. Then we will plot qq-plot and see whether the line is approximately in straight line bases.
##### Normal distribution test #####
m = mean(data_filtered$`Unit Price ($ psm)`)
std = sd(data_filtered$`Unit Price ($ psm)`)
g8<-ggplot(data_filtered, aes(`Unit Price ($ psm)`))+
geom_histogram(aes(y=..density..), fill = '#B0C4DE', color = '#808080')+
stat_function(fun = dnorm, args = list(mean = m, sd = std), col="dark blue", size = .8)+
ggtitle('Distribution of unit price')+
theme(plot.title = element_text(size = 10, face = 'bold'),
panel.background = element_rect(fill='#F6F6F6'),
panel.grid.major = element_line(size = 0.25, linetype = 'solid', colour = 'white'),
panel.grid.minor = element_line(size = 0.25, linetype = 'dashed', colour = 'white'),
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.x = element_text(size = 8),
axis.title.x = element_text(size = 9))
##### qq Plot #####
g9<-ggplot(data_filtered, aes(sample = `Unit Price ($ psm)`))+
stat_qq()+
stat_qq_line()+
ggtitle('qq Plot')+
theme(plot.title = element_text(size = 10, face = 'bold'),
panel.background = element_rect(fill='#F6F6F6'),
panel.grid.major = element_line(size = 0.25, linetype = 'solid', colour = 'white'),
panel.grid.minor = element_line(size = 0.25, linetype = 'dashed', colour = 'white'),
axis.text.y = element_blank(),
axis.title.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.x = element_blank(),
axis.title.x = element_blank(),
axis.ticks.x = element_blank())
##### Summary table #####
max = max(data_filtered$`Unit Price ($ psm)`)
min = min(data_filtered$`Unit Price ($ psm)`)
median = median(data_filtered$`Unit Price ($ psm)`)
mean = round(mean(data_filtered$`Unit Price ($ psm)`))
std = round(sd(data_filtered$`Unit Price ($ psm)`))
skw = round(e1071::skewness(data_filtered$`Unit Price ($ psm)`),2)
list = c('Maximum', max, 'Minimum', min, 'Median', median, 'Mean', mean, 'SD', std, 'Skewness', skw)
matrix = matrix(list, ncol=4, byrow = T)
tab <- ggtexttable(matrix,theme = ttheme("lBlue", base_size = 9))
tab<- tab %>%
tab_add_title('Statistics summary', face = 'bold', size = 10)
##### Arrange three plots (normality) #####
figure3 <- ggarrange(g8,
ggarrange(g9,tab, ncol = 2), nrow=2)
annotate_figure(figure3,
top = text_grob('Unit price normality visulization',
face = 'bold', size = 13),
bottom = text_grob('Data is extracted from Realis database\n from 1 Jan 2017 to 31 Dec 2019',
face = 'italic', color = '#404040', size = 7, hjust = 1, x=1))
Based on the comparison of distribution of unit price with normal distribution curve, we can find that the distribution is not very fit. QQ-plot and statistics summary table also show that the distribution seems is not normally distributed.
To further confirm the observation, shapiro.test will be performed. Do take note that due to limitation of shapiro.test which can only take maximum 5,000 data, we will randomly sample the dataset before perform the testing.
##### Shapiro test #####
df_nor<- sample(data_filtered$`Unit Price ($ psm)`, 5000)
shapiro.test(df_nor)
##
## Shapiro-Wilk normality test
##
## data: df_nor
## W = 0.95823, p-value < 2.2e-16
The shapiro test result shows that p-value is less than 0.05, which means unit price is not normally distributed. Thus Kruskal Wallis test will be used for further testing.
Kruskal Wallis test is performed to test the hypothesis as indicated below.
1) Unit price ~ Planning region
H0: Unit price is same across all planning region.
H1: Unit price is not same across all planning region.
##### Unit price ~ Planning region #####
kruskal.test(`Unit Price ($ psm)` ~ `Planning Region`, data = data_filtered)
##
## Kruskal-Wallis rank sum test
##
## data: Unit Price ($ psm) by Planning Region
## Kruskal-Wallis chi-squared = 12766, df = 4, p-value < 2.2e-16
pairwise.wilcox.test(data_filtered$`Unit Price ($ psm)`, data_filtered$`Planning Region`,
p.adjust.method = "BH")
##
## Pairwise comparisons using Wilcoxon rank sum test
##
## data: data_filtered$`Unit Price ($ psm)` and data_filtered$`Planning Region`
##
## Central Region East Region North East Region North Region
## East Region <2e-16 - - -
## North East Region <2e-16 <2e-16 - -
## North Region <2e-16 <2e-16 <2e-16 -
## West Region <2e-16 0.0029 <2e-16 <2e-16
##
## P value adjustment method: BH
The result shows that p-value is less than 0.05, thus we reject null hypothesis and can conclude that unit price is not same across planning region.
Further the pairwise comparison shows that there is no planning region’s unit price is statistically same with others as all p-value are less than 0.05.
2) Unit price ~ Property Type
H0: Unit price is same across all property type.
H1: Unit price is not same across all property type.
##### Unit price ~ Property type #####
kruskal.test(`Unit Price ($ psm)` ~ `Property Type`, data = data_filtered)
##
## Kruskal-Wallis rank sum test
##
## data: Unit Price ($ psm) by Property Type
## Kruskal-Wallis chi-squared = 3573.9, df = 4, p-value < 2.2e-16
pairwise.wilcox.test(data_filtered$`Unit Price ($ psm)`, data_filtered$`Property Type`,
p.adjust.method = "BH")
##
## Pairwise comparisons using Wilcoxon rank sum test
##
## data: data_filtered$`Unit Price ($ psm)` and data_filtered$`Property Type`
##
## Apartment Condominium Executive Condominium
## Condominium < 2e-16 - -
## Executive Condominium < 2e-16 < 2e-16 -
## Semi-Detached House 1.1e-10 0.57056 < 2e-16
## Terrace House 1.8e-13 1.7e-09 < 2e-16
## Semi-Detached House
## Condominium -
## Executive Condominium -
## Semi-Detached House -
## Terrace House 0.00065
##
## P value adjustment method: BH
The result shows that p-value is less than 0.05, thus we reject null hypothesis and can conclude that unit price is not same across property type.
Further the pairwise comparison shows that there is no one type’s unit price is statistically same with others as all p-value are less than 0.05.
3) Unit price ~ Property Age Bin
H0: Unit price is same across all property age bin.
H1: Unit price is not same across all property age bin.
##### Unit price ~ Property age bin #####
kruskal.test(`Unit Price ($ psm)` ~ `age_binned`, data = data_filtered)
##
## Kruskal-Wallis rank sum test
##
## data: Unit Price ($ psm) by age_binned
## Kruskal-Wallis chi-squared = 4678.3, df = 3, p-value < 2.2e-16
pairwise.wilcox.test(data_filtered$`Unit Price ($ psm)`, data_filtered$`age_binned`,
p.adjust.method = "BH")
##
## Pairwise comparisons using Wilcoxon rank sum test
##
## data: data_filtered$`Unit Price ($ psm)` and data_filtered$age_binned
##
## <5 5-10 10-20
## 5-10 5.8e-11 - -
## 10-20 < 2e-16 < 2e-16 -
## >20 < 2e-16 < 2e-16 < 2e-16
##
## P value adjustment method: BH
The result shows that p-value is less than 0.05, thus we reject null hypothesis and can conclude that unit price is not same across age bin.
Further the pairwise comparison shows that there is no one age bin has statistically same unit price with others as all p-value are less than 0.05.
4) Unit price ~ Sale year
H0: Unit price is same across all property sale_year.
H1: Unit price is not same across all property sale_year.
##### Unit price ~ sale_year #####
data_filtered$sale_year <- as.factor(data_filtered$sale_year)
kruskal.test(`Unit Price ($ psm)` ~ `sale_year`, data = data_filtered)
##
## Kruskal-Wallis rank sum test
##
## data: Unit Price ($ psm) by sale_year
## Kruskal-Wallis chi-squared = 66.086, df = 2, p-value = 4.464e-15
pairwise.wilcox.test(data_filtered$`Unit Price ($ psm)`, data_filtered$`sale_year`,
p.adjust.method = "BH")
##
## Pairwise comparisons using Wilcoxon rank sum test
##
## data: data_filtered$`Unit Price ($ psm)` and data_filtered$sale_year
##
## 2017 2018
## 2018 1.8e-11 -
## 2019 5.7e-12 0.36
##
## P value adjustment method: BH
The result shows that p-value is less than 0.05, thus we reject null hypothesis and can conclude that unit price is not same across property sale year.
Further the pairwise comparison shows that unit price of 2017 is statistically different with unit price of 2018 and 2019. However, unit price of 2018 and 2019 is statistically same as p-value is 0.36 which is higher than 0.05.
To summarize data visulization, below is two main set of graph - transaction volume and transaction unit price.
1) Transaction volume
2) Transaction unit price
3) Description and observations
It can be noticed that number of transactions increased from 400 units in Jan 2017 to around 1,500 in Apr 2018, followed by sharp decrease from May 2018 to Jul 2018. After that the transaction volume is fluctuated between 600 to 800 units per month. If look through each property type, it can notice that the fluctuation is mainly caused by apartment and condominium. Executive condominium, semi-detached house and terrace house do not have much changes.
Further, it can notice that one-third of transaction are condominium and one-fourth are apartment. Nearly half of the transactions are property located in central area (46%), followed by east region (20%) and north-east region (16%).
Based on the historical transaction data from 2017 to 2019, we can observe that unit price is various by different planning region, property type and property age when sold. From the box plot and violin plot built, we can notice that: