This report aims to visualise the geographical distribution of Singapore’s median property prices and resident’s median income by planning area.
The purpose of the visualisation is to understand whether there is any relationship between the median income and median property prices for each planning area. In addition, we will also try to find out if there are any planning areas that are different from the others.
The first challenge is with regard to the availability of dataset. We were not able to find datasets on household income by planning areas. However we were able to find working persons aged 15 years and over by planning area and gross monthly income from work, 2015. Please refer to the following link: https://data.gov.sg/dataset/resident-working-persons-aged-15-years-over-by-planning-area-gross-monthly-income-from-work-2015?resource_id=e4c209d3-4a07-426a-baeb-a7026d09241c. We will use this dataset to derive the median income, as a proxy of how affluent each planning area is. We will also need to perform substantial data transformation to attain the median income for each planning area.
We will incorporate the Real Estate Information System (REALIS) dataset for all private property transactions in 2018. The REALIS dataset contains a wide range of information on the Singapore property market, and it is only available from Urban Redevelopment Authority (URA) on subscription.
We will also incorporate the HDB (public housing) resale dataset, and only filter for transactions in 2018. Please refer to the following link: https://data.gov.sg/dataset/resale-flat-prices?resource_id=42ff9cfe-abe5-4b54-beda-c88f9bb438ee.We will combine with with the REALIS dataset to compute the average property prices for each planning area.
We noted that the period of the income dataset is for 2015, and the period of the Realis dataset is for 2018, however we will proceed to study these 2 datasets together as they are the closest that we can find for our use case.
For the planning area boundary of Singapore, we will download the shape file from the following data.gov.sg link: https://data.gov.sg/dataset/master-plan-2014-planning-area-boundary-web?resource_id=2ab23cb2-b1a4-4b1a-a9e1-b9cad0ac159b. The file was last updated on 5 December 2014.
Please see the proposed sketch design. We will create an interactive map, and visualise the median property prices by the shading of each polygon. We will visualise the median income of each planning area by the shading and size of each bubble. We will also have a tooltip which will display additional information about each polygon.
We will also create the same map a second time, but replace median property income with median price per square foot (PSF).
packages = c( 'sf', 'tmap', 'tidyverse', 'readxl', 'knitr', 'ggplot2', 'scales')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
sf
is a package to encode spatial vector data.tmap
is used build thematic map with interaction.tidyverse
is a set of packages to perform data wrangling and exploration.readxl
is used to read excel files into R.knitr
is used to enable integration of certain R codes into HTML.ggplot2
is a data visualisation package for visualisation and statistical programming language R.scales
is used to provide methods to determine breaks and labels for axes and legends.First we will load the planning area boundary shape file.
## Reading layer `MP14_PLNG_AREA_WEB_PL' from data source `C:\Users\User\Desktop\SMU Term 2\4. Visual G2\A5. Assignment 5 20200807 (Submitted)\ISSS608-Assignment5-KhooWenYong-v3\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 55 features and 12 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## projected CRS: SVY21
## Simple feature collection with 55 features and 12 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## projected CRS: SVY21
## First 10 features:
## OBJECTID PLN_AREA_N PLN_AREA_C CA_IND REGION_N REGION_C
## 1 1 BISHAN BS N CENTRAL REGION CR
## 2 2 BUKIT BATOK BK N WEST REGION WR
## 3 3 BUKIT MERAH BM N CENTRAL REGION CR
## 4 4 BUKIT PANJANG BP N WEST REGION WR
## 5 5 BUKIT TIMAH BT N CENTRAL REGION CR
## 6 6 CENTRAL WATER CATCHMENT CC N NORTH REGION NR
## 7 7 CHANGI CH N EAST REGION ER
## 8 8 CHANGI BAY CB N EAST REGION ER
## 9 9 CHOA CHU KANG CK N WEST REGION WR
## 10 10 CLEMENTI CL N WEST REGION WR
## INC_CRC FMEL_UPD_D X_ADDR Y_ADDR SHAPE_Leng SHAPE_Area
## 1 BA616285F402846F 2014-12-05 28789.76 37450.89 13517.12 7618921
## 2 FB44C870B04B7F57 2014-12-05 19255.42 37527.65 15234.22 11133256
## 3 738B479882E4EE28 2014-12-05 26865.78 28662.87 29156.29 14462472
## 4 4A9C6E6BAF7BE998 2014-12-05 21287.04 38761.84 15891.85 9019940
## 5 C893AEAD20F42559 2014-12-05 23256.76 34689.00 22492.84 17526654
## 6 52D0068508B0348A 2014-12-05 24424.42 39849.05 30538.25 37147854
## 7 70EAFFC7C76EA5AD 2014-12-05 46307.48 36991.43 31342.12 40940490
## 8 ACFD43D88C72381C 2014-12-05 49502.49 34316.54 18731.56 1829822
## 9 F361929FAF5E9611 2014-12-05 18415.05 40833.42 11518.81 6117294
## 10 8A5BEDB748EAA0AA 2014-12-05 19923.36 33319.17 15024.87 9516228
## geometry
## 1 MULTIPOLYGON (((29772.19 38...
## 2 MULTIPOLYGON (((20294.46 39...
## 3 MULTIPOLYGON (((26228.63 30...
## 4 MULTIPOLYGON (((21448.72 41...
## 5 MULTIPOLYGON (((24031.39 36...
## 6 MULTIPOLYGON (((25073.29 43...
## 7 MULTIPOLYGON (((44586.3 417...
## 8 MULTIPOLYGON (((48860.11 34...
## 9 MULTIPOLYGON (((18349.06 43...
## 10 MULTIPOLYGON (((19680.06 31...
We will perform a quick plot to ensure that the map works well.
Next we will load the income dataset.
## Parsed with column specification:
## cols(
## year = col_double(),
## level_1 = col_character(),
## level_2 = col_character(),
## level_3 = col_character(),
## value = col_double()
## )
We will select 5 random rows to inspect the dataset.
check <- planningarea_income[sample(nrow(planningarea_income), 5), ]
kable(check, format = "markdown", align='l')
year | level_1 | level_2 | level_3 | value |
---|---|---|---|---|
2015 | $3,000 - $3,999 | Total | Bukit Timah | 2.1 |
2015 | $8,000 - $8,999 | Total | Others | 0.9 |
2015 | $2,500 - $2,999 | Total | Punggol | 3.5 |
2015 | $3,000 - $3,999 | Total | Yishun | 18.0 |
2015 | Below $1,000 | Total | Jurong East | 4.3 |
We will load the REALIS dataset.
## Parsed with column specification:
## cols(
## `Project Name` = col_character(),
## Address = col_character(),
## `No. of Units` = col_double(),
## `Area (sqm)` = col_double(),
## `Type of Area` = col_character(),
## `Transacted Price ($)` = col_double(),
## `Unit Price ($ psm)` = col_double(),
## `Unit Price ($ psf)` = col_double(),
## `Sale Date` = col_character(),
## `Property Type` = col_character(),
## Tenure = col_character(),
## `Type of Sale` = col_character(),
## `Purchaser Address Indicator` = col_character(),
## `Postal District` = col_double(),
## `Postal Sector` = col_double(),
## `Postal Code` = col_double(),
## `Planning Region` = col_character(),
## `Planning Area` = col_character()
## )
We will select 5 random rows and certain specific columns to inspect the dataset.
# We will rename the column header for better presentation.
names(realis2018)[6:7] <- c("Unit Price","Unit PSF")
check <- realis2018[sample(nrow(realis2018), 5), c(1,2,6,7,18)]
kable(check, format = "markdown", align='l')
Project Name | Address | Unit Price | Unit PSF | Planning Area |
---|---|---|---|---|
TANGLIN REGENCY | 381 Tanglin Road #09-03 | 1218000 | 15418 | Bukit Merah |
MOUNTBATTEN SUITES | 861 Mountbatten Road #05-04 | 1023000 | 15500 | Marine Parade |
RIVERCOVE RESIDENCES | 32 Anchorvale Lane #06-24 | 1076200 | 9965 | Sengkang |
TIVOLI GRANDE | 128 Koon Seng Road #02-17 | 1002800 | 13551 | Marine Parade |
KENT RIDGE HILL RESIDENCES | 62 South Buona Vista Road #04-37 | 1121000 | 18683 | Queenstown |
Lastly we will load the HDB resales prices dataset.
## Parsed with column specification:
## cols(
## month = col_character(),
## town = col_character(),
## flat_type = col_character(),
## block = col_character(),
## street_name = col_character(),
## storey_range = col_character(),
## floor_area_sqm = col_double(),
## flat_model = col_character(),
## lease_commence_date = col_double(),
## remaining_lease = col_character(),
## resale_price = col_double()
## )
We will select 5 random rows and certain specific columns to inspect the dataset.
check <- HDB2017onwards[sample(nrow(HDB2017onwards), 5),c(1,2,3,7,11)]
kable(check, format = "markdown", align='l')
month | town | flat_type | floor_area_sqm | resale_price |
---|---|---|---|---|
2020-06 | JURONG WEST | 3 ROOM | 68 | 310000 |
2017-08 | SEMBAWANG | 5 ROOM | 123 | 462000 |
2020-06 | BEDOK | 4 ROOM | 92 | 342000 |
2017-05 | ANG MO KIO | 3 ROOM | 75 | 362000 |
2020-06 | SENGKANG | 4 ROOM | 90 | 367000 |
We will first work on the income dataset. The end goal is to attain the median income for each planning area. Due to the nature of the raw dataset, we will only be able to estimate the median income, and not the mean income, for each planning area. As such, we will need to perform some transformation to first compute the cumulative number of resident at the 50th percentile.
pa_inc_prep <- planningarea_income[,c(2,4,5)] %>%
filter(level_1 != "Total") %>%
spread(level_1, value)
pa_inc_prep <- pa_inc_prep[,c(1,16,2:3,7:15,4:6)]%>%
mutate(cum_median_pop = rowSums(.[2:16])/2)
names(pa_inc_prep)[1] <- "planning area"
pa_inc_prep$`planning area` <- toupper(pa_inc_prep$`planning area`)
We will select 5 random rows and certain specific columns to inspect the dataset at this stage. Please note that all columns from $2,000 to $10,999 are omitted for ease of viewing.
check <- pa_inc_prep[sample(nrow(pa_inc_prep), 5), c(1,2,3,4,15,16,17)]
kable(check, format = "markdown", align='l')
planning area | Below $1,000 | $1,000 - $1,499 | $1,500 - $1,999 | $11,000 - $11,999 | $12,000 & Over | cum_median_pop |
---|---|---|---|---|---|---|
SEMBAWANG | 3.2 | 3.6 | 3.0 | 0.7 | 2.3 | 23.60 |
CHOA CHU KANG | 7.3 | 7.8 | 6.3 | 1.3 | 6.5 | 48.70 |
JURONG WEST | 13.4 | 13.8 | 11.6 | 1.8 | 7.7 | 79.25 |
CLEMENTI | 3.8 | 3.6 | 2.4 | 1.0 | 6.1 | 23.10 |
OTHERS | 1.1 | 1.6 | 1.0 | 0.5 | 5.9 | 10.75 |
We have computed the cumulative median population amount for each planning area. For the next step, we will copy this table to excel and perform a manual transformation. This is becuase the following steps to attain the median income for each planning area requires some judgement.
Firstly we will use excel to compute the cumulative number of resident at each income bracket, under column U to AI.
Next we will use the cumulative median income population to derive the median income bracket under column S. For example, the cumulative median income population for Ang Mo Kio is 50.65, hence the median income bracket fall into the bracket of $3,000 to $3,999. This is becuase the cumulative income population for the previous bracket is only up to 43.9.
We will estimate the median income of each planning area by using the mid point of each bracket under column T. For example, Ang Mo Kio is in the bracket of $3,000 to $3,999, hence the median income of Ang Mo Kio will be estimated to be $3,500. There is one planning area, Tanglin, whose median income falls into the bracket of $12,000 and over. For this case, we will assume that the median income is $12,000, however the actual value is not determinable, and will be higher than $12,000.
We will read the median income dataset prepared in excel back into R.
We will only retain the required columns in the dataframe, and select 5 random rows for checking.
pa_inc_prep3 <- pa_inc_prep2[,c(2,20)]
names(pa_inc_prep3)[2] <- "Median Income"
check <- pa_inc_prep3[sample(nrow(pa_inc_prep3), 5), ]
kable(check, format = "markdown", align='l')
planning area | Median Income |
---|---|
SERANGOON | 4500 |
ANG MO KIO | 3500 |
GEYLANG | 3500 |
SENGKANG | 4500 |
TANGLIN | 12000 |
As the REALIS and HDB dataset is currently by individual transactions, we will combined the 2 dataset and perform an aggregate to compute the median unit price and unit price per square foot (PSF) for each planning area. We will use the median in this case as we do not want our metric to be affected by extreme data points (e.g. extremely high prices and PSF). We will also convert the planning area values to uppercase to perform joins with the shape file later. We will select 5 random rows for checking.
First we will prepare the HDB dataset.
HDB_filtered <- HDB2017onwards[substr(HDB2017onwards$month,1,4) == 2018,]
HDB_filtered$PSF <- HDB_filtered$resale_price/(HDB_filtered$floor_area_sqm*10.7639)
HDB_filtered <- HDB_filtered[,c(2,11,12)]
names(HDB_filtered)[1:3] <- c("Planning Area","Price","PSF")
Next we will prepare the REALIS dataset.
REALIS_filtered <- realis2018
REALIS_filtered$Price <- REALIS_filtered$`Unit Price`/REALIS_filtered$`No. of Units`
REALIS_filtered <- REALIS_filtered[,c(18,19,8)]
REALIS_filtered$`Planning Area` <- toupper(REALIS_filtered$`Planning Area`)
names(REALIS_filtered)[1:3] <- c("Planning Area","Price","PSF")
Now we will append both datasets, and perform an aggregation to calculate the median price and PSM for each planning area.
Combined_Housing <- rbind(HDB_filtered,REALIS_filtered)
Combined_Housing2 <- aggregate(Combined_Housing[,c(2,3)],
list(Combined_Housing$`Planning Area`), median)
names(Combined_Housing2)[1:3] <- c("Planning Area","Median Housing Price","Median Housing PSF")
Combined_Housing3 <- count(Combined_Housing, Combined_Housing$`Planning Area`)
names(Combined_Housing3)[1:2] <- c("Planning Area", "No of Transactions")
Combined_Housing4 <- left_join(Combined_Housing2, Combined_Housing3,
by = c("Planning Area" = "Planning Area"))
check <- Combined_Housing4[sample(nrow(Combined_Housing2), 5), ]
kable(check, format = "markdown", align='l')
Planning Area | Median Housing Price | Median Housing PSF | No of Transactions | |
---|---|---|---|---|
2 | BEDOK | 950000 | 954.0000 | 2979 |
30 | ROCHOR | 1350000 | 1620.5000 | 132 |
18 | KALLANG/WHAMPOA | 446250 | 481.5261 | 626 |
40 | WOODLANDS | 371000 | 327.8934 | 1937 |
20 | MARINE PARADE | 1650000 | 1505.0000 | 636 |
First we will plot a horizontal bar chart of the median income of all planning areas to understand the dataset.
ggplot(pa_inc_prep3, aes(x = reorder(`planning area`, `Median Income`), y = `Median Income`))+
geom_bar(stat = "identity", fill = "lightblue") +
coord_flip() +
geom_text(aes(label = scales::comma(`Median Income`)),hjust = 0, size = 2.5) +
labs(y="Median Income ($)", x = "Planning Area") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 13000),labels = comma) +
ggtitle("Median Income of Planning Areas") +
theme(
panel.background = element_blank(),
panel.grid.major.x = element_line(colour = "grey"),
axis.text.y.left = element_text(size = 8),
axis.text.x.bottom = element_text(size = 8),
axis.ticks.y = element_blank(),
axis.line.y = element_line(),
axis.line.x = element_line(),
plot.title = element_text(size = 10, hjust = 0.5, face = 'bold')
)
We notice that a substantial amount, 14 of the planning areas have median income of $3.5k, and the remaining generally fall into between $4.5k to $5.5k. There are only 2 exceptional planning areas, Bukit Timah and Tanglin that have significantly higher median income of $9.5k and $12k respectively. From this graph, we can understand that while there is only a slight distribution of median income across the planning areas, apart from Bukit Timah and Tanglin.
Next we will plot a horizontal bar chart to better understand the property prices.
ggplot(Combined_Housing4, aes(x = reorder(`Planning Area`, `Median Housing Price`),
y = `Median Housing Price`))+
geom_bar(stat = "identity", fill = "lightblue") +
coord_flip() +
geom_text(aes(label = scales::comma(`Median Housing Price`)),hjust = 0, size = 2.5) +
labs(y="Median Housing Price ($)", x = "Planning Area") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 4200000),labels = comma) +
ggtitle("Median Housing Price of Planning Areas") +
theme(
panel.background = element_blank(),
panel.grid.major.x = element_line(colour = "grey"),
axis.text.y.left = element_text(size = 7),
axis.text.x.bottom = element_text(size = 8),
axis.ticks.y = element_blank(),
axis.line.y = element_line(),
axis.line.x = element_line(),
plot.title = element_text(size = 10, hjust = 0.5, face = 'bold')
)
We notice that unlike the median income, there is a spread of median property prices across the planning areas. The highest 2 areas, Orchard and Southern Islands, have median prices of around $3.7m. On the other hand, some of the heartland areas like Yishun and Woodlands have median prices of around $370k. When we perform the map view later, we will get a better understanding on how this translate to the geographical distribution of median prices across Singapore.
We will also take a look at the prices PSF by planning area.
ggplot(Combined_Housing4, aes(x = reorder(`Planning Area`, `Median Housing PSF`),
y = `Median Housing PSF`))+
geom_bar(stat = "identity", fill = "lightblue") +
coord_flip() +
geom_text(aes(label = scales::comma(`Median Housing PSF`)),hjust = 0, size = 2.5) +
labs(y="Median Housing PSF ($)", x = "Planning Area") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 3200),labels = comma) +
ggtitle("Median Housing Price per Square Foot (PSF) of Planning Areas") +
theme(
panel.background = element_blank(),
panel.grid.major.x = element_line(colour = "grey"),
axis.text.y.left = element_text(size = 7),
axis.text.x.bottom = element_text(size = 8),
axis.ticks.y = element_blank(),
axis.line.y = element_line(),
axis.line.x = element_line(),
plot.title = element_text(size = 10, hjust = 0.5, face = 'bold')
)
When we consider the price per square foot (PSF), there is also quite a spread across the planning areas. However we notice some slight difference in ranking. For example, areas like River Valley and Orchard take the top spot at around $2.6k PSF. Southern Islands dropped quite a bit to the 12th position at $1.6k PSF. This suggests that houses in Southern Islands, while still relative expensive (by PSF), are generally larger in area by square foot.
Unsurpringly, the areas with lower PSF still belonged to the heartland areas that are far from the central region in Singapore.
Firstly we will perform the various left_join
from the income and housing prices dataset to the map shape file.
We noted that as there are lesser median income data for the planning areas as compared to the housing median prices, we will need to perform an order
to arrange the non-Null values at the top of the dataframe. If we omit this step, tmap
will display the median income bubbles in the wrong planning area.
map_prep <- map_planningarea[,c(2,1,3:13)]
map_prep <- left_join(map_prep, pa_inc_prep3,
by = c("PLN_AREA_N" = "planning area"))
map_prep2 <- left_join(map_prep, Combined_Housing4,
by = c("PLN_AREA_N" = "Planning Area"))
map_prep3 <- map_prep2[order(map_prep2$`Median Income`),]
Now we will create an interactive map view of median housing price and the median income of each planning area using tmap
. We will edit the popup to indicate more information for each bubble and planning area. We will also set the style to quantile as it will provide a breakdown of 5 quantiles for more intuitive viewing.
## tmap mode set to interactive viewing
tm_shape(map_prep3)+
tm_fill("Median Housing Price", palette = "Blues", style = "quantile",
popup.vars = c("Median Income", "Median Housing Price",
"No of Transactions"))+
tm_bubbles(size = "Median Income", col = "Median Income", palette = "Reds", style = "quantile",
popup.vars = c("Median Income", "Median Housing Price",
"No of Transactions")) +
tm_borders(alpha = 0.5)
## Legend for symbol sizes not available in view mode.
The other metric that is commonly used to compare houses is the price per square foot (PSF). As such, we will also plot a map view of the median housing PSF and the median income of each planning area. We will edit the popup to indicate more information for each bubble and planning area.
## tmap mode set to interactive viewing
tm_shape(map_prep3)+
tm_fill("Median Housing PSF", palette = "Blues",
popup.vars = c("Median Income", "Median Housing PSF",
"No of Transactions"))+
tm_bubbles(size = "Median Income", col = "Median Income", palette = "Reds", style = "quantile",
popup.vars = c("Median Income", "Median Housing PSF",
"No of Transactions")) +
tm_borders(alpha = 0.5)
## Legend for symbol sizes not available in view mode.
In this report, we will study the map view of the median income in 2015, against the median housing prices and PSF in 2018, across the various planning areas. The difference in period is due to the limitation of available dataset, nonethless it can provide good insights. Please note that they are no median income and property prices for certain planning areas due to the limitation of the dataset.
Interactive Map View of Median Housing Prices (2018) and Median Income (2015) by Planning Area
Note 1: Click on planning area to view more information.
Note 2: The actual median income for Tanglin is higher than $12k, however it is not determinable and we will assume it to be $12k for this visualisation.
tmap